* [PATCH 00 of 66] Transparent Hugepage Support #32
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

Some of the relevant users of the project:

KVM Virtualization
GCC (kernel build included; requires a few-line patch to enable)
JVM
VMware Workstation
HPC

It would be great if it could go in -mm.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=blob;f=Documentation/vm/transhuge.txt
http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog

first: git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
or first: git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
later: git fetch; git checkout -f origin/master

The tree is rebased, so git pull won't work.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.37-rc1/transparent_hugepage-32/
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.37-rc1/transparent_hugepage-32.gz

Diff #31 -> #32:

 b/clear_copy_huge_page             |   59 ++++++++--------

hugetlb.c's copy_huge_page was renamed to copy_user_huge_page.

 b/kvm_transparent_hugepage         |   38 +++++-----

Adjust for the hva_to_pfn interface change.

 b/lumpy-compaction                 |   48 +++++++++++++

Disable lumpy reclaim when CONFIG_COMPACTION is enabled. Agreed with Mel at
Kernel Summit. Mel wants to defer the full removal of lumpy reclaim by a few
releases. Disabling lumpy reclaim is needed to prevent THP from rendering the
system totally unusable when reclaim starts.

 b/pmd_trans_huge_migrate           |   31 +++-----

The migrate code can now migrate hugetlb pages. This had a chance to break with
THP, but it seems all the "magic hugetlbfs code paths" are activated only when
the "destination" page is huge. That never happens with THP, which would in
fact split the page, making the source not huge either. So it seems the current
code can coexist with THP without further changes.

This update also fixes a false positive BUG_ON in remove_migration_pte that
could materialize after handling the CPU erratum where the CPU doesn't like
simultaneous 4k and 2M TLB entries. To implement the workaround without
increasing the TLB flush cost of split_huge_page I had to set the huge pmd as
non-present during the TLB flush (opening a micro-window for a false
positive in the BUG_ON check). The BUG_ON can now simply be removed, which in
turn solves the false positive.

 remove-lumpy_reclaim               |  131 -------------------------------------

Lumpy reclaim is no longer removed; instead it gets disabled at runtime by
enabling CONFIG_COMPACTION=y at compile time (and setting
CONFIG_TRANSPARENT_HUGEPAGE=y implicitly selects CONFIG_COMPACTION=y, of
course).

Full diffstat:

 Documentation/vm/transhuge.txt        |  283 ++++
 arch/alpha/include/asm/mman.h         |    2 
 arch/mips/include/asm/mman.h          |    2 
 arch/parisc/include/asm/mman.h        |    2 
 arch/powerpc/mm/gup.c                 |   12 
 arch/x86/include/asm/kvm_host.h       |    1 
 arch/x86/include/asm/paravirt.h       |   23 
 arch/x86/include/asm/paravirt_types.h |    6 
 arch/x86/include/asm/pgtable-2level.h |    9 
 arch/x86/include/asm/pgtable-3level.h |   23 
 arch/x86/include/asm/pgtable.h        |  149 ++
 arch/x86/include/asm/pgtable_64.h     |   28 
 arch/x86/include/asm/pgtable_types.h  |    3 
 arch/x86/kernel/paravirt.c            |    3 
 arch/x86/kernel/tboot.c               |    2 
 arch/x86/kernel/vm86_32.c             |    1 
 arch/x86/kvm/mmu.c                    |   60 
 arch/x86/kvm/paging_tmpl.h            |    4 
 arch/x86/mm/gup.c                     |   28 
 arch/x86/mm/pgtable.c                 |   66 
 arch/xtensa/include/asm/mman.h        |    2 
 drivers/base/node.c                   |   21 
 fs/Kconfig                            |    2 
 fs/exec.c                             |   44 
 fs/proc/meminfo.c                     |   14 
 fs/proc/page.c                        |   14 
 include/asm-generic/mman-common.h     |    2 
 include/asm-generic/pgtable.h         |  130 +
 include/linux/compaction.h            |   13 
 include/linux/gfp.h                   |   14 
 include/linux/huge_mm.h               |  170 ++
 include/linux/khugepaged.h            |   66 
 include/linux/kvm_host.h              |    4 
 include/linux/memory_hotplug.h        |   14 
 include/linux/mm.h                    |  114 +
 include/linux/mm_inline.h             |   19 
 include/linux/mm_types.h              |    3 
 include/linux/mmu_notifier.h          |   66 
 include/linux/mmzone.h                |    1 
 include/linux/page-flags.h            |   36 
 include/linux/sched.h                 |    1 
 include/linux/swap.h                  |    2 
 kernel/fork.c                         |   12 
 kernel/futex.c                        |   55 
 mm/Kconfig                            |   38 
 mm/Makefile                           |    1 
 mm/compaction.c                       |   48 
 mm/huge_memory.c                      | 2290 ++++++++++++++++++++++++++++++++++
 mm/hugetlb.c                          |   69 -
 mm/ksm.c                              |   52 
 mm/madvise.c                          |    8 
 mm/memcontrol.c                       |  138 +-
 mm/memory-failure.c                   |    2 
 mm/memory.c                           |  199 ++
 mm/memory_hotplug.c                   |   14 
 mm/mempolicy.c                        |   14 
 mm/migrate.c                          |    7 
 mm/mincore.c                          |    7 
 mm/mmap.c                             |    7 
 mm/mmu_notifier.c                     |   20 
 mm/mprotect.c                         |   20 
 mm/mremap.c                           |    9 
 mm/page_alloc.c                       |   31 
 mm/pagewalk.c                         |    1 
 mm/rmap.c                             |  115 -
 mm/sparse.c                           |    4 
 mm/swap.c                             |  117 +
 mm/swap_state.c                       |    6 
 mm/swapfile.c                         |    2 
 mm/vmscan.c                           |   43 
 mm/vmstat.c                           |    1 
 virt/kvm/iommu.c                      |    2 
 virt/kvm/kvm_main.c                   |   56 
 73 files changed, 4485 insertions(+), 362 deletions(-)

FAQ:

Q: When will 1G pages be supported? (by far the most frequently asked question
   in the last two days)
A: Not any time soon, but it's not entirely impossible... The benefit of going
   from 2M to 1G is likely much lower than the benefit of going from 4k to 2M,
   so it's unlikely to be a worthwhile effort for a while.

Q: When will this work on file-backed pages? (pagecache/swapcache/tmpfs)
A: Not until it's merged in mainline. It's already feature complete for many
   usages and the moment we expand into pagecache the patch would grow
   significantly.

Q: When will KSM scan inside Transparent Hugepages?
A: Working on that; it should materialize soon enough.

Q: What is the next place from which to remove split_huge_page_pmd()?
A: mremap. The JVM uses mremap in the garbage collector, so the ~18% boost (no
   virt) has further margin for optimization.
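
For reference, the "graceful fallback" that split_huge_page_pmd() provides
(see Documentation/vm/transhuge.txt later in this series) looks roughly like
the sketch below in a pagetable walker that is not hugepage aware. This is
an illustrative sketch, not code from the patchset, and walk_one_pmd() is a
made-up example function:

#include <linux/mm.h>
#include <linux/huge_mm.h>

/* sketch: a walker unaware of huge pmds just splits them and carries on */
static void walk_one_pmd(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);
	pud_t *pud;
	pmd_t *pmd;

	if (pgd_none_or_clear_bad(pgd))
		return;
	pud = pud_offset(pgd, addr);
	if (pud_none_or_clear_bad(pud))
		return;
	pmd = pmd_offset(pud, addr);
	/* the one-liner fallback: any trans huge pmd is split to regular ptes */
	split_huge_page_pmd(mm, pmd);
	if (pmd_none_or_clear_bad(pmd))
		return;
	/* ... continue with the regular pte walk ... */
}

Removing that split (and handling the huge pmd natively, as mremap eventually
should) is where the further optimization margin mentioned above comes from.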

* [PATCH 01 of 66] disable lumpy when compaction is enabled
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Compaction is more reliable than lumpy, and lumpy makes the system unusable
when it runs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -274,6 +274,7 @@ unsigned long shrink_slab(unsigned long 
 static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
 				   bool sync)
 {
+#ifndef CONFIG_COMPACTION
 	enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
 
 	/*
@@ -294,11 +295,14 @@ static void set_lumpy_reclaim_mode(int p
 		sc->lumpy_reclaim_mode = mode;
 	else
 		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+#endif
 }
 
 static void disable_lumpy_reclaim_mode(struct scan_control *sc)
 {
+#ifndef CONFIG_COMPACTION
 	sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+#endif
 }
 
 static inline int is_page_cache_freeable(struct page *page)
@@ -321,9 +325,11 @@ static int may_write_to_queue(struct bac
 	if (bdi == current->backing_dev_info)
 		return 1;
 
+#ifndef CONFIG_COMPACTION
 	/* lumpy reclaim for hugepage often need a lot of write */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		return 1;
+#endif
 	return 0;
 }
 

* [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Page migration requires rmap to be able to find all migration ptes
created by migration. If the second rmap_walk clearing migration PTEs
misses an entry, it is left dangling, causing a BUG_ON to trigger during a
fault. For example:

[  511.201534] kernel BUG at include/linux/swapops.h:105!
[  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
[  511.201534] last sysfs file: /sys/block/sde/size
[  511.201534] CPU 0
[  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
[  511.888526]
[  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
[  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  512.173545] RSP: 0018:ffff880037b979d8  EFLAGS: 00010246
[  512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
[  512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
[  512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
[  512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
[  512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
[  512.380001] FS:  00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
[  512.380001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
[  512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
[  513.068415] Stack:
[  513.068415]  ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
[  513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
[  513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
[  513.228068] Call Trace:
[  513.228068]  [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
[  513.228068]  [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
[  513.228068]  [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
[  513.720755]  [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  513.789278]  [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[  513.851506]  [<ffffffff813099b5>] page_fault+0x25/0x30
[  513.851506]  [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
[  513.851506]  [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
[  513.851506]  [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
[  513.851506]  [<ffffffff8111329b>] search_binary_handler+0x173/0x313
[  513.851506]  [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
[  513.851506]  [<ffffffff81114896>] do_execve+0x219/0x30a
[  513.851506]  [<ffffffff8111887f>] ? getname+0x14d/0x1b3
[  513.851506]  [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
[  514.483501]  [<ffffffff8100320a>] stub_execve+0x6a/0xc0
[  514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
[  514.704292] RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
[  514.808221]  RSP <ffff880037b979d8>
[  514.906179] ---[ end trace 4f88495edc224d6b ]---

This particular BUG_ON is caused by a race between shift_arg_pages and
migration. During exec, a temporary stack is created and later moved to its
final location. If migration selects a page within the temporary stack,
the page tables and migration PTE can be copied to the new location
before rmap_walk is able to find the copy. This leaves a dangling
migration PTE behind that later triggers the bug.

This patch fixes the problem by using two VMAs - one which covers the temporary
stack and the other which covers the new location. This guarantees that rmap
can always find the migration PTE even if it is copied while rmap_walk is
taking place.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
 #include <linux/fs_struct.h>
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
+#include <linux/rmap.h>
 
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -248,10 +249,9 @@ static int __bprm_mm_init(struct linux_b
 	 * use STACK_TOP because that can depend on attributes which aren't
 	 * configured yet.
 	 */
-	BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
 	vma->vm_end = STACK_TOP_MAX;
 	vma->vm_start = vma->vm_end - PAGE_SIZE;
-	vma->vm_flags = VM_STACK_FLAGS | VM_STACK_INCOMPLETE_SETUP;
+	vma->vm_flags = VM_STACK_FLAGS;
 	vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	err = insert_vm_struct(mm, vma);
@@ -520,7 +520,9 @@ static int shift_arg_pages(struct vm_are
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	unsigned long moved_length;
 	struct mmu_gather *tlb;
+	struct vm_area_struct *tmp_vma;
 
 	BUG_ON(new_start > new_end);
 
@@ -532,17 +534,43 @@ static int shift_arg_pages(struct vm_are
 		return -EFAULT;
 
 	/*
+	 * We need to create a fake temporary vma and index it in the
+	 * anon_vma list in order to allow the pages to be reachable
+	 * at all times by the rmap walk for migrate, while
+	 * move_page_tables() is running.
+	 */
+	tmp_vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	if (!tmp_vma)
+		return -ENOMEM;
+	*tmp_vma = *vma;
+	INIT_LIST_HEAD(&tmp_vma->anon_vma_chain);
+	/*
 	 * cover the whole range: [new_start, old_end)
 	 */
-	if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+	tmp_vma->vm_start = new_start;
+	/*
+	 * The tmp_vma destination of the copy (with the new vm_start)
+	 * has to be at the end of the anon_vma list for the rmap_walk
+	 * to find the moved pages at all times.
+	 */
+	if (unlikely(anon_vma_clone(tmp_vma, vma))) {
+		kmem_cache_free(vm_area_cachep, tmp_vma);
 		return -ENOMEM;
+	}
 
 	/*
 	 * move the page tables downwards, on failure we rely on
 	 * process cleanup to remove whatever mess we made.
 	 */
-	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length))
+	moved_length = move_page_tables(vma, old_start,
+					vma, new_start, length);
+
+	vma->vm_start = new_start;
+	/* rmap walk will already find all pages using the new_start */
+	unlink_anon_vmas(tmp_vma);
+	kmem_cache_free(vm_area_cachep, tmp_vma);
+
+	if (length != moved_length) 
 		return -ENOMEM;
 
 	lru_add_drain();
@@ -568,7 +596,7 @@ static int shift_arg_pages(struct vm_are
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
 	 */
-	vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
+	vma->vm_end = new_end;
 
 	return 0;
 }
@@ -638,7 +666,6 @@ int setup_arg_pages(struct linux_binprm 
 	else if (executable_stack == EXSTACK_DISABLE_X)
 		vm_flags &= ~VM_EXEC;
 	vm_flags |= mm->def_flags;
-	vm_flags |= VM_STACK_INCOMPLETE_SETUP;
 
 	ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
 			vm_flags);
@@ -653,9 +680,6 @@ int setup_arg_pages(struct linux_binprm 
 			goto out_unlock;
 	}
 
-	/* mprotect_fixup is overkill to remove the temporary stack flags */
-	vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
-
 	stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
 	stack_size = vma->vm_end - vma->vm_start;
 	/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -111,9 +111,6 @@ extern unsigned int kobjsize(const void 
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
 
-/* Bits set in the VMA until the stack is in its final location */
-#define VM_STACK_INCOMPLETE_SETUP	(VM_RAND_READ | VM_SEQ_READ)
-
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
 #endif
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1202,20 +1202,6 @@ static int try_to_unmap_cluster(unsigned
 	return ret;
 }
 
-static bool is_vma_temporary_stack(struct vm_area_struct *vma)
-{
-	int maybe_stack = vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP);
-
-	if (!maybe_stack)
-		return false;
-
-	if ((vma->vm_flags & VM_STACK_INCOMPLETE_SETUP) ==
-						VM_STACK_INCOMPLETE_SETUP)
-		return true;
-
-	return false;
-}
-
 /**
  * try_to_unmap_anon - unmap or unlock anonymous page using the object-based
  * rmap method
@@ -1244,21 +1230,7 @@ static int try_to_unmap_anon(struct page
 
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
 		struct vm_area_struct *vma = avc->vma;
-		unsigned long address;
-
-		/*
-		 * During exec, a temporary VMA is setup and later moved.
-		 * The VMA is moved under the anon_vma lock but not the
-		 * page tables leading to a race where migration cannot
-		 * find the migration ptes. Rather than increasing the
-		 * locking requirements of exec(), migration skips
-		 * temporary VMAs until after exec() completes.
-		 */
-		if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
-				is_vma_temporary_stack(vma))
-			continue;
-
-		address = vma_address(page, vma);
+		unsigned long address = vma_address(page, vma);
 		if (address == -EFAULT)
 			continue;
 		ret = try_to_unmap_one(page, vma, address, flags);

* [PATCH 03 of 66] transparent hugepage support documentation
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Documentation/vm/transhuge.txt

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,283 @@
+= Transparent Hugepage Support =
+
+== Objective ==
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent Hugepage Support is an alternative to
+libhugetlbfs that offers the same features as libhugetlbfs but without
+the shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc etc..).
+
+In the future it can expand over the pagecache layer starting with
+tmpfs to reduce even further the hugetlbfs usages.
+
+The reason applications are running faster is because of two
+factors. The first factor is almost completely irrelevant and it's not
+of significant interest because it'll also have the downside of
+requiring larger clear-page copy-page in page faults which is a
+potentially negative effect. The first factor consists in taking a
+single page fault for each 2M virtual region touched by userland (so
+reducing the enter/exit kernel frequency by a 512 times factor). This
+only matters the first time the memory is accessed for the lifetime of
+a memory mapping. The second long lasting and much more important
+factor will affect all subsequent accesses to the memory for the whole
+runtime of the application. The second factor consists of two
+components: 1) the TLB miss will run faster (especially with
+virtualization using nested pagetables but also on bare metal without
+virtualization) and 2) a single TLB entry will be mapping a much
+larger amount of virtual memory in turn reducing the number of TLB
+misses. With virtualization and nested pagetables, TLB entries of
+larger size can be used only if both KVM and the Linux guest are using
+hugepages but a significant speedup already happens if only one of the
+two is using hugepages just because of the fact the TLB miss is going
+to run faster.
+
+== Design ==
+
+- "graceful fallback": mm components which don't have transparent
+  hugepage knowledge fall back to breaking a transparent hugepage and
+  working on the regular pages and their respective regular pmd/pte
+  mappings
+
+- if an hugepage allocation fails because of memory fragmentation,
+  regular pages should be gracefully allocated instead and mixed in
+  the same vma without any failure or significant delay and generally
+  without userland noticing
+
+- if some task quits and more hugepages become available (either
+  immediately in the buddy or through the VM), guest physical memory
+  backed by regular pages should be relocated on hugepages
+  automatically (with khugepaged)
+
+- it doesn't require boot-time memory reservation and in turn it uses
+  hugepages whenever possible (the only possible reservation here is
+  kernelcore= to avoid unmovable pages to fragment all the memory but
+  such a tweak is not specific to transparent hugepage support and
+  it's a generic feature that applies to all dynamic high order
+  allocations in the kernel)
+
+- this initial support only offers the feature in the anonymous memory
+  regions but it'd be ideal to move it to tmpfs and the pagecache
+  later
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+if compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or other movable (or even unmovable)
+entities. It doesn't require reservation to prevent hugepage
+allocation failures from being noticeable from userland. It allows paging
+and all other advanced VM features to be available on the
+hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications however can be further optimized to take advantage of
+this feature, like for example they've been optimized before to avoid
+a flood of mmap system calls for every malloc(4k). Optimizing userland
+is by far not mandatory and khugepaged already can take care of long
+lived page allocations even for hugepage unaware applications that
+deal with large amounts of memory.
+
+In certain cases when hugepages are enabled system wide, application
+may end up allocating more memory resources. An application may mmap a
+large region but only touch 1 byte of it; in that case a 2M page might
+be allocated instead of a 4k page for no good reason. This is why it's
+possible to disable hugepages system-wide and to only have them inside
+MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions
+to eliminate any risk of wasting any precious byte of memory and to
+only run faster.
+
+Applications that get a lot of benefit from hugepages and that don't
+risk losing memory by using hugepages should use
+madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+
+== sysfs ==
+
+Transparent Hugepage Support can be entirely disabled (mostly for
+debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
+avoid the risk of consuming more memory resources) or enabled system
+wide. This can be achieved with one of:
+
+echo always >/sys/kernel/mm/transparent_hugepage/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit defrag efforts in the VM to generate
+hugepages in case they're not immediately free to madvise regions or
+to never try to defrag memory and simply fallback to regular pages
+unless hugepages are immediately available. Clearly if we spend CPU
+time to defrag memory, we would expect to gain even more by the fact
+we use hugepages later instead of regular pages. This isn't always
+guaranteed, but it may be more likely in case the allocation is for a
+MADV_HUGEPAGE region.
+
+echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+khugepaged will be automatically started when
+transparent_hugepage/enabled is set to "always" or "madvise", and it'll
+be automatically shut down if it's set to "never".
+
+khugepaged usually runs at low frequency, so while one may not want to
+invoke defrag algorithms synchronously during the page faults, it
+should be worth invoking defrag at least in khugepaged. However it's
+also possible to disable defrag in khugepaged:
+
+echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass (you
+can set this to 0 to run khugepaged at 100% utilization of one core):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged if there's an hugepage
+allocation failure to throttle the next allocation attempt.
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+for each pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+== Boot parameter ==
+
+You can change the sysfs boot time defaults of Transparent Hugepage
+Support by passing the parameter "transparent_hugepage=always" or
+"transparent_hugepage=madvise" or "transparent_hugepage=never"
+(without "") to the kernel command line.
+
+== Need of restart ==
+
+The transparent_hugepage/enabled values only affect future
+behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to
+the regions registered in khugepaged.
+
+== get_user_pages and follow_page ==
+
+get_user_pages and follow_page if run on a hugepage, will return the
+head or tail pages as usual (exactly as they would do on
+hugetlbfs). Most gup users will only care about the actual physical
+address of the page and its temporary pinning to release after the I/O
+is complete, so they won't ever notice the fact the page is huge. But
+if any driver is going to mangle over the page structure of the tail
+page (like for checking page->mapping or other bits that are relevant
+for the head page and not the tail page), it should be updated to jump
+to check head page instead (while serializing properly against
+split_huge_page() to avoid the head and tail pages to disappear from
+under it, see the futex code to see an example of that, hugetlbfs also
+needed special handling in futex code for similar reasons).
+
+NOTE: these aren't new constraints to the GUP API, and they match the
+same constraints that apply to hugetlbfs too, so any driver capable
+of handling GUP on hugetlbfs will also work fine on transparent
+hugepage backed mappings.
+
+In case you can't handle compound pages if they're returned by
+follow_page, the FOLL_SPLIT bit can be specified as parameter to
+follow_page, so that it will split the hugepages before returning
+them. Migration for example passes FOLL_SPLIT as parameter to
+follow_page because it's not hugepage aware and in fact it can't work
+at all on hugetlbfs (but it instead works fine on transparent
+hugepages thanks to FOLL_SPLIT). migration simply can't deal with
+hugepages being returned (as it's not only checking the pfn of the
+page and pinning it during the copy but it pretends to migrate the
+memory in regular page sizes and with regular pte/pmd mappings).
+
+== Optimizing the applications ==
+
+To be guaranteed that the kernel will map a 2M page immediately in any
+memory region, the mmap region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee.
+
+== Hugetlbfs ==
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine as always. No difference can be noted in
+hugetlbfs other than there will be less overall fragmentation. All
+usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
+
+== Graceful fallback ==
+
+Code walking pagetables but unaware of huge pmds can simply call
+split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+pmd_offset. It's trivial to make the code transparent hugepage aware
+by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+missing after pmd_offset returns the pmd. Thanks to the graceful
+fallback design, with a one-liner change you can avoid writing
+hundreds if not thousands of lines of complex code to make your code
+hugepage aware.
+
+If you're not walking pagetables but you run into a physical hugepage
+but you can't handle it natively in your code, you can split it by
+calling split_huge_page(page). This is what the Linux VM does before
+it tries to swapout the hugepage for example.
+
+== Locking in hugepage aware code ==
+
+We want as much code as possible hugepage aware, as calling
+split_huge_page() or split_huge_page_pmd() has a cost.
+
+To make pagetable walks huge pmd aware, all you need to do is to call
+pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
+mmap_sem in read (or write) mode to be sure an huge pmd cannot be
+created from under you by khugepaged (khugepaged collapse_huge_page
+takes the mmap_sem in write mode in addition to the anon_vma lock). If
+pmd_trans_huge returns false, you just fallback in the old code
+paths. If instead pmd_trans_huge returns true, you have to take the
+mm->page_table_lock and re-run pmd_trans_huge. Taking the
+page_table_lock will prevent the huge pmd to be converted into a
+regular pmd from under you (split_huge_page can run in parallel to the
+pagetable walk). If the second pmd_trans_huge returns false, you
+should just drop the page_table_lock and fallback to the old code as
+before. Otherwise you should run pmd_trans_splitting on the pmd. In
+case pmd_trans_splitting returns true, it means split_huge_page is
+already in the middle of splitting the page. So if pmd_trans_splitting
+returns true it's enough to drop the page_table_lock and call
+wait_split_huge_page and then fallback the old code paths. You are
+guaranteed by the time wait_split_huge_page returns, the pmd isn't
+huge anymore. If pmd_trans_splitting returns false, you can proceed to
+process the huge pmd and the hugepage natively. Once finished you can
+drop the page_table_lock.
+
+== compound_lock, get_user_pages and put_page ==
+
+split_huge_page internally has to distribute the refcounts in the head
+page to the tail pages before clearing all PG_head/tail bits from the
+page structures. It can do that easily for refcounts taken by huge pmd
+mappings. But the GUP API as created by hugetlbfs (that returns head
+and tail pages if running get_user_pages on an address backed by any
+hugepage), requires the refcount to be accounted on the tail pages and
+not only in the head pages, if we want to be able to run
+split_huge_page while there are gup pins established on any tail
+page. Failure to be able to run split_huge_page if there's any gup pin
+on any tail page, would mean having to split all hugepages upfront in
+get_user_pages which is unacceptable as too many gup users are
+performance critical and they must work natively on hugepages like
+they work natively on hugetlbfs already (hugetlbfs is simpler because
+hugetlbfs pages cannot be split so there wouldn't be a requirement of
+accounting the pins on the tail pages for hugetlbfs). If we didn't
+account the gup refcounts on the tail pages during gup, we wouldn't know
+anymore which tail page is pinned by gup and which is not while we run
+split_huge_page. But we still have to add the gup pin to the head page
+too, to know when we can free the compound page in case it's never
+split during its lifetime. That requires changing not just
+get_page, but put_page as well so that when put_page runs on a tail
+page (and only on a tail page) it will find its respective head page,
+and then it will decrease the head page refcount in addition to the
+tail page refcount. To obtain a head page reliably and to decrease its
+refcount without race conditions, put_page has to serialize against
+__split_huge_page_refcount using a special per-page lock called
+compound_lock.
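
As a concrete illustration of the "Locking in hugepage aware code" rules in
the documentation above, a hugepage aware pmd check could be structured like
the sketch below. This is not code from the patchset; mm, vma and pmd are
assumed to come from an mmap_sem-protected pmd_offset() walk, and the helper
names are the ones the text introduces:

	if (pmd_trans_huge(*pmd)) {
		spin_lock(&mm->page_table_lock);
		if (likely(pmd_trans_huge(*pmd))) {
			if (unlikely(pmd_trans_splitting(*pmd))) {
				/* split_huge_page() is underway: wait for
				 * it, then use the regular pte paths */
				spin_unlock(&mm->page_table_lock);
				wait_split_huge_page(vma->anon_vma, pmd);
			} else {
				/* stable huge pmd: handle it natively
				 * here, then drop the lock */
				spin_unlock(&mm->page_table_lock);
				return;
			}
		} else {
			/* it was split from under us: fall back */
			spin_unlock(&mm->page_table_lock);
		}
	}
	/* regular pte walk for the non-huge (or just split) case */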

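Similarly, the FOLL_SPLIT behaviour described in the "get_user_pages and
follow_page" section can be used along these lines by a caller that cannot
handle compound pages. Again this is an illustrative sketch, not patchset
code; the surrounding loop, locking and error handling are assumed:

	struct page *page;

	/* ask follow_page() to split any transparent hugepage first */
	page = follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
	if (IS_ERR_OR_NULL(page))
		return -EFAULT;
	/* the page returned here is a regular, pinned page */
	put_page(page);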

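Finally, on the userland side, the "Optimizing the applications" and
MADV_HUGEPAGE notes above translate to roughly the following pattern. This
is a sketch only; the MADV_HUGEPAGE define is needed just while libc headers
don't carry it yet, and the value 14 is assumed to match asm-generic:

#include <stdlib.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed asm-generic mman-common.h value */
#endif

int main(void)
{
	size_t len = 256UL * 1024 * 1024;	/* large, long lived working set */
	void *buf;

	/* 2M-aligned so the kernel can map hugepages immediately */
	if (posix_memalign(&buf, 2UL * 1024 * 1024, len))
		return 1;
	/* per-region opt in: works even with enabled=madvise */
	madvise(buf, len, MADV_HUGEPAGE);

	/* ... use buf as usual ... */
	return 0;
}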
* [PATCH 03 of 66] transparent hugepage support documentation
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Documentation/vm/transhuge.txt

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,283 @@
+= Transparent Hugepage Support =
+
+== Objective ==
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent Hugepage Support is an alternative to
+libhugetlbfs that offers the same features as libhugetlbfs but without
+the shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc, etc.).
+
+In the future it can expand over the pagecache layer, starting with
+tmpfs, to reduce hugetlbfs usage even further.
+
+Applications run faster because of two factors. The first factor is
+almost completely irrelevant and not of significant interest, because
+it also has the downside of requiring larger clear-page and copy-page
+operations in page faults, which is a potentially negative effect: it
+consists of taking a single page fault for each 2M virtual region
+touched by userland (reducing the enter/exit kernel frequency by a
+factor of 512). This only matters the first time the memory is
+accessed for the lifetime of a memory mapping. The second, long
+lasting and much more important factor affects all subsequent accesses
+to the memory for the whole runtime of the application. It consists of
+two components: 1) the TLB miss will run faster (especially with
+virtualization using nested pagetables but also on bare metal without
+virtualization) and 2) a single TLB entry will map a much larger
+amount of virtual memory, in turn reducing the number of TLB
+misses. With virtualization and nested pagetables the TLB can use the
+larger entry size only if both KVM and the Linux guest are using
+hugepages, but a significant speedup already happens if only one of
+the two is using hugepages, just because the TLB miss is going to run
+faster.
+
+== Design ==
+
+- "graceful fallback": mm components which don't have transparent
+  hugepage knownledge fall back to breaking a transparent hugepage and
+  working on the regular pages and their respective regular pmd/pte
+  mappings
+
+- if a hugepage allocation fails because of memory fragmentation,
+  regular pages should be gracefully allocated instead and mixed in
+  the same vma without any failure or significant delay and generally
+  without userland noticing
+
+- if some task quits and more hugepages become available (either
+  immediately in the buddy or through the VM), guest physical memory
+  backed by regular pages should be relocated on hugepages
+  automatically (with khugepaged)
+
+- it doesn't require boot-time memory reservation and in turn it uses
+  hugepages whenever possible (the only possible reservation here is
+  kernelcore= to prevent unmovable pages from fragmenting all the
+  memory, but such a tweak is not specific to transparent hugepage
+  support and it's a generic feature that applies to all dynamic high
+  order allocations in the kernel)
+
+- this initial support only offers the feature in the anonymous memory
+  regions but it'd be ideal to move it to tmpfs and the pagecache
+  later
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or other movable (or even unmovable)
+entities. It doesn't require reservation to prevent hugepage
+allocation failures from being noticeable from userland. It allows
+paging and all other advanced VM features to be available on the
+hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications can however be further optimized to take advantage of
+this feature, much as they've been optimized before to avoid a flood
+of mmap system calls for every malloc(4k). Optimizing userland is by
+far not mandatory and khugepaged can already take care of long lived
+page allocations even for hugepage unaware applications that deal
+with large amounts of memory.
+
+In certain cases, when hugepages are enabled system wide, applications
+may end up allocating more memory resources. An application may mmap a
+large region but only touch 1 byte of it; in that case a 2M page might
+be allocated instead of a 4k page for no good reason. This is why it's
+possible to disable hugepages system-wide and to only have them inside
+MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions
+to eliminate any risk of wasting any precious byte of memory and to
+only run faster.
+
+Applications that get a lot of benefit from hugepages and that don't
+risk losing memory by using hugepages should use
+madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+
+== sysfs ==
+
+Transparent Hugepage Support can be entirely disabled (mostly for
+debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
+avoid the risk of consuming more memory resources) or enabled system
+wide. This can be achieved with one of:
+
+echo always >/sys/kernel/mm/transparent_hugepage/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit the VM's defrag effort to generate
+hugepages (in case they're not immediately free) to madvise regions
+only, or to never try to defrag memory and simply fall back to regular
+pages unless hugepages are immediately available. Clearly if we spend
+CPU time to defrag memory, we would expect to gain even more from the
+fact that we use hugepages later instead of regular pages. This isn't
+always guaranteed, but it may be more likely in case the allocation is
+for a MADV_HUGEPAGE region.
+
+echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+khugepaged will be automatically started when
+transparent_hugepage/enabled is set to "always" or "madvise", and it
+will be automatically shut down if it's set to "never".
+
+khugepaged usually runs at low frequency, so while one may not want to
+invoke defrag algorithms synchronously during page faults, it should
+be worth invoking defrag at least in khugepaged. However it's also
+possible to disable defrag in khugepaged:
+
+echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass (you
+can set this to 0 to run khugepaged at 100% utilization of one core):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged after a hugepage
+allocation failure, to throttle the next allocation attempt:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+and in the number of full scans performed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+== Boot parameter ==
+
+You can change the sysfs boot time defaults of Transparent Hugepage
+Support by passing the parameter "transparent_hugepage=always" or
+"transparent_hugepage=madvise" or "transparent_hugepage=never"
+(without "") to the kernel command line.
+
+== Need of restart ==
+
+The transparent_hugepage/enabled values only affect future
+behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to
+the regions registered in khugepaged.
+
+== get_user_pages and follow_page ==
+
+get_user_pages and follow_page, if run on a hugepage, will return the
+head or tail pages as usual (exactly as they would do on
+hugetlbfs). Most gup users will only care about the actual physical
+address of the page and its temporary pinning to release after the I/O
+is complete, so they won't ever notice that the page is huge. But if
+any driver is going to poke at the page structure of a tail page (like
+checking page->mapping or other bits that are relevant for the head
+page and not the tail page), it should be updated to check the head
+page instead (while serializing properly against split_huge_page() to
+keep the head and tail pages from disappearing from under it; see the
+futex code for an example of that, hugetlbfs also needed special
+handling in the futex code for similar reasons).
+
+NOTE: these aren't new constraints on the GUP API, and they match the
+same constraints that apply to hugetlbfs too, so any driver capable
+of handling GUP on hugetlbfs will also work fine on transparent
+hugepage backed mappings.
+
+In case you can't handle compound pages if they're returned by
+follow_page, the FOLL_SPLIT bit can be specified as a parameter to
+follow_page, so that it will split the hugepages before returning
+them. Migration for example passes FOLL_SPLIT as a parameter to
+follow_page because it's not hugepage aware and in fact it can't work
+at all on hugetlbfs (but it works fine on transparent hugepages thanks
+to FOLL_SPLIT). Migration simply can't deal with hugepages being
+returned (it's not only checking the pfn of the page and pinning it
+during the copy: it expects to migrate the memory in regular page
+sizes and with regular pte/pmd mappings).
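+
+As a rough illustration only (the helper below is not part of this
+patchset; its name and the missing error handling are just for the
+sake of the example), a driver that can't handle compound pages could
+do something like:
+
+	/*
+	 * Sketch: return a pinned, non-compound page mapped at addr.
+	 * The caller is assumed to hold mmap_sem for read.
+	 */
+	static struct page *get_small_page(struct vm_area_struct *vma,
+					   unsigned long addr)
+	{
+		/* FOLL_GET pins the page, FOLL_SPLIT splits any THP first */
+		return follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
+	}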
+
+== Optimizing the applications ==
+
+To guarantee that the kernel will map a 2M page immediately in any
+memory region, the mmap region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee.
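+
+For example (userland sketch, error checking trimmed; MADV_HUGEPAGE is
+only usable once the system headers export the value defined by this
+patchset):
+
+	#include <stdlib.h>
+	#include <sys/mman.h>
+
+	#define HPAGE_SIZE	(2UL * 1024 * 1024)
+
+	void *alloc_hugepage_backed(size_t len)
+	{
+		void *buf;
+
+		if (posix_memalign(&buf, HPAGE_SIZE, len))
+			return NULL;
+		/* hint that this 2M-aligned region is worth hugepages */
+		madvise(buf, len, MADV_HUGEPAGE);
+		return buf;
+	}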
+
+== Hugetlbfs ==
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine as always. No difference can be noted in
+hugetlbfs other than there will be less overall fragmentation. All
+usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
+
+== Graceful fallback ==
+
+Code walking pagetables but unaware of huge pmds can simply call
+split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+pmd_offset. It's trivial to make the code transparent hugepage aware
+by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+missing after pmd_offset returns the pmd. Thanks to the graceful
+fallback design, with a one liner change you can avoid writing
+hundreds if not thousands of lines of complex code to make your code
+hugepage aware.
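+
+A minimal sketch of such a walker (the pgd/pud presence checks and the
+usual pte locking are left out for brevity; this helper is not part of
+the patchset):
+
+	static pte_t *walk_to_pte(struct mm_struct *mm, unsigned long addr)
+	{
+		pgd_t *pgd = pgd_offset(mm, addr);
+		pud_t *pud = pud_offset(pgd, addr);
+		pmd_t *pmd = pmd_offset(pud, addr);
+
+		/* the one liner that makes this walker hugepage safe */
+		split_huge_page_pmd(mm, pmd);
+		return pte_offset_map(pmd, addr);
+	}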
+
+If you're not walking pagetables but you run into a physical hugepage
+that you can't handle natively in your code, you can split it by
+calling split_huge_page(page). This is what the Linux VM does before
+it tries to swap out the hugepage, for example.
+
+== Locking in hugepage aware code ==
+
+We want as much code as possible hugepage aware, as calling
+split_huge_page() or split_huge_page_pmd() has a cost.
+
+To make pagetable walks huge pmd aware, all you need to do is to call
+pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
+mmap_sem in read (or write) mode to be sure a huge pmd cannot be
+created from under you by khugepaged (khugepaged collapse_huge_page
+takes the mmap_sem in write mode in addition to the anon_vma lock). If
+pmd_trans_huge returns false, you just fall back to the old code
+paths. If instead pmd_trans_huge returns true, you have to take the
+mm->page_table_lock and re-run pmd_trans_huge. Taking the
+page_table_lock prevents the huge pmd from being converted into a
+regular pmd from under you (split_huge_page can run in parallel to the
+pagetable walk). If the second pmd_trans_huge returns false, you
+should just drop the page_table_lock and fall back to the old code as
+before. Otherwise you should run pmd_trans_splitting on the pmd. If
+pmd_trans_splitting returns true, it means split_huge_page is already
+in the middle of splitting the page; in that case it's enough to drop
+the page_table_lock, call wait_split_huge_page and then fall back to
+the old code paths. You are guaranteed that by the time
+wait_split_huge_page returns, the pmd is no longer huge. If
+pmd_trans_splitting returns false, you can proceed to process the huge
+pmd and the hugepage natively. Once finished you can drop the
+page_table_lock.
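+
+In pseudo-C the protocol above looks roughly like this (the two
+process_* helpers are hypothetical placeholders for your native huge
+pmd code and your regular pte code; mmap_sem is assumed to be held by
+the caller):
+
+	static void walk_one_pmd(struct mm_struct *mm,
+				 struct vm_area_struct *vma,
+				 pmd_t *pmd, unsigned long addr)
+	{
+		if (pmd_trans_huge(*pmd)) {
+			spin_lock(&mm->page_table_lock);
+			if (pmd_trans_huge(*pmd)) {
+				if (pmd_trans_splitting(*pmd)) {
+					/* split_huge_page is running */
+					spin_unlock(&mm->page_table_lock);
+					wait_split_huge_page(vma->anon_vma,
+							     pmd);
+					/* the pmd is not huge anymore */
+				} else {
+					/* stable huge pmd: native path */
+					process_huge_pmd(mm, vma, pmd, addr);
+					spin_unlock(&mm->page_table_lock);
+					return;
+				}
+			} else {
+				/* became a regular pmd under us */
+				spin_unlock(&mm->page_table_lock);
+			}
+		}
+		/* graceful fallback: regular pte code path */
+		process_regular_pmd(mm, vma, pmd, addr);
+	}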
+
+== compound_lock, get_user_pages and put_page ==
+
+split_huge_page internally has to distribute the refcounts in the head
+page to the tail pages before clearing all PG_head/tail bits from the
+page structures. It can do that easily for refcounts taken by huge pmd
+mappings. But the gup API as created by hugetlbfs (which returns head
+and tail pages when get_user_pages runs on an address backed by any
+hugepage) requires the refcount to be accounted on the tail pages and
+not only on the head pages, if we want to be able to run
+split_huge_page while there are gup pins established on any tail
+page. Being unable to run split_huge_page while there's any gup pin
+on any tail page would mean having to split all hugepages upfront in
+get_user_pages, which is unacceptable as too many gup users are
+performance critical and they must work natively on hugepages like
+they already work natively on hugetlbfs (hugetlbfs is simpler because
+hugetlbfs pages cannot be split, so there is no requirement to account
+the pins on the tail pages for hugetlbfs). If we didn't account the
+gup refcounts on the tail pages during gup, we wouldn't know which
+tail page is pinned by gup and which is not while we run
+split_huge_page. But we still have to add the gup pin to the head page
+too, to know when we can free the compound page in case it's never
+split during its lifetime. That requires changing not just get_page,
+but put_page as well, so that when put_page runs on a tail page (and
+only on a tail page) it will find its respective head page and then
+decrease the head page refcount in addition to the tail page
+refcount. To obtain a head page reliably and to decrease its refcount
+without race conditions, put_page has to serialize against
+__split_huge_page_refcount using a special per-page lock called
+compound_lock.
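+
+The gup side of this accounting boils down to something like the
+following (simplified from the "alter compound get_page/put_page"
+patch later in this series, with the debugging checks removed):
+
+	static inline void get_page(struct page *page)
+	{
+		atomic_inc(&page->_count);
+		if (unlikely(PageTail(page)))
+			/* pin the head page as well as the tail page */
+			atomic_inc(&page->first_page->_count);
+	}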


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 04 of 66] define MADV_HUGEPAGE
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Define MADV_HUGEPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
---

diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
--- a/arch/alpha/include/asm/mman.h
+++ b/arch/alpha/include/asm/mman.h
@@ -53,6 +53,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
--- a/arch/mips/include/asm/mman.h
+++ b/arch/mips/include/asm/mman.h
@@ -77,6 +77,8 @@
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 #define MADV_HWPOISON    100		/* poison a page for testing */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -59,6 +59,8 @@
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	67		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
--- a/arch/xtensa/include/asm/mman.h
+++ b/arch/xtensa/include/asm/mman.h
@@ -83,6 +83,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -45,6 +45,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 04 of 66] define MADV_HUGEPAGE
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Define MADV_HUGEPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
---

diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
--- a/arch/alpha/include/asm/mman.h
+++ b/arch/alpha/include/asm/mman.h
@@ -53,6 +53,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
--- a/arch/mips/include/asm/mman.h
+++ b/arch/mips/include/asm/mman.h
@@ -77,6 +77,8 @@
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 #define MADV_HWPOISON    100		/* poison a page for testing */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
--- a/arch/parisc/include/asm/mman.h
+++ b/arch/parisc/include/asm/mman.h
@@ -59,6 +59,8 @@
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	67		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 #define MAP_VARIABLE	0
diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
--- a/arch/xtensa/include/asm/mman.h
+++ b/arch/xtensa/include/asm/mman.h
@@ -83,6 +83,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -45,6 +45,8 @@
 #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
 
+#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 05 of 66] compound_lock
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -14,6 +14,7 @@
 #include <linux/mm_types.h>
 #include <linux/range.h>
 #include <linux/pfn.h>
+#include <linux/bit_spinlock.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -302,6 +303,40 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
+static inline void compound_lock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bit_spin_lock(PG_compound_lock, &page->flags);
+#endif
+}
+
+static inline void compound_unlock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bit_spin_unlock(PG_compound_lock, &page->flags);
+#endif
+}
+
+static inline void compound_lock_irqsave(struct page *page,
+					 unsigned long *flagsp)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long flags;
+	local_irq_save(flags);
+	compound_lock(page);
+	*flagsp = flags;
+#endif
+}
+
+static inline void compound_unlock_irqrestore(struct page *page,
+					      unsigned long flags)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	compound_unlock(page);
+	local_irq_restore(flags);
+#endif
+}
+
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,9 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	PG_compound_lock,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -397,6 +400,12 @@ static inline void __ClearPageTail(struc
 #define __PG_MLOCKED		0
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
+#else
+#define __PG_COMPOUND_LOCK		0
+#endif
+
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -406,7 +415,8 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_COMPOUND_LOCK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 05 of 66] compound_lock
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -14,6 +14,7 @@
 #include <linux/mm_types.h>
 #include <linux/range.h>
 #include <linux/pfn.h>
+#include <linux/bit_spinlock.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -302,6 +303,40 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
+static inline void compound_lock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bit_spin_lock(PG_compound_lock, &page->flags);
+#endif
+}
+
+static inline void compound_unlock(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bit_spin_unlock(PG_compound_lock, &page->flags);
+#endif
+}
+
+static inline void compound_lock_irqsave(struct page *page,
+					 unsigned long *flagsp)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long flags;
+	local_irq_save(flags);
+	compound_lock(page);
+	*flagsp = flags;
+#endif
+}
+
+static inline void compound_unlock_irqrestore(struct page *page,
+					      unsigned long flags)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	compound_unlock(page);
+	local_irq_restore(flags);
+#endif
+}
+
 static inline struct page *compound_head(struct page *page)
 {
 	if (unlikely(PageTail(page)))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -108,6 +108,9 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	PG_compound_lock,
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -397,6 +400,12 @@ static inline void __ClearPageTail(struc
 #define __PG_MLOCKED		0
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
+#else
+#define __PG_COMPOUND_LOCK		0
+#endif
+
 /*
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
@@ -406,7 +415,8 @@ static inline void __ClearPageTail(struc
 	 1 << PG_private | 1 << PG_private_2 | \
 	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
-	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
+	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_COMPOUND_LOCK)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 06 of 66] alter compound get_page/put_page
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Alter compound get_page/put_page to keep references on subpages too, in order
to allow __split_huge_page_refcount to split a hugepage even while subpages
have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,6 +16,16 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
+static inline void pin_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) < 0);
+	atomic_inc(&page->_count);
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t 
 			put_page(page);
 			return 0;
 		}
+		if (PageTail(page))
+			pin_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -105,6 +105,16 @@ static inline void get_head_page_multipl
 	atomic_add(nr, &page->_count);
 }
 
+static inline void pin_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) < 0);
+	atomic_inc(&page->_count);
+}
+
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
@@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
+		if (PageTail(page))
+			pin_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -351,9 +351,17 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	if (unlikely(PageTail(page))) {
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount can't run under
+		 * get_page().
+		 */
+		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+		atomic_inc(&page->first_page->_count);
+	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -56,17 +56,83 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+}
+
+static void __put_single_page(struct page *page)
+{
+	__page_cache_release(page);
 	free_hot_cold_page(page, 0);
 }
 
+static void __put_compound_page(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	__page_cache_release(page);
+	dtor = get_compound_page_dtor(page);
+	(*dtor)(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	page = compound_head(page);
-	if (put_page_testzero(page)) {
-		compound_page_dtor *dtor;
-
-		dtor = get_compound_page_dtor(page);
-		(*dtor)(page);
+	if (unlikely(PageTail(page))) {
+		/* __split_huge_page_refcount can run under us */
+		struct page *page_head = page->first_page;
+		smp_rmb();
+		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+			unsigned long flags;
+			if (unlikely(!PageHead(page_head))) {
+				/* PageHead is cleared after PageTail */
+				smp_rmb();
+				VM_BUG_ON(PageTail(page));
+				goto out_put_head;
+			}
+			/*
+			 * Only run compound_lock on a valid PageHead,
+			 * after having it pinned with
+			 * get_page_unless_zero() above.
+			 */
+			smp_mb();
+			/* page_head wasn't a dangling pointer */
+			compound_lock_irqsave(page_head, &flags);
+			if (unlikely(!PageTail(page))) {
+				/* __split_huge_page_refcount run before us */
+				compound_unlock_irqrestore(page_head, flags);
+				VM_BUG_ON(PageHead(page_head));
+			out_put_head:
+				if (put_page_testzero(page_head))
+					__put_single_page(page_head);
+			out_put_single:
+				if (put_page_testzero(page))
+					__put_single_page(page);
+				return;
+			}
+			VM_BUG_ON(page_head != page->first_page);
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero now that
+			 * split_huge_page_refcount is blocked on the
+			 * compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+			/* __split_huge_page_refcount will wait now */
+			VM_BUG_ON(atomic_read(&page->_count) <= 0);
+			atomic_dec(&page->_count);
+			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			compound_unlock_irqrestore(page_head, flags);
+			if (put_page_testzero(page_head))
+				__put_compound_page(page_head);
+		} else {
+			/* page_head is a dangling pointer */
+			VM_BUG_ON(PageTail(page));
+			goto out_put_single;
+		}
+	} else if (put_page_testzero(page)) {
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
 	}
 }
 
@@ -75,7 +141,7 @@ void put_page(struct page *page)
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
-		__page_cache_release(page);
+		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 06 of 66] alter compound get_page/put_page
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Alter compound get_page/put_page to keep references on subpages too, in order
to allow __split_huge_page_refcount to split a hugepage even while subpages
have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,6 +16,16 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
+static inline void pin_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) < 0);
+	atomic_inc(&page->_count);
+}
+
 /*
  * The performance critical leaf functions are made noinline otherwise gcc
  * inlines everything into a single function which results in too much
@@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t 
 			put_page(page);
 			return 0;
 		}
+		if (PageTail(page))
+			pin_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -105,6 +105,16 @@ static inline void get_head_page_multipl
 	atomic_add(nr, &page->_count);
 }
 
+static inline void pin_huge_page_tail(struct page *page)
+{
+	/*
+	 * __split_huge_page_refcount() cannot run
+	 * from under us.
+	 */
+	VM_BUG_ON(atomic_read(&page->_count) < 0);
+	atomic_inc(&page->_count);
+}
+
 static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
 		unsigned long end, int write, struct page **pages, int *nr)
 {
@@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
 	do {
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
+		if (PageTail(page))
+			pin_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -351,9 +351,17 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
-	page = compound_head(page);
-	VM_BUG_ON(atomic_read(&page->_count) == 0);
+	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	if (unlikely(PageTail(page))) {
+		/*
+		 * This is safe only because
+		 * __split_huge_page_refcount can't run under
+		 * get_page().
+		 */
+		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
+		atomic_inc(&page->first_page->_count);
+	}
 }
 
 static inline struct page *virt_to_head_page(const void *x)
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -56,17 +56,83 @@ static void __page_cache_release(struct 
 		del_page_from_lru(zone, page);
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
+}
+
+static void __put_single_page(struct page *page)
+{
+	__page_cache_release(page);
 	free_hot_cold_page(page, 0);
 }
 
+static void __put_compound_page(struct page *page)
+{
+	compound_page_dtor *dtor;
+
+	__page_cache_release(page);
+	dtor = get_compound_page_dtor(page);
+	(*dtor)(page);
+}
+
 static void put_compound_page(struct page *page)
 {
-	page = compound_head(page);
-	if (put_page_testzero(page)) {
-		compound_page_dtor *dtor;
-
-		dtor = get_compound_page_dtor(page);
-		(*dtor)(page);
+	if (unlikely(PageTail(page))) {
+		/* __split_huge_page_refcount can run under us */
+		struct page *page_head = page->first_page;
+		smp_rmb();
+		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
+			unsigned long flags;
+			if (unlikely(!PageHead(page_head))) {
+				/* PageHead is cleared after PageTail */
+				smp_rmb();
+				VM_BUG_ON(PageTail(page));
+				goto out_put_head;
+			}
+			/*
+			 * Only run compound_lock on a valid PageHead,
+			 * after having it pinned with
+			 * get_page_unless_zero() above.
+			 */
+			smp_mb();
+			/* page_head wasn't a dangling pointer */
+			compound_lock_irqsave(page_head, &flags);
+			if (unlikely(!PageTail(page))) {
+				/* __split_huge_page_refcount run before us */
+				compound_unlock_irqrestore(page_head, flags);
+				VM_BUG_ON(PageHead(page_head));
+			out_put_head:
+				if (put_page_testzero(page_head))
+					__put_single_page(page_head);
+			out_put_single:
+				if (put_page_testzero(page))
+					__put_single_page(page);
+				return;
+			}
+			VM_BUG_ON(page_head != page->first_page);
+			/*
+			 * We can release the refcount taken by
+			 * get_page_unless_zero now that
+			 * split_huge_page_refcount is blocked on the
+			 * compound_lock.
+			 */
+			if (put_page_testzero(page_head))
+				VM_BUG_ON(1);
+			/* __split_huge_page_refcount will wait now */
+			VM_BUG_ON(atomic_read(&page->_count) <= 0);
+			atomic_dec(&page->_count);
+			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
+			compound_unlock_irqrestore(page_head, flags);
+			if (put_page_testzero(page_head))
+				__put_compound_page(page_head);
+		} else {
+			/* page_head is a dangling pointer */
+			VM_BUG_ON(PageTail(page));
+			goto out_put_single;
+		}
+	} else if (put_page_testzero(page)) {
+		if (PageHead(page))
+			__put_compound_page(page);
+		else
+			__put_single_page(page);
 	}
 }
 
@@ -75,7 +141,7 @@ void put_page(struct page *page)
 	if (unlikely(PageCompound(page)))
 		put_compound_page(page);
 	else if (put_page_testzero(page))
-		__page_cache_release(page);
+		__put_single_page(page);
 }
 EXPORT_SYMBOL(put_page);
 


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 07 of 66] update futex compound knowledge
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Futex code is smarter than most other gup_fast O_DIRECT code and knows about
the compound internals. However now doing a put_page(head_page) will not
release the pin on the tail page taken by gup-fast, leading to all sorts of
refcounting bugchecks. Getting a stable head_page is a little tricky.

page_head = page is there because if this is not a tail page it's also the
page_head. Only if this is a tail page is compound_head called, otherwise
it's guaranteed unnecessary. And if it's a tail page, compound_head has to run
atomically inside the irq disabled section of __get_user_pages_fast before
returning. Otherwise ->first_page won't be a stable pointer.

Disabling irqs before __get_user_pages_fast and re-enabling them after running
compound_head is needed because if __get_user_pages_fast returns == 1, it means
the huge pmd is established and cannot go away from under us.
pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait
for local_irq_enable before the IPI delivery can return. This means
__split_huge_page_refcount can't be running from under us, and in turn when we
run compound_head(page) we're not reading a dangling pointer from
tailpage->first_page. Then after we get to a stable head page, we are always
safe to call compound_lock, and after taking the compound lock on the head page
we can finally re-check whether the page returned by gup-fast is still a tail
page, in which case we're set and we didn't need to split the hugepage in order
to take a futex on it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -219,7 +219,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page;
+	struct page *page, *page_head;
 	int err;
 
 	/*
@@ -251,11 +251,46 @@ again:
 	if (err < 0)
 		return err;
 
-	page = compound_head(page);
-	lock_page(page);
-	if (!page->mapping) {
-		unlock_page(page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page_head = page;
+	if (unlikely(PageTail(page))) {
 		put_page(page);
+		/* serialize against __split_huge_page_splitting() */
+		local_irq_disable();
+		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+			page_head = compound_head(page);
+			/*
+			 * page_head is valid pointer but we must pin
+			 * it before taking the PG_lock and/or
+			 * PG_compound_lock. The moment we re-enable
+			 * irqs __split_huge_page_splitting() can
+			 * return and the head page can be freed from
+			 * under us. We can't take the PG_lock and/or
+			 * PG_compound_lock on a page that could be
+			 * freed from under us.
+			 */
+			if (page != page_head) {
+				get_page(page_head);
+				put_page(page);
+			}
+			local_irq_enable();
+		} else {
+			local_irq_enable();
+			goto again;
+		}
+	}
+#else
+	page_head = compound_head(page);
+	if (page != page_head) {
+		get_page(page_head);
+		put_page(page);
+	}
+#endif
+
+	lock_page(page_head);
+	if (!page_head->mapping) {
+		unlock_page(page_head);
+		put_page(page_head);
 		goto again;
 	}
 
@@ -266,20 +301,20 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page_head)) {
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page->mapping->host;
-		key->shared.pgoff = page->index;
+		key->shared.inode = page_head->mapping->host;
+		key->shared.pgoff = page_head->index;
 	}
 
 	get_futex_key_refs(key);
 
-	unlock_page(page);
-	put_page(page);
+	unlock_page(page_head);
+	put_page(page_head);
 	return 0;
 }
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 07 of 66] update futex compound knowledge
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Futex code is smarter than most other gup_fast O_DIRECT code and knows about
the compound internals. However now doing a put_page(head_page) will not
release the pin on the tail page taken by gup-fast, leading to all sorts of
refcounting bugchecks. Getting a stable head_page is a little tricky.

page_head = page is there because if this is not a tail page it's also the
page_head. Only if this is a tail page is compound_head called, otherwise
it's guaranteed unnecessary. And if it's a tail page, compound_head has to run
atomically inside the irq disabled section of __get_user_pages_fast before
returning. Otherwise ->first_page won't be a stable pointer.

Disabling irqs before __get_user_pages_fast and re-enabling them after running
compound_head is needed because if __get_user_pages_fast returns == 1, it means
the huge pmd is established and cannot go away from under us.
pmdp_splitting_flush_notify in __split_huge_page_splitting will have to wait
for local_irq_enable before the IPI delivery can return. This means
__split_huge_page_refcount can't be running from under us, and in turn when we
run compound_head(page) we're not reading a dangling pointer from
tailpage->first_page. Then after we get to a stable head page, we are always
safe to call compound_lock, and after taking the compound lock on the head page
we can finally re-check whether the page returned by gup-fast is still a tail
page, in which case we're set and we didn't need to split the hugepage in order
to take a futex on it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -219,7 +219,7 @@ get_futex_key(u32 __user *uaddr, int fsh
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
-	struct page *page;
+	struct page *page, *page_head;
 	int err;
 
 	/*
@@ -251,11 +251,46 @@ again:
 	if (err < 0)
 		return err;
 
-	page = compound_head(page);
-	lock_page(page);
-	if (!page->mapping) {
-		unlock_page(page);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page_head = page;
+	if (unlikely(PageTail(page))) {
 		put_page(page);
+		/* serialize against __split_huge_page_splitting() */
+		local_irq_disable();
+		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
+			page_head = compound_head(page);
+			/*
+			 * page_head is valid pointer but we must pin
+			 * it before taking the PG_lock and/or
+			 * PG_compound_lock. The moment we re-enable
+			 * irqs __split_huge_page_splitting() can
+			 * return and the head page can be freed from
+			 * under us. We can't take the PG_lock and/or
+			 * PG_compound_lock on a page that could be
+			 * freed from under us.
+			 */
+			if (page != page_head) {
+				get_page(page_head);
+				put_page(page);
+			}
+			local_irq_enable();
+		} else {
+			local_irq_enable();
+			goto again;
+		}
+	}
+#else
+	page_head = compound_head(page);
+	if (page != page_head) {
+		get_page(page_head);
+		put_page(page);
+	}
+#endif
+
+	lock_page(page_head);
+	if (!page_head->mapping) {
+		unlock_page(page_head);
+		put_page(page_head);
 		goto again;
 	}
 
@@ -266,20 +301,20 @@ again:
 	 * it's a read-only handle, it's expected that futexes attach to
 	 * the object not the particular process.
 	 */
-	if (PageAnon(page)) {
+	if (PageAnon(page_head)) {
 		key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 	} else {
 		key->both.offset |= FUT_OFF_INODE; /* inode-based key */
-		key->shared.inode = page->mapping->host;
-		key->shared.pgoff = page->index;
+		key->shared.inode = page_head->mapping->host;
+		key->shared.pgoff = page_head->index;
 	}
 
 	get_futex_key_refs(key);
 
-	unlock_page(page);
-	put_page(page);
+	unlock_page(page_head);
+	put_page(page_head);
 	return 0;
 }
 


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 08 of 66] fix bad_page to show the real reason the page is bad
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

page_count shows the count of the head page, but the actual check is done on
the tail page, so show what is really being checked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5564,7 +5564,7 @@ void dump_page(struct page *page)
 {
 	printk(KERN_ALERT
 	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
-		page, page_count(page), page_mapcount(page),
+		page, atomic_read(&page->_count), page_mapcount(page),
 		page->mapping, page->index);
 	dump_page_flags(page->flags);
 }

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 08 of 66] fix bad_page to show the real reason the page is bad
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

page_count shows the count of the head page, but the actual check is done on
the tail page, so show what is really being checked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5564,7 +5564,7 @@ void dump_page(struct page *page)
 {
 	printk(KERN_ALERT
 	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
-		page, page_count(page), page_mapcount(page),
+		page, atomic_read(&page->_count), page_mapcount(page),
 		page->mapping, page->index);
 	dump_page_flags(page->flags);
 }


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 09 of 66] clear compound mapping
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Clear the compound mapping for anonymous compound pages like it already happens
for regular anonymous pages. But crash if mapping is set for any tail page;
also, the PageAnon check is meaningless for tail pages. This check only makes
sense for the head page, for a tail page it can only hide bugs and we
definitely don't want to hide bugs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -646,13 +646,10 @@ static bool free_pages_prepare(struct pa
 	trace_mm_page_free_direct(page, order);
 	kmemcheck_free_shadow(page, order);
 
-	for (i = 0; i < (1 << order); i++) {
-		struct page *pg = page + i;
-
-		if (PageAnon(pg))
-			pg->mapping = NULL;
-		bad += free_pages_check(pg);
-	}
+	if (PageAnon(page))
+		page->mapping = NULL;
+	for (i = 0; i < (1 << order); i++)
+		bad += free_pages_check(page + i);
 	if (bad)
 		return false;
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 09 of 66] clear compound mapping
@ 2010-11-03 15:27   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Clear the compound mapping for anonymous compound pages like it already happens
for regular anonymous pages. But crash if mapping is set for any tail page;
also, the PageAnon check is meaningless for tail pages. This check only makes
sense for the head page, for a tail page it can only hide bugs and we
definitely don't want to hide bugs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -646,13 +646,10 @@ static bool free_pages_prepare(struct pa
 	trace_mm_page_free_direct(page, order);
 	kmemcheck_free_shadow(page, order);
 
-	for (i = 0; i < (1 << order); i++) {
-		struct page *pg = page + i;
-
-		if (PageAnon(pg))
-			pg->mapping = NULL;
-		bad += free_pages_check(pg);
-	}
+	if (PageAnon(page))
+		page->mapping = NULL;
+	for (i = 0; i < (1 << order); i++)
+		bad += free_pages_check(page + i);
 	if (bad)
 		return false;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 10 of 66] add native_set_pmd_at
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Used by both the paravirt and the non-paravirt versions of set_pmd_at.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -530,6 +530,12 @@ static inline void native_set_pte_at(str
 	native_set_pte(ptep, pte);
 }
 
+static inline void native_set_pmd_at(struct mm_struct *mm, unsigned long addr,
+				     pmd_t *pmdp , pmd_t pmd)
+{
+	native_set_pmd(pmdp, pmd);
+}
+
 #ifndef CONFIG_PARAVIRT
 /*
  * Rules for using pte_update - it must be called after any PTE update which

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 11 of 66] add pmd paravirt ops
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add the paravirt ops pmd_update/pmd_update_defer/set_pmd_at. Not all of them
may be necessary (VMware needs pmd_update, Xen needs set_pmd_at, nobody needs
pmd_update_defer), but they are added to keep full symmetry with the pte
paravirt ops, which looks cleaner and simpler from a common code POV.
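
For readers unfamiliar with the pattern, here is a small self-contained
userspace sketch of the op-table approach being extended; demo_mmu_ops,
demo_native_set_pmd_at and demo_nop are illustrative names, not the kernel's
pv_mmu_ops. Function pointers default to the native implementation or to a
no-op, and a hypervisor backend may override them.

#include <stdio.h>

typedef unsigned long demo_pmdval_t;

/* op table with native defaults, as pv_mmu_ops does for the real pmd ops */
struct demo_mmu_ops {
	void (*set_pmd_at)(demo_pmdval_t *pmdp, demo_pmdval_t val);
	void (*pmd_update)(demo_pmdval_t *pmdp);
};

static void demo_native_set_pmd_at(demo_pmdval_t *pmdp, demo_pmdval_t val)
{
	*pmdp = val;			/* plain store on bare metal */
}

static void demo_nop(demo_pmdval_t *pmdp)
{
	(void)pmdp;			/* paravirt_nop equivalent */
}

static struct demo_mmu_ops demo_ops = {
	.set_pmd_at	= demo_native_set_pmd_at,
	.pmd_update	= demo_nop,	/* a hypervisor may override this */
};

int main(void)
{
	demo_pmdval_t pmd = 0;

	demo_ops.set_pmd_at(&pmd, 0x80);	/* call through the op table */
	demo_ops.pmd_update(&pmd);		/* hook point for Xen/VMware-like code */
	printf("pmd = %#lx\n", pmd);
	return 0;
}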

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -435,6 +435,11 @@ static inline void pte_update(struct mm_
 {
 	PVOP_VCALL3(pv_mmu_ops.pte_update, mm, addr, ptep);
 }
+static inline void pmd_update(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update, mm, addr, pmdp);
+}
 
 static inline void pte_update_defer(struct mm_struct *mm, unsigned long addr,
 				    pte_t *ptep)
@@ -442,6 +447,12 @@ static inline void pte_update_defer(stru
 	PVOP_VCALL3(pv_mmu_ops.pte_update_defer, mm, addr, ptep);
 }
 
+static inline void pmd_update_defer(struct mm_struct *mm, unsigned long addr,
+				    pmd_t *pmdp)
+{
+	PVOP_VCALL3(pv_mmu_ops.pmd_update_defer, mm, addr, pmdp);
+}
+
 static inline pte_t __pte(pteval_t val)
 {
 	pteval_t ret;
@@ -543,6 +554,18 @@ static inline void set_pte_at(struct mm_
 		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+			      pmd_t *pmdp, pmd_t pmd)
+{
+	if (sizeof(pmdval_t) > sizeof(long))
+		/* 5 arg words */
+		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
+	else
+		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+}
+#endif
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 	pmdval_t val = native_pmd_val(pmd);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -265,10 +265,16 @@ struct pv_mmu_ops {
 	void (*set_pte_at)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, pte_t pteval);
 	void (*set_pmd)(pmd_t *pmdp, pmd_t pmdval);
+	void (*set_pmd_at)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp, pmd_t pmdval);
 	void (*pte_update)(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep);
 	void (*pte_update_defer)(struct mm_struct *mm,
 				 unsigned long addr, pte_t *ptep);
+	void (*pmd_update)(struct mm_struct *mm, unsigned long addr,
+			   pmd_t *pmdp);
+	void (*pmd_update_defer)(struct mm_struct *mm,
+				 unsigned long addr, pmd_t *pmdp);
 
 	pte_t (*ptep_modify_prot_start)(struct mm_struct *mm, unsigned long addr,
 					pte_t *ptep);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -421,8 +421,11 @@ struct pv_mmu_ops pv_mmu_ops = {
 	.set_pte = native_set_pte,
 	.set_pte_at = native_set_pte_at,
 	.set_pmd = native_set_pmd,
+	.set_pmd_at = native_set_pmd_at,
 	.pte_update = paravirt_nop,
 	.pte_update_defer = paravirt_nop,
+	.pmd_update = paravirt_nop,
+	.pmd_update_defer = paravirt_nop,
 
 	.ptep_modify_prot_start = __ptep_modify_prot_start,
 	.ptep_modify_prot_commit = __ptep_modify_prot_commit,

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 12 of 66] no paravirt version of pmd ops
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

No paravirt version of set_pmd_at/pmd_update/pmd_update_defer.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -35,6 +35,7 @@ extern struct mm_struct *pgd_page_get_mm
 #else  /* !CONFIG_PARAVIRT */
 #define set_pte(ptep, pte)		native_set_pte(ptep, pte)
 #define set_pte_at(mm, addr, ptep, pte)	native_set_pte_at(mm, addr, ptep, pte)
+#define set_pmd_at(mm, addr, pmdp, pmd)	native_set_pmd_at(mm, addr, pmdp, pmd)
 
 #define set_pte_atomic(ptep, pte)					\
 	native_set_pte_atomic(ptep, pte)
@@ -59,6 +60,8 @@ extern struct mm_struct *pgd_page_get_mm
 
 #define pte_update(mm, addr, ptep)              do { } while (0)
 #define pte_update_defer(mm, addr, ptep)        do { } while (0)
+#define pmd_update(mm, addr, ptep)              do { } while (0)
+#define pmd_update_defer(mm, addr, ptep)        do { } while (0)
 
 #define pgd_val(x)	native_pgd_val(x)
 #define __pgd(x)	native_make_pgd(x)

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 13 of 66] export maybe_mkwrite
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

huge_memory.c needs it too, when it falls back to copying hugepages into
regular fragmented pages because hugepage allocation fails during COW.
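
A tiny userspace model of the maybe_mkwrite() logic being moved to the header
(DEMO_VM_WRITE, DEMO_PAGE_RW and demo_maybe_mkwrite are illustrative stand-ins,
and the flag values are arbitrary): the write bit is only set if the vma itself
allows writing, because get_user_pages can fault for write on mappings that
don't have VM_WRITE.

#include <stdio.h>

#define DEMO_VM_WRITE	0x2UL	/* models VM_WRITE */
#define DEMO_PAGE_RW	0x2UL	/* models the pte write bit */

struct demo_vma {
	unsigned long vm_flags;
};

/* same logic as maybe_mkwrite(): only make the pte writable if the vma is */
static unsigned long demo_maybe_mkwrite(unsigned long pte,
					const struct demo_vma *vma)
{
	if (vma->vm_flags & DEMO_VM_WRITE)
		pte |= DEMO_PAGE_RW;
	return pte;
}

int main(void)
{
	struct demo_vma rw = { .vm_flags = DEMO_VM_WRITE };
	struct demo_vma ro = { .vm_flags = 0 };

	printf("writable vma:  pte=%#lx\n", demo_maybe_mkwrite(0x1, &rw));
	printf("read-only vma: pte=%#lx\n", demo_maybe_mkwrite(0x1, &ro));
	return 0;
}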

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -416,6 +416,19 @@ static inline void set_compound_order(st
 }
 
 /*
+ * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
+ * servicing faults for write access.  In the normal case, do always want
+ * pte_mkwrite.  But get_user_pages can cause write faults for mappings
+ * that do not have writing enabled, when used by access_process_vm.
+ */
+static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pte = pte_mkwrite(pte);
+	return pte;
+}
+
+/*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
  * zeroes, and text pages of executables and shared libraries have
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2048,19 +2048,6 @@ static inline int pte_unmap_same(struct 
 	return same;
 }
 
-/*
- * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
- * servicing faults for write access.  In the normal case, do always want
- * pte_mkwrite.  But get_user_pages can cause write faults for mappings
- * that do not have writing enabled, when used by access_process_vm.
- */
-static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
-{
-	if (likely(vma->vm_flags & VM_WRITE))
-		pte = pte_mkwrite(pte);
-	return pte;
-}
-
 static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
 {
 	/*

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 14 of 66] comment reminder in destroy_compound_page
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Warn destroy_compound_page that __split_huge_page_refcount is heavily dependent
on its internal behavior.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -352,6 +352,7 @@ void prep_compound_page(struct page *pag
 	}
 }
 
+/* update __split_huge_page_refcount if you change this function */
 static int destroy_compound_page(struct page *page, unsigned long order)
 {
 	int i;

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 15 of 66] config_transparent_hugepage
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add config option.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/fs/Kconfig b/fs/Kconfig
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -147,7 +147,7 @@ config HUGETLBFS
 	  If unsure, say N.
 
 config HUGETLB_PAGE
-	def_bool HUGETLBFS
+	def_bool HUGETLBFS || TRANSPARENT_HUGEPAGE
 
 source "fs/configfs/Kconfig"
 
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -302,6 +302,20 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 	  See Documentation/nommu-mmap.txt for more information.
 
+config TRANSPARENT_HUGEPAGE
+	bool "Transparent Hugepage Support" if EMBEDDED
+	depends on X86_64 && MMU
+	default y
+	help
+	  Transparent Hugepages allows the kernel to use huge pages and
+	  huge tlb transparently to the applications whenever possible.
+	  This feature can improve computing performance to certain
+	  applications by speeding up page faults during memory
+	  allocation, by reducing the number of tlb misses and by speeding
+	  up the pagetable walking.
+
+	  If memory constrained on embedded, you may want to say N.
+
 #
 # UP and nommu archs use km based percpu allocator
 #

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 16 of 66] special pmd_trans_* functions
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

These return 0 at compile time when the config option is disabled, to allow
gcc to eliminate the transparent hugepage function calls at compile time
without additional #ifdefs (only the prototypes of those functions have to be
visible to gcc, but they won't be required at link time, so huge_memory.o need
not be built at all).

_PAGE_BIT_UNUSED1 is never used for pmd, only on pte.
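
A compile-time sketch of the trick under an assumed CONFIG_DEMO_THP switch
(all demo_* names are hypothetical, not kernel identifiers): when the option is
off the predicate expands to the constant 0, so the compiler folds the whole
branch away and the huge-page helper is never reached, with no #ifdef at the
call site.

#include <stdio.h>

/* flip to 1 to model CONFIG_TRANSPARENT_HUGEPAGE=y */
#define CONFIG_DEMO_THP 0

#if CONFIG_DEMO_THP
static inline int demo_pmd_trans_huge(unsigned long pmd)
{
	return pmd & 0x80;		/* e.g. the PSE bit */
}
#else
/*
 * Constant 0: gcc folds "if (demo_pmd_trans_huge(...))" away, so the
 * huge-page path (huge_memory.o in the kernel) is never referenced.
 */
#define demo_pmd_trans_huge(pmd) 0
#endif

/* stand-in for the real huge-page handling code */
static void demo_split_huge_pmd(unsigned long *pmd)
{
	*pmd &= ~0x80UL;
}

int main(void)
{
	unsigned long pmd = 0x80;

	if (demo_pmd_trans_huge(pmd))	/* no #ifdef needed at the call site */
		demo_split_huge_pmd(&pmd);
	printf("pmd=%#lx\n", pmd);
	return 0;
}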

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -168,6 +168,19 @@ extern void cleanup_highmap(void);
 #define	kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK)
 
 #define __HAVE_ARCH_PTE_SAME
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -22,6 +22,7 @@
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
+#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
 #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
 
 /* If _PAGE_BIT_PRESENT is clear, we use these: */
@@ -45,6 +46,7 @@
 #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
 #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
 #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
+#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
 #define __HAVE_ARCH_PTE_SPECIAL
 
 #ifdef CONFIG_KMEMCHECK
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -348,6 +348,11 @@ extern void untrack_pfn_vma(struct vm_ar
 				unsigned long size);
 #endif
 
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_trans_huge(pmd) 0
+#define pmd_trans_splitting(pmd) 0
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_GENERIC_PGTABLE_H */

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 17 of 66] add pmd mangling generic functions
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Some of these are needed to build, but are never actually used, on archs that
don't support transparent hugepages. Others, like pmdp_clear_flush, are used
by x86 too.
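
The !CONFIG_TRANSPARENT_HUGEPAGE fallbacks below rely on GNU statement
expressions so a stub can both BUG() and still yield a value. A minimal
userspace model of that pattern, assuming gcc or clang (DEMO_BUG and
demo_pmdp_test_and_clear_young are made-up names, not kernel API):

#include <stdio.h>
#include <stdlib.h>

#define DEMO_BUG() \
	do { fprintf(stderr, "BUG at %s:%d\n", __FILE__, __LINE__); \
	     abort(); } while (0)

/*
 * GNU statement expression: the stub can BUG() if it is ever reached and
 * still yields a value, so callers type-check without any #ifdef.
 */
#define demo_pmdp_test_and_clear_young(pmdp)	({ DEMO_BUG(); 0; })

int main(void)
{
	unsigned long pmd = 0;

	/* guarded with 0 so the stub is never evaluated at run time */
	if (0 && demo_pmdp_test_and_clear_young(&pmd))
		puts("unreachable");
	printf("stub compiled and type-checked, pmd=%#lx\n", pmd);
	return 0;
}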

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -25,6 +25,26 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({								\
+		int __changed = !pmd_same(*(__pmdp), __entry);		\
+		VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);		\
+		if (__changed) {					\
+			set_pmd_at((__vma)->vm_mm, __address, __pmdp,	\
+				   __entry);				\
+			flush_tlb_range(__vma, __address,		\
+					(__address) + HPAGE_PMD_SIZE);	\
+		}							\
+		__changed;						\
+	})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_set_access_flags(__vma, __address, __pmdp, __entry, __dirty) \
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
 #define ptep_test_and_clear_young(__vma, __address, __ptep)		\
 ({									\
@@ -39,6 +59,25 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	int r = 1;							\
+	if (!pmd_young(__pmd))						\
+		r = 0;							\
+	else								\
+		set_pmd_at((__vma)->vm_mm, (__address),			\
+			   (__pmdp), pmd_mkold(__pmd));			\
+	r;								\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_test_and_clear_young(__vma, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young(__vma, __address, __ptep)		\
 ({									\
@@ -50,6 +89,24 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	__young = pmdp_test_and_clear_young(__vma, __address, __pmdp);	\
+	if (__young)							\
+		flush_tlb_range(__vma, __address,			\
+				(__address) + HPAGE_PMD_SIZE);		\
+	__young;							\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_clear_flush_young(__vma, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear(__mm, __address, __ptep)			\
 ({									\
@@ -59,6 +116,20 @@
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_GET_AND_CLEAR
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_get_and_clear(__mm, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = *(__pmdp);					\
+	pmd_clear((__mm), (__address), (__pmdp));			\
+	__pmd;								\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_get_and_clear(__mm, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
 #define ptep_get_and_clear_full(__mm, __address, __ptep, __full)	\
 ({									\
@@ -90,6 +161,22 @@ do {									\
 })
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_clear_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd;							\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	__pmd = pmdp_get_and_clear((__vma)->vm_mm, __address, __pmdp);	\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+	__pmd;								\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_clear_flush(__vma, __address, __pmdp)	\
+	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
 struct mm_struct;
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
@@ -99,10 +186,45 @@ static inline void ptep_set_wrprotect(st
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long address, pmd_t *pmdp)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_set_wrprotect(mm, address, pmdp) BUG()
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmdp_splitting_flush(__vma, __address, __pmdp)			\
+({									\
+	pmd_t __pmd = pmd_mksplitting(*(__pmdp));			\
+	VM_BUG_ON((__address) & ~HPAGE_PMD_MASK);			\
+	set_pmd_at((__vma)->vm_mm, __address, __pmdp, __pmd);		\
+	/* tlb flush only to serialize against gup-fast */		\
+	flush_tlb_range(__vma, __address, (__address) + HPAGE_PMD_SIZE);\
+})
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmdp_splitting_flush(__vma, __address, __pmdp) BUG()
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
 #define pte_same(A,B)	(pte_val(A) == pte_val(B))
 #endif
 
+#ifndef __HAVE_ARCH_PMD_SAME
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define pmd_same(A,B)	(pmd_val(A) == pmd_val(B))
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define pmd_same(A,B)	({ BUG(); 0; })
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
 #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY
 #define page_test_dirty(page)		(0)
 #endif
@@ -351,6 +473,9 @@ extern void untrack_pfn_vma(struct vm_ar
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_trans_huge(pmd) 0
 #define pmd_trans_splitting(pmd) 0
+#ifndef __HAVE_ARCH_PMD_WRITE
+#define pmd_write(pmd)	({ BUG(); 0; })
+#endif /* __HAVE_ARCH_PMD_WRITE */
 #endif
 
 #endif /* !__ASSEMBLY__ */

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 18 of 66] add pmd mangling functions to x86
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add the needed pmd mangling functions, in symmetry with their pte
counterparts. pmdp_freeze_flush is the only exception that exists only on the
pmd side: it is needed to serialize the VM against split_huge_page. It simply
clears the present bit atomically, in the same way pmdp_clear_flush_young
atomically clears the accessed bit (and both need to flush the TLB to become
effective, which for pmdp_freeze_flush must happen synchronously).
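
A userspace model of the test-and-clear plus conditional TLB flush done by
pmdp_clear_flush_young()/pmdp_test_and_clear_young() below, assuming C11
atomics; the bit position and the demo_* names are illustrative, not kernel
code. The accessed bit is cleared atomically in the pmd word and the TLB is
flushed only if the bit was actually set.

#include <stdatomic.h>
#include <stdio.h>

#define DEMO_ACCESSED_BIT	(1UL << 5)	/* models _PAGE_ACCESSED */

static _Atomic unsigned long demo_pmd = DEMO_ACCESSED_BIT | 0x80;

static void demo_flush_tlb_range(void)
{
	puts("flush_tlb_range() over HPAGE_PMD_SIZE");
}

/* clear the accessed bit atomically; flush the TLB only if it was set */
static int demo_pmdp_clear_flush_young(void)
{
	unsigned long old = atomic_fetch_and(&demo_pmd, ~DEMO_ACCESSED_BIT);
	int young = !!(old & DEMO_ACCESSED_BIT);

	if (young)
		demo_flush_tlb_range();
	return young;
}

int main(void)
{
	printf("first call:  young=%d\n", demo_pmdp_clear_flush_young());
	printf("second call: young=%d\n", demo_pmdp_clear_flush_young());
	return 0;
}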

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -302,15 +302,15 @@ pmd_t *populate_extra_pmd(unsigned long 
 pte_t *populate_extra_pte(unsigned long vaddr);
 #endif	/* __ASSEMBLY__ */
 
+#ifndef __ASSEMBLY__
+#include <linux/mm_types.h>
+
 #ifdef CONFIG_X86_32
 # include "pgtable_32.h"
 #else
 # include "pgtable_64.h"
 #endif
 
-#ifndef __ASSEMBLY__
-#include <linux/mm_types.h>
-
 static inline int pte_none(pte_t pte)
 {
 	return !pte.pte;
@@ -353,7 +353,7 @@ static inline unsigned long pmd_page_vad
  * Currently stuck as a macro due to indirect forward reference to
  * linux/mmzone.h's __section_mem_map_addr() definition:
  */
-#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
+#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
 
 /*
  * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -59,6 +59,16 @@ static inline void native_set_pte_atomic
 	native_set_pte(ptep, pte);
 }
 
+static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+{
+	*pmdp = pmd;
+}
+
+static inline void native_pmd_clear(pmd_t *pmd)
+{
+	native_set_pmd(pmd, native_make_pmd(0));
+}
+
 static inline pte_t native_ptep_get_and_clear(pte_t *xp)
 {
 #ifdef CONFIG_SMP
@@ -72,14 +82,17 @@ static inline pte_t native_ptep_get_and_
 #endif
 }
 
-static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
 {
-	*pmdp = pmd;
-}
-
-static inline void native_pmd_clear(pmd_t *pmd)
-{
-	native_set_pmd(pmd, native_make_pmd(0));
+#ifdef CONFIG_SMP
+	return native_make_pmd(xchg(&xp->pmd, 0));
+#else
+	/* native_local_pmdp_get_and_clear,
+	   but duplicated because of cyclic dependency */
+	pmd_t ret = *xp;
+	native_pmd_clear(xp);
+	return ret;
+#endif
 }
 
 static inline void native_set_pud(pud_t *pudp, pud_t pud)
@@ -181,6 +194,98 @@ static inline int pmd_trans_huge(pmd_t p
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
+
+#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
+
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
+	pmd_update(mm, addr, pmdp);
+}
+
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return native_make_pmd(v & ~clear);
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -320,6 +320,25 @@ int ptep_set_access_flags(struct vm_area
 	return changed;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp,
+			  pmd_t entry, int dirty)
+{
+	int changed = !pmd_same(*pmdp, entry);
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	if (changed && dirty) {
+		*pmdp = entry;
+		pmd_update_defer(vma->vm_mm, address, pmdp);
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+
+	return changed;
+}
+#endif
+
 int ptep_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pte_t *ptep)
 {
@@ -335,6 +354,23 @@ int ptep_test_and_clear_young(struct vm_
 	return ret;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+			      unsigned long addr, pmd_t *pmdp)
+{
+	int ret = 0;
+
+	if (pmd_young(*pmdp))
+		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+					 (unsigned long *) &pmdp->pmd);
+
+	if (ret)
+		pmd_update(vma->vm_mm, addr, pmdp);
+
+	return ret;
+}
+#endif
+
 int ptep_clear_flush_young(struct vm_area_struct *vma,
 			   unsigned long address, pte_t *ptep)
 {
@@ -347,6 +383,36 @@ int ptep_clear_flush_young(struct vm_are
 	return young;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp)
+{
+	int young;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+
+	return young;
+}
+
+void pmdp_splitting_flush(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp)
+{
+	int set;
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
+				(unsigned long *)&pmdp->pmd);
+	if (set) {
+		pmd_update(vma->vm_mm, address, pmdp);
+		/* need tlb flush only to serialize against gup-fast */
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+}
+#endif
+
 /**
  * reserve_top_address - reserves a hole in the top of kernel address space
  * @reserve - size of hole to reserve

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 19 of 66] bail out gup_fast on splitting pmd
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Force gup_fast to take the slow path and block if the pmd is splitting, not
only if it's none.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -160,7 +160,18 @@ static int gup_pmd_range(pud_t pud, unsi
 		pmd_t pmd = *pmdp;
 
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(pmd))
+		/*
+		 * The pmd_trans_splitting() check below explains why
+		 * pmdp_splitting_flush has to flush the tlb, to stop
+		 * this gup-fast code from running while we set the
+		 * splitting bit in the pmd. Returning zero will take
+		 * the slow path that will call wait_split_huge_page()
+		 * if the pmd is still in splitting state. gup-fast
+		 * can't because it has irq disabled and
+		 * wait_split_huge_page() would never return as the
+		 * tlb flush IPI wouldn't run.
+		 */
+		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
 			return 0;
 		if (unlikely(pmd_large(pmd))) {
 			if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
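
As a self-contained illustration of the bail-out this hunk adds (the pmd encoding and helpers below are invented; only the control flow mirrors the patch):

#include <stdbool.h>
#include <stdio.h>

/* Toy pmd encoding: bit 0 = present, bit 9 = splitting (invented values). */
typedef unsigned long pmd_t;
#define PMD_PRESENT   (1UL << 0)
#define PMD_SPLITTING (1UL << 9)

static bool pmd_none(pmd_t pmd)            { return pmd == 0; }
static bool pmd_trans_splitting(pmd_t pmd) { return pmd & PMD_SPLITTING; }

/* Returns 0 to tell the caller to fall back to the blocking slow path. */
static int gup_fast_pmd(pmd_t pmd)
{
	if (pmd_none(pmd) || pmd_trans_splitting(pmd))
		return 0;	/* bail out: the slow path can wait for the split */
	return 1;		/* safe to keep walking under this pmd */
}

int main(void)
{
	printf("none:      %d\n", gup_fast_pmd(0));
	printf("splitting: %d\n", gup_fast_pmd(PMD_PRESENT | PMD_SPLITTING));
	printf("mapped:    %d\n", gup_fast_pmd(PMD_PRESENT));
	return 0;
}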

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 20 of 66] pte alloc trans splitting
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

The pte alloc routines must wait for split_huge_page if the pmd is not
present yet not none (i.e. pmd_trans_splitting). The additional
branches are optimized away at compile time by pmd_trans_splitting when
the config option is off. However, we must pass the vma down in order
to know which anon_vma lock to wait on.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -133,7 +133,7 @@ static int map_tboot_page(unsigned long 
 	pmd = pmd_alloc(&tboot_mm, pud, vaddr);
 	if (!pmd)
 		return -1;
-	pte = pte_alloc_map(&tboot_mm, pmd, vaddr);
+	pte = pte_alloc_map(&tboot_mm, NULL, pmd, vaddr);
 	if (!pte)
 		return -1;
 	set_pte_at(&tboot_mm, vaddr, pte, pfn_pte(pfn, prot));
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,7 +1117,8 @@ static inline int __pmd_alloc(struct mm_
 int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
 #endif
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address);
 int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);
 
 /*
@@ -1186,16 +1187,18 @@ static inline void pgtable_page_dtor(str
 	pte_unmap(pte);					\
 } while (0)
 
-#define pte_alloc_map(mm, pmd, address)			\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
-		NULL: pte_offset_map(pmd, address))
+#define pte_alloc_map(mm, vma, pmd, address)				\
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, vma,	\
+							pmd, address))?	\
+	 NULL: pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc(mm, pmd, address))? \
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, NULL,	\
+							pmd, address))?	\
 		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
-	((unlikely(!pmd_present(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, address))? \
 		NULL: pte_offset_kernel(pmd, address))
 
 extern void free_area_init(unsigned long * zones_size);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -394,9 +394,11 @@ void free_pgtables(struct mmu_gather *tl
 	}
 }
 
-int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
+int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long address)
 {
 	pgtable_t new = pte_alloc_one(mm, address);
+	int wait_split_huge_page;
 	if (!new)
 		return -ENOMEM;
 
@@ -416,14 +418,18 @@ int __pte_alloc(struct mm_struct *mm, pm
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
 	spin_lock(&mm->page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	wait_split_huge_page = 0;
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm->nr_ptes++;
 		pmd_populate(mm, pmd, new);
 		new = NULL;
-	}
+	} else if (unlikely(pmd_trans_splitting(*pmd)))
+		wait_split_huge_page = 1;
 	spin_unlock(&mm->page_table_lock);
 	if (new)
 		pte_free(mm, new);
+	if (wait_split_huge_page)
+		wait_split_huge_page(vma->anon_vma, pmd);
 	return 0;
 }
 
@@ -436,10 +442,11 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&init_mm.page_table_lock);
-	if (!pmd_present(*pmd)) {	/* Has another populated it ? */
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
-	}
+	} else
+		VM_BUG_ON(pmd_trans_splitting(*pmd));
 	spin_unlock(&init_mm.page_table_lock);
 	if (new)
 		pte_free_kernel(&init_mm, new);
@@ -3215,7 +3222,7 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, pmd, address);
+	pte = pte_alloc_map(mm, vma, pmd, address);
 	if (!pte)
 		return VM_FAULT_OOM;
 
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -47,7 +47,8 @@ static pmd_t *get_old_pmd(struct mm_stru
 	return pmd;
 }
 
-static pmd_t *alloc_new_pmd(struct mm_struct *mm, unsigned long addr)
+static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -62,7 +63,8 @@ static pmd_t *alloc_new_pmd(struct mm_st
 	if (!pmd)
 		return NULL;
 
-	if (!pmd_present(*pmd) && __pte_alloc(mm, pmd, addr))
+	VM_BUG_ON(pmd_trans_huge(*pmd));
+	if (pmd_none(*pmd) && __pte_alloc(mm, vma, pmd, addr))
 		return NULL;
 
 	return pmd;
@@ -147,7 +149,7 @@ unsigned long move_page_tables(struct vm
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
-		new_pmd = alloc_new_pmd(vma->vm_mm, new_addr);
+		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
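
The __pte_alloc change above follows a classic allocate-outside-the-lock shape: allocate the pte page with no locks held, recheck the pmd under page_table_lock, drop the page if another thread won the race, and only wait for a split after releasing the lock. A minimal userspace sketch of that shape, with invented names and a pthread mutex standing in for page_table_lock (the wait-for-split branch is left out):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;	/* page_table_lock */
static void *slot;						/* the pmd entry */

static int slot_alloc(void)
{
	void *new = malloc(64);			/* like pte_alloc_one(), no locks held */
	if (!new)
		return -1;			/* -ENOMEM */

	pthread_mutex_lock(&table_lock);
	if (slot == NULL) {			/* like pmd_none(*pmd): we won the race */
		slot = new;
		new = NULL;
	}
	pthread_mutex_unlock(&table_lock);

	free(new);				/* lost the race: drop our allocation */
	return 0;
}

int main(void)
{
	slot_alloc();
	slot_alloc();				/* second caller frees its own page */
	printf("slot populated: %s\n", slot ? "yes" : "no");
	return 0;
}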

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 21 of 66] add pmd mmu_notifier helpers
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add mmu notifier helpers to handle huge pmd operations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -243,6 +243,32 @@ static inline void mmu_notifier_mm_destr
 	__pte;								\
 })
 
+#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	pmd_t __pmd;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_PMD_SIZE);\
+	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_PMD_SIZE);	\
+	__pmd;								\
+})
+
+#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
+({									\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
+	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
+					    (__address)+HPAGE_PMD_SIZE);\
+	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
+	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
+					  (__address)+HPAGE_PMD_SIZE);	\
+})
+
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -254,6 +280,17 @@ static inline void mmu_notifier_mm_destr
 	__young;							\
 })
 
+#define pmdp_clear_flush_young_notify(__vma, __address, __pmdp)		\
+({									\
+	int __young;							\
+	struct vm_area_struct *___vma = __vma;				\
+	unsigned long ___address = __address;				\
+	__young = pmdp_clear_flush_young(___vma, ___address, __pmdp);	\
+	__young |= mmu_notifier_clear_flush_young(___vma->vm_mm,	\
+						  ___address);		\
+	__young;							\
+})
+
 #define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
 ({									\
 	struct mm_struct *___mm = __mm;					\
@@ -305,7 +342,10 @@ static inline void mmu_notifier_mm_destr
 }
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define pmdp_clear_flush_notify pmdp_clear_flush
+#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
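
All of the _notify wrappers above share one shape: bracket the underlying primitive with the invalidate_range_start/end callouts and hand the primitive's return value back through a statement-expression macro. A compiler-specific userspace sketch of that shape (GCC/Clang statement expressions; the hook and primitive names are invented for the example):

#include <stdio.h>

/* Invented stand-ins for the notifier hooks and the wrapped primitive. */
static void range_start(unsigned long a, unsigned long b) { printf("start %#lx-%#lx\n", a, b); }
static void range_end(unsigned long a, unsigned long b)   { printf("end   %#lx-%#lx\n", a, b); }
static int clear_flush(unsigned long addr) { (void)addr; return 42; }

/* Same shape as pmdp_clear_flush_notify: hooks around the call, value passed back. */
#define clear_flush_notify(__addr, __size)			\
({								\
	unsigned long ___address = (__addr);			\
	int ___ret;						\
	range_start(___address, ___address + (__size));	\
	___ret = clear_flush(___address);			\
	range_end(___address, ___address + (__size));		\
	___ret;							\
})

int main(void)
{
	int v = clear_flush_notify(0x200000UL, 2UL << 20);
	printf("primitive returned %d\n", v);
	return 0;
}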

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 22 of 66] clear page compound
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page must transform a compound page to a regular page and needs
ClearPageCompound.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -347,7 +347,7 @@ static inline void set_page_writeback(st
  * tests can be used in performance sensitive paths. PageCompound is
  * generally not used in hot code paths.
  */
-__PAGEFLAG(Head, head)
+__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
 __PAGEFLAG(Tail, tail)
 
 static inline int PageCompound(struct page *page)
@@ -355,6 +355,13 @@ static inline int PageCompound(struct pa
 	return page->flags & ((1L << PG_head) | (1L << PG_tail));
 
 }
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON(!PageHead(page));
+	ClearPageHead(page);
+}
+#endif
 #else
 /*
  * Reduce page flag use as much as possible by overlapping
@@ -392,6 +399,14 @@ static inline void __ClearPageTail(struc
 	page->flags &= ~PG_head_tail_mask;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void ClearPageCompound(struct page *page)
+{
+	BUG_ON((page->flags & PG_head_tail_mask) != (1 << PG_compound));
+	clear_bit(PG_compound, &page->flags);
+}
+#endif
+
 #endif /* !PAGEFLAGS_EXTENDED */
 
 #ifdef CONFIG_MMU
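
A rough userspace sketch of the ClearPageCompound semantics added here, with assert() standing in for BUG_ON and invented bit positions (illustration only, not the kernel flag encoding):

#include <assert.h>
#include <stdio.h>

/* Invented bit positions; only the head/tail distinction is kept. */
#define PG_head (1UL << 15)
#define PG_tail (1UL << 16)

struct page { unsigned long flags; };

static int PageCompound(const struct page *p)
{
	return (p->flags & (PG_head | PG_tail)) != 0;
}

/* Only a head page may be downgraded, as the BUG_ON in the patch enforces. */
static void ClearPageCompound(struct page *p)
{
	assert(p->flags & PG_head);	/* stands in for BUG_ON(!PageHead(page)) */
	p->flags &= ~PG_head;
}

int main(void)
{
	struct page head = { .flags = PG_head };

	printf("compound before split: %d\n", PageCompound(&head));
	ClearPageCompound(&head);
	printf("compound after split:  %d\n", PageCompound(&head));
	return 0;
}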

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 23 of 66] add pmd_huge_pte to mm_struct
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

This increases the size of the mm struct a bit, but it is needed to preallocate
one pte for each hugepage so that split_huge_page will not require a fail path.
Guaranteed success is a fundamental property of split_huge_page: it avoids
decreasing swapping reliability and avoids adding -ENOMEM fail paths that
would otherwise force the hugepage-unaware VM code to learn to roll back in the
middle of its pte mangling operations (if anything, we need it to learn to
handle pmd_trans_huge natively rather than to become capable of rollback). When
split_huge_page runs, a pte page is needed for the split to succeed, to map the
newly split regular pages with regular ptes.  This way all existing VM code
remains backwards compatible by just adding a split_huge_page* one-liner. The
memory waste of those preallocated ptes is negligible and so it is worth it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -310,6 +310,9 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
+#endif
 	/* How many tasks sharing this mm are OOM_DISABLE */
 	atomic_t oom_disable_count;
 };
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -527,6 +527,9 @@ void __mmdrop(struct mm_struct *mm)
 	mm_free_pgd(mm);
 	destroy_context(mm);
 	mmu_notifier_mm_destroy(mm);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON(mm->pmd_huge_pte);
+#endif
 	free_mm(mm);
 }
 EXPORT_SYMBOL_GPL(__mmdrop);
@@ -667,6 +670,10 @@ struct mm_struct *dup_mm(struct task_str
 	mm->token_priority = 0;
 	mm->last_interval = 0;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	mm->pmd_huge_pte = NULL;
+#endif
+
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
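
The point of pmd_huge_pte is that the pte page backing a future split is reserved while -ENOMEM can still be returned, so split_huge_page itself never needs a failure path. A minimal sketch of that preallocation idea, with all names invented for the illustration:

#include <stdio.h>
#include <stdlib.h>

struct mapping {
	void *spare_pte;		/* plays the role of mm->pmd_huge_pte */
};

static int map_hugepage(struct mapping *m)
{
	m->spare_pte = malloc(4096);	/* preallocate while failing is still allowed */
	return m->spare_pte ? 0 : -1;	/* the fault path may return -ENOMEM */
}

static void split_hugepage(struct mapping *m)
{
	/* Cannot fail: the pte page was reserved up front at fault time. */
	void *pte = m->spare_pte;
	m->spare_pte = NULL;
	printf("split using preallocated pte page %p\n", pte);
	free(pte);
}

int main(void)
{
	struct mapping m = { 0 };

	if (map_hugepage(&m))
		return 1;
	split_hugepage(&m);
	return 0;
}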
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 24 of 66] split_huge_page_mm/vma
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:27   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:27 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

split_huge_page_pmd compat code. Each one of those call sites would need to be
expanded into hundreds of lines of complex code without a fully reliable
split_huge_page_pmd design.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -179,6 +179,7 @@ static void mark_screen_rdonly(struct mm
 	if (pud_none_or_clear_bad(pud))
 		goto out;
 	pmd = pmd_offset(pud, 0xA0000);
+	split_huge_page_pmd(mm, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		goto out;
 	pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,6 +514,7 @@ static inline int check_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(vma->vm_mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -154,6 +154,7 @@ static void mincore_pmd_range(struct vm_
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(vma->vm_mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			mincore_unmapped_range(vma, addr, next, vec);
 		else
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -88,6 +88,7 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_pmd(mm, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -34,6 +34,7 @@ static int walk_pmd_range(pud_t *pud, un
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		split_huge_page_pmd(walk->mm, pmd);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk->pte_hole)
 				err = walk->pte_hole(addr, next, walk);
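
Every walker touched above iterates in pmd-sized steps using pmd_addr_end() and calls split_huge_page_pmd() before the legacy per-pte code runs. A standalone sketch of just the pmd_addr_end() arithmetic (2MB steps hard-coded; the real macro also guards against address wrap-around):

#include <stdio.h>

/* 2MB pmd granularity as on x86-64; sizes hard-coded for the example. */
#define PMD_SIZE (2UL << 20)
#define PMD_MASK (~(PMD_SIZE - 1))

/* Next pmd boundary after addr, clamped to end (ignoring wrap-around). */
static unsigned long pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;
	return boundary < end ? boundary : end;
}

int main(void)
{
	unsigned long addr = 0x100000;		/* deliberately not pmd-aligned */
	unsigned long end  = 0x500000;

	do {
		unsigned long next = pmd_addr_end(addr, end);
		/* the walkers above call split_huge_page_pmd() at this point */
		printf("pmd step: %#lx - %#lx\n", addr, next);
		addr = next;
	} while (addr != end);
	return 0;
}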

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 25 of 66] split_huge_page paging
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Paging logic that splits the page before it is unmapped and added to swap, to
ensure backwards compatibility with the legacy swap code. Eventually swap
should natively page out hugepages to increase performance and decrease
seeking and fragmentation of swap space. swapoff can just skip over huge pmds as
they cannot be part of swap yet. In add_to_swap we are careful to split the page
only if we got a valid swap entry, so we don't split hugepages when swap is full.

In theory we could split pages before isolating them during the lru scan, but
for khugepaged to be safe, I'm relying on either the mmap_sem write mode or the
PG_lock being taken, so split_huge_page has to run either with mmap_sem held in
read or write mode or with the PG_lock taken. Calling it from isolate_lru_page
would make locking more complicated; in addition, split_huge_page would deadlock
if called by __isolate_lru_page because it has to take the lru lock to add the
tail pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -385,6 +385,8 @@ static void collect_procs_anon(struct pa
 	struct task_struct *tsk;
 	struct anon_vma *av;
 
+	if (unlikely(split_huge_page(page)))
+		return;
 	read_lock(&tasklist_lock);
 	av = page_lock_anon_vma(page);
 	if (av == NULL)	/* Not actually mapped anymore */
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1372,6 +1372,7 @@ int try_to_unmap(struct page *page, enum
 	int ret;
 
 	BUG_ON(!PageLocked(page));
+	BUG_ON(PageTransHuge(page));
 
 	if (unlikely(PageKsm(page)))
 		ret = try_to_unmap_ksm(page, flags);
diff --git a/mm/swap_state.c b/mm/swap_state.c
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -157,6 +157,12 @@ int add_to_swap(struct page *page)
 	if (!entry.val)
 		return 0;
 
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page))) {
+			swapcache_free(entry, NULL);
+			return 0;
+		}
+
 	/*
 	 * Radix-tree node allocations from PF_MEMALLOC contexts could
 	 * completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/swapfile.c b/mm/swapfile.c
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -964,6 +964,8 @@ static inline int unuse_pmd_range(struct
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (unlikely(pmd_trans_huge(*pmd)))
+			continue;
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
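
The add_to_swap() hunk above is careful about ordering: obtain a swap entry first, split only if that succeeded, and give the entry back if the split fails. A stubbed-out userspace sketch of that ordering (all helpers below are invented stand-ins, not the kernel API):

#include <stdbool.h>
#include <stdio.h>

struct page { bool huge; };

static int  get_swap_slot(void)              { return 1; }	/* 0 means swap is full */
static void free_swap_slot(int slot)         { printf("released slot %d\n", slot); }
static int  split_huge_page(struct page *p)  { p->huge = false; return 0; }

static int add_to_swap(struct page *page)
{
	int slot = get_swap_slot();
	if (!slot)
		return 0;		/* swap full: the hugepage is never split */

	if (page->huge && split_huge_page(page)) {
		free_swap_slot(slot);	/* split failed: give the entry back */
		return 0;
	}

	printf("page added to swap slot %d\n", slot);
	return 1;
}

int main(void)
{
	struct page p = { .huge = true };
	return !add_to_swap(&p);
}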

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 26 of 66] clear_copy_huge_page
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Move the copy/clear_huge_page functions to common code so they can be shared
between hugetlb.c and huge_memory.c.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1574,5 +1574,14 @@ static inline int is_hwpoison_address(un
 
 extern void dump_page(struct page *page);
 
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+extern void clear_huge_page(struct page *page,
+			    unsigned long addr,
+			    unsigned int pages_per_huge_page);
+extern void copy_user_huge_page(struct page *dst, struct page *src,
+				unsigned long addr, struct vm_area_struct *vma,
+				unsigned int pages_per_huge_page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -394,70 +394,6 @@ static int vma_has_reserves(struct vm_ar
 	return 0;
 }
 
-static void clear_gigantic_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-	struct page *p = page;
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++, p = mem_map_next(p, page, i)) {
-		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
-	}
-}
-static void clear_huge_page(struct page *page,
-			unsigned long addr, unsigned long sz)
-{
-	int i;
-
-	if (unlikely(sz/PAGE_SIZE > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, sz);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < sz/PAGE_SIZE; i++) {
-		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
-	}
-}
-
-static void copy_user_gigantic_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-
-	for (i = 0; i < pages_per_huge_page(h); ) {
-		cond_resched();
-		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-
-static void copy_user_huge_page(struct page *dst, struct page *src,
-			   unsigned long addr, struct vm_area_struct *vma)
-{
-	int i;
-	struct hstate *h = hstate_vma(vma);
-
-	if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
-		copy_user_gigantic_page(dst, src, addr, vma);
-		return;
-	}
-
-	might_sleep();
-	for (i = 0; i < pages_per_huge_page(h); i++) {
-		cond_resched();
-		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
-	}
-}
 
 static void copy_gigantic_page(struct page *dst, struct page *src)
 {
@@ -2454,7 +2390,8 @@ retry_avoidcopy:
 		return VM_FAULT_OOM;
 	}
 
-	copy_user_huge_page(new_page, old_page, address, vma);
+	copy_user_huge_page(new_page, old_page, address, vma,
+			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
 	/*
@@ -2558,7 +2495,7 @@ retry:
 			ret = -PTR_ERR(page);
 			goto out;
 		}
-		clear_huge_page(page, address, huge_page_size(h));
+		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3602,3 +3602,74 @@ void might_fault(void)
 }
 EXPORT_SYMBOL(might_fault);
 #endif
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+static void clear_gigantic_page(struct page *page,
+				unsigned long addr,
+				unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *p = page;
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page;
+	     i++, p = mem_map_next(p, page, i)) {
+		cond_resched();
+		clear_user_highpage(p, addr + i * PAGE_SIZE);
+	}
+}
+void clear_huge_page(struct page *page,
+		     unsigned long addr, unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		clear_gigantic_page(page, addr, pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+	}
+}
+
+static void copy_user_gigantic_page(struct page *dst, struct page *src,
+				    unsigned long addr,
+				    struct vm_area_struct *vma,
+				    unsigned int pages_per_huge_page)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < pages_per_huge_page; ) {
+		cond_resched();
+		copy_user_highpage(dst, src, addr + i*PAGE_SIZE, vma);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void copy_user_huge_page(struct page *dst, struct page *src,
+			 unsigned long addr, struct vm_area_struct *vma,
+			 unsigned int pages_per_huge_page)
+{
+	int i;
+
+	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
+		copy_user_gigantic_page(dst, src, addr, vma,
+					pages_per_huge_page);
+		return;
+	}
+
+	might_sleep();
+	for (i = 0; i < pages_per_huge_page; i++) {
+		cond_resched();
+		copy_user_highpage(dst + i, src + i, addr + i*PAGE_SIZE, vma);
+	}
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 27 of 66] kvm mmu transparent hugepage support
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Marcelo Tosatti <mtosatti@redhat.com>

This should work for both hugetlbfs and transparent hugepages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -543,10 +543,20 @@ static int has_wrprotected_page(struct k
 
 static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
 {
-	unsigned long page_size;
+	unsigned long page_size, addr;
 	int i, ret = 0;
 
-	page_size = kvm_host_page_size(kvm, gfn);
+	page_size = kvm_host_page_size(kvm, gfn, &addr);
+
+	/* check for transparent hugepages */
+	if (page_size == PAGE_SIZE && !kvm_is_error_hva(addr)) {
+		pfn_t pfn = hva_to_pfn(kvm, addr, 0);
+
+		if (!is_error_pfn(pfn) && !kvm_is_mmio_pfn(pfn) &&
+		    PageTransCompound(pfn_to_page(pfn)))
+			page_size = KVM_HPAGE_SIZE(2);
+		kvm_release_pfn_clean(pfn);
+	}
 
 	for (i = PT_PAGE_TABLE_LEVEL;
 	     i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
@@ -2283,6 +2293,8 @@ static int nonpaging_map(struct kvm_vcpu
 	pfn_t pfn;
 	unsigned long mmu_seq;
 
+	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	smp_rmb();
 	level = mapping_level(vcpu, gfn);
 
 	/*
@@ -2294,8 +2306,6 @@ static int nonpaging_map(struct kvm_vcpu
 
 	gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
 
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-	smp_rmb();
 	pfn = gfn_to_pfn(vcpu->kvm, gfn);
 
 	/* mmio */
@@ -2601,12 +2611,12 @@ static int tdp_page_fault(struct kvm_vcp
 	if (r)
 		return r;
 
-	level = mapping_level(vcpu, gfn);
-
-	gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
-
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
+	level = mapping_level(vcpu, gfn);
+
+	gfn &= ~(KVM_PAGES_PER_HPAGE(level) - 1);
+
 	pfn = gfn_to_pfn(vcpu->kvm, gfn);
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, gfn, pfn);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -561,13 +561,13 @@ static int FNAME(page_fault)(struct kvm_
 		return 0;
 	}
 
+	mmu_seq = vcpu->kvm->mmu_notifier_seq;
+	smp_rmb();
 	if (walker.level >= PT_DIRECTORY_LEVEL) {
 		level = min(walker.level, mapping_level(vcpu, walker.gfn));
 		walker.gfn = walker.gfn & ~(KVM_PAGES_PER_HPAGE(level) - 1);
 	}
 
-	mmu_seq = vcpu->kvm->mmu_notifier_seq;
-	smp_rmb();
 	pfn = gfn_to_pfn(vcpu->kvm, walker.gfn);
 
 	/* mmio */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -301,6 +301,7 @@ void kvm_set_page_dirty(struct page *pag
 void kvm_set_page_accessed(struct page *page);
 
 pfn_t hva_to_pfn_atomic(struct kvm *kvm, unsigned long addr);
+pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic);
 pfn_t gfn_to_pfn_atomic(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
 pfn_t gfn_to_pfn_memslot(struct kvm *kvm,
@@ -325,7 +326,8 @@ int kvm_clear_guest_page(struct kvm *kvm
 int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len);
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn);
 int kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn);
-unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn);
+unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn,
+				 unsigned long *addr);
 void mark_page_dirty(struct kvm *kvm, gfn_t gfn);
 
 void kvm_vcpu_block(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -83,7 +83,7 @@ int kvm_iommu_map_pages(struct kvm *kvm,
 		}
 
 		/* Get the page size we could use to map */
-		page_size = kvm_host_page_size(kvm, gfn);
+		page_size = kvm_host_page_size(kvm, gfn, NULL);
 
 		/* Make sure the page_size does not exceed the memslot */
 		while ((gfn + (page_size >> PAGE_SHIFT)) > end_gfn)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -102,8 +102,36 @@ static pfn_t fault_pfn;
 inline int kvm_is_mmio_pfn(pfn_t pfn)
 {
 	if (pfn_valid(pfn)) {
-		struct page *page = compound_head(pfn_to_page(pfn));
-		return PageReserved(page);
+		struct page *head;
+		struct page *tail = pfn_to_page(pfn);
+		head = compound_head(tail);
+		if (head != tail) {
+			smp_rmb();
+			/*
+			 * head may be a dangling pointer.
+			 * __split_huge_page_refcount clears PageTail
+			 * before overwriting first_page, so if
+			 * PageTail is still there it means the head
+			 * pointer isn't dangling.
+			 */
+			if (PageTail(tail)) {
+				/*
+				 * the "head" is not a dangling
+				 * pointer but the hugepage may have
+				 * been splitted from under us (and we
+				 * may not hold a reference count on
+				 * the head page so it can be reused
+				 * before we run PageReferenced), so
+				 * we've to recheck PageTail before
+				 * returning what we just read.
+				 */
+				int reserved = PageReserved(head);
+				smp_rmb();
+				if (PageTail(tail))
+					return reserved;
+			}
+		}
+		return PageReserved(tail);
 	}
 
 	return true;
@@ -884,7 +912,8 @@ int kvm_is_visible_gfn(struct kvm *kvm, 
 }
 EXPORT_SYMBOL_GPL(kvm_is_visible_gfn);
 
-unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn)
+unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn,
+				 unsigned long *addrp)
 {
 	struct vm_area_struct *vma;
 	unsigned long addr, size;
@@ -892,6 +921,8 @@ unsigned long kvm_host_page_size(struct 
 	size = PAGE_SIZE;
 
 	addr = gfn_to_hva(kvm, gfn);
+	if (addrp)
+		*addrp = addr;
 	if (kvm_is_error_hva(addr))
 		return PAGE_SIZE;
 
@@ -946,7 +977,7 @@ unsigned long gfn_to_hva(struct kvm *kvm
 }
 EXPORT_SYMBOL_GPL(gfn_to_hva);
 
-static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
+pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic)
 {
 	struct page *page[1];
 	int npages;

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 28 of 66] _GFP_NO_KSWAPD
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Transparent hugepage allocations must be allowed not to invoke kswapd or any
other kind of indirect reclaim (especially when the defrag sysfs control is
disabled). It's unacceptable to swap out anonymous pages (potentially
anonymous transparent hugepages) in order to create new transparent hugepages.
This is true for the MADV_HUGEPAGE areas too (swapping out a kvm virtual
machine and so having it suffer an unbearable slowdown, so another one with
guest physical memory marked MADV_HUGEPAGE can run 30% faster if it is running
memory intensive workloads, makes no sense). If a transparent hugepage
allocation fails the slowdown is minor and there is total fallback, so kswapd
should never be asked to swap out memory to allow the high-order allocation to
succeed.
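
For context, a transparent hugepage allocation site is expected to pass
the new flag together with the usual high-order flags, so that a failed
attempt simply falls back to 4k pages instead of kicking background
reclaim. A minimal sketch (the helper name is made up here, and the
GFP_TRANSHUGE/HPAGE_PMD_ORDER definitions only appear in the core patch
later in this series):

	static struct page *alloc_hugepage_sketch(void)
	{
		/* __GFP_NO_KSWAPD: don't wake kswapd, just fail and
		 * let the caller fall back to regular 4k pages */
		return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
				   __GFP_NOMEMALLOC | __GFP_NORETRY |
				   __GFP_NOWARN | __GFP_NO_KSWAPD,
				   HPAGE_PMD_ORDER);
	}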

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -81,13 +81,15 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
+#define __GFP_NO_KSWAPD	((__force gfp_t)0x400000u)
+
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 22	/* Room for 22 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1996,7 +1996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx);
+	if (!(gfp_mask & __GFP_NO_KSWAPD))
+		wake_all_kswapd(order, zonelist, high_zoneidx);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 29 of 66] don't alloc harder for gfp nomemalloc even if nowait
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

It's not worth throwing away the precious reserved free memory pool for allocations
that can fail gracefully (either through mempool or because they're transhuge
allocations later falling back to 4k allocations).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1941,7 +1941,12 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
 	if (!wait) {
-		alloc_flags |= ALLOC_HARDER;
+		/*
+		 * Not worth trying to allocate harder for
+		 * __GFP_NOMEMALLOC even if it can't schedule.
+		 */
+		if  (!(gfp_mask & __GFP_NOMEMALLOC))
+			alloc_flags |= ALLOC_HARDER;
 		/*
 		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 30 of 66] transparent hugepage core
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated onto hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
   kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
   non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal failures) may be ok for
   1 machine with 1 database with 1 database cache with 1 database cache size
   known at boot time. It's definitely not feasible with a virtualization
   hypervisor usage like RHEV-H that runs an unknown number of virtual machines
   with an unknown size of each virtual machine with an unknown amount of
   pagecache that could be potentially useful in the host for guest not using
   O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
the linux guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...).  Even more important is that the
tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
paging or no-virtualization scenario. So maximizing the amount of virtual
memory cached by the TLB pays off significantly more with NPT/EPT than without
(even if there would be no significant speedup in the tlb-miss runtime).
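
(For reference, those numbers are just the length of the two-dimensional
NPT/EPT pagetable walk: with a 4-level guest and a 4-level host walk,
each guest paging-structure access costs a nested walk plus the access
itself, i.e. 4*(4+1)+4 = 24 memory accesses; a 2M host mapping removes
one level from every nested walk, 4*(3+1)+3 = 19; and 2M pages in both
guest and host give 3*(3+1)+3 = 15.)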

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
way possible. We want hugepages and hugetlb to be used in a way that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).

The most important design choice is: always fall back to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...

Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page could return -ENOMEM (instead of the current void)
we'd need to roll back the mprotect from the middle of it (ideally
including undoing the split_vma) which would be a big change and in
the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
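
To make the "plug a split_huge_page* one-liner where needed" point
concrete, here is a sketch (illustrative function names, not a hunk of
this patch) of how a pmd-level walker that only understands 4k ptes
stays correct: it splits any huge pmd it happens to meet and then
proceeds exactly as before.

	static int walk_pmd_range_sketch(struct mm_struct *mm, pmd_t *pmd,
					 unsigned long addr, unsigned long end)
	{
		split_huge_page_pmd(mm, pmd);	/* cannot fail by design */
		if (pmd_none_or_clear_bad(pmd))
			return 0;
		/* ... walk the regular 4k ptes as before ... */
		return 0;
	}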

The collapsing and madvise(MADV_HUGEPAGE) part will remain separate
and incremental, and it'll just be a "harmless" addition later if this
initial part is agreed upon. It should also be noted that locking-wise,
replacing regular pages with hugepages is going to be very easy
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) and we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
minimal sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and madvise(MADV_HUGEPAGE) will
work similarly to madvise(MADV_MERGEABLE).
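
From userland the hint will look much like the KSM one. A minimal
sketch (MADV_HUGEPAGE is only introduced by a later patch in this
series, so the numeric value below is an assumption taken from the
generic mman header):

	#include <sys/mman.h>
	#ifndef MADV_HUGEPAGE
	#define MADV_HUGEPAGE 14	/* assumed value, see later patches */
	#endif

	/* ask the kernel to back this anonymous range with hugepages,
	 * analogous to madvise(..., MADV_MERGEABLE) for KSM */
	static int hint_hugepage_sketch(void *start, size_t len)
	{
		return madvise(start, len, MADV_HUGEPAGE);
	}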

The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to three values "always", "madvise", "never" which
mean respectively that hugepages are always used, or only inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
allocation should defrag memory aggressively "always", only inside "madvise"
regions, or "never".

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would always result in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the first_page is valid and is still a
PageHead should be safe, but it surely needs review from an SMP race
point of view. In short there is no existing way to serialize the
O_DIRECT final put_page against split_huge_page_refcount, so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by
the time gup_fast returns so...). And I didn't want to impact all
gup/gup_fast users for now; maybe if we change the gup interface
substantially we can avoid this locking. I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.

If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage so we can't avoid it to
solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.

Swap and oom work fine (well, just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, that didn't have a range check but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down and
waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch the copy_huge_page (and
clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache trashing with a full prefault
logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
patches that fully implemented prefault) but I concluded they're not
worth it and they add a huge additional complexity and they remove all tlb
benefits until the full hugepage has been faulted in, to save a little bit of
memory and some cache during app startup, but they still don't improve
substantially the cache-trashing during startup if the prefault happens in >4k
chunks.  One reason is that those 4k pte entries copied are still mapped on a
perfectly cache-colored hugepage, so the trashing is the worst one can generate
in those copies (cow of 4k page copies aren't so well colored so they trash
less, but again this results in software running faster after the page fault).
Those prefault patches allowed things like a pte where post-cow pages were
local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
the middle of some hugepage mapped read-only. If it doesn't payoff
substantially with todays hardware it will payoff even less in the future with
larger l2 caches, and the prefault logic would bloat the VM a lot. On embedded
systems, transparent_hugepage can be disabled during boot with sysfs or with
the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
ensure not a single hugepage is allocated at boot time. It is simple enough to
just disable transparent hugepage globally and let transparent hugepages be
allocated selectively by applications in the MADV_HUGEPAGE region (both at page
fault time, and if enabled with the collapse_huge_page too through the kernel
daemon).

This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like power
have certain tlb limits that prevent mixing different page sizes in the same
regions so they will not fit in this framework that requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen to
match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
sizes like 1GByte that cannot be hoped to be found not fragmented after a
certain system uptime and that would be very expensive to defragment with
relocation, so requiring reservation. hugetlbfs is the "reservation way", the
point of transparent hugepages is not to have any reservation at all and
maximizing the use of cache and hugepages at all times automatically.

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %Lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
* * *
adapt to mm_counter in -mm

From: Andrea Arcangeli <aarcange@redhat.com>

The interface changed slightly.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
* * *
transparent hugepage bootparam

From: Andrea Arcangeli <aarcange@redhat.com>

Allow transparent_hugepage=always|never|madvise at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -108,6 +108,9 @@ struct vm_area_struct;
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
+#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
+			 __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,126 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+				      struct vm_area_struct *vma,
+				      unsigned long address, pmd_t *pmd,
+				      unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+	PAGE_CHECK_ADDRESS_PMD_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+				     struct mm_struct *mm,
+				     unsigned long address,
+				     enum page_check_address_pmd_flag flag);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#define transparent_hugepage_enabled(__vma)				\
+	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
+	 (transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma)				\
+	((transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
+	 (transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow()				\
+	(transparent_hugepage_flags &					\
+	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern int split_huge_page(struct page *page);
+extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
+#define split_huge_page_pmd(__mm, __pmd)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		if (unlikely(pmd_trans_huge(*____pmd)))			\
+			__split_huge_page_pmd(__mm, ____pmd);		\
+	}  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		spin_unlock_wait(&(__anon_vma)->root->lock);		\
+		/*							\
+		 * spin_unlock_wait() is just a loop in C and so the	\
+		 * CPU can reorder anything around it.			\
+		 */							\
+		smp_mb();						\
+		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
+		       pmd_trans_huge(*____pmd));			\
+	} while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+	VM_BUG_ON(PageTail(page));
+	return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
+#define HPAGE_PMD_MASK ({ BUG(); 0; })
+#define HPAGE_PMD_SIZE ({ BUG(); 0; })
+
+#define transparent_hugepage_enabled(__vma) 0
+
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_pmd(__mm, __pmd)	\
+	do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)	\
+	do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
+#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -240,6 +243,7 @@ struct inode;
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
+#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
 }
 
 static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+		       struct list_head *head)
+{
+	list_add(&page->lru, head);
+	__inc_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	list_add(&page->lru, &zone->lru[l].list);
-	__inc_zone_state(zone, NR_LRU_BASE + l);
-	mem_cgroup_add_lru_list(page, l);
+	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
 }
 
 static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+			      struct page *page, struct page *page_tail);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,899 @@
+/*
+ *  Copyright (C) 2009  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+	(1<<TRANSPARENT_HUGEPAGE_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag enabled,
+				enum transparent_hugepage_flag req_madv)
+{
+	if (test_bit(enabled, &transparent_hugepage_flags)) {
+		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+		return sprintf(buf, "[always] madvise never\n");
+	} else if (test_bit(req_madv, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag enabled,
+				 enum transparent_hugepage_flag req_madv)
+{
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		set_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		set_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_FLAG,
+				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+static ssize_t single_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag flag)
+{
+	if (test_bit(flag, &transparent_hugepage_flags))
+		return sprintf(buf, "[yes] no\n");
+	else
+		return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag flag)
+{
+	if (!memcmp("yes", buf,
+		    min(sizeof("yes")-1, count))) {
+		set_bit(flag, &transparent_hugepage_flags);
+	} else if (!memcmp("no", buf,
+			   min(sizeof("no")-1, count))) {
+		clear_bit(flag, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
+ * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
+ * memory just to allocate one more hugepage.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+			   struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+	__ATTR(defrag, 0644, defrag_show, defrag_store);
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t debug_cow_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+	&enabled_attr.attr,
+	&defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&debug_cow_attr.attr,
+#endif
+	NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+	.attrs = hugepage_attr,
+	.name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init hugepage_init(void)
+{
+#ifdef CONFIG_SYSFS
+	int err;
+
+	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	if (err)
+		printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+	return 0;
+}
+module_init(hugepage_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+	if (!strcmp(str, "always")) {
+		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			&transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+		ret = 1;
+	} else if (!strcmp(str, "madvise")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			  &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			&transparent_hugepage_flags);
+		ret = 1;
+	} else if (!strcmp(str, "never")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		printk(KERN_WARNING
+		       "transparent_hugepage= cannot parse, ignored\n");
+	return ret;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+				 struct mm_struct *mm)
+{
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	if (!mm->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+	mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long haddr, pmd_t *pmd,
+					struct page *page)
+{
+	int ret = 0;
+	pgtable_t pgtable;
+
+	VM_BUG_ON(!PageCompound(page));
+	pgtable = pte_alloc_one(mm, haddr);
+	if (unlikely(!pgtable)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	__SetPageUptodate(page);
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_none(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		put_page(page);
+		pte_free(mm, pgtable);
+	} else {
+		pmd_t entry;
+		entry = mk_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		/*
+		 * The spinlocking to take the lru_lock inside
+		 * page_add_new_anon_rmap() acts as a full memory
+		 * barrier to be sure clear_huge_page writes become
+		 * visible after the set_pmd_at() write.
+		 */
+		page_add_new_anon_rmap(page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		prepare_pmd_huge_pte(pgtable, mm);
+		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
+			   HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	pte_t *pte;
+
+	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+		page = alloc_hugepage(transparent_hugepage_defrag(vma));
+		if (unlikely(!page))
+			goto out;
+
+		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+	}
+out:
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if an huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
+		return VM_FAULT_OOM;
+	/* if an huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		return 0;
+	/*
+	 * A regular pmd is established and it can't morph into a huge pmd
+	 * from under us anymore at this point because we hold the mmap_sem
+	 * read mode and khugepaged takes it in write mode. So now it's
+	 * safe to run pte_offset_map().
+	 */
+	pte = pte_offset_map(pmd, address);
+	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct page *src_page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	ret = -ENOMEM;
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
+
+	spin_lock(&dst_mm->page_table_lock);
+	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+	ret = -EAGAIN;
+	pmd = *src_pmd;
+	if (unlikely(!pmd_trans_huge(pmd))) {
+		pte_free(dst_mm, pgtable);
+		goto out_unlock;
+	}
+	if (unlikely(pmd_trans_splitting(pmd))) {
+		/* split huge page running from under us */
+		spin_unlock(&src_mm->page_table_lock);
+		spin_unlock(&dst_mm->page_table_lock);
+		pte_free(dst_mm, pgtable);
+
+		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		goto out;
+	}
+	src_page = pmd_page(pmd);
+	VM_BUG_ON(!PageHead(src_page));
+	get_page(src_page);
+	page_dup_rmap(src_page);
+	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+
+	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	prepare_pmd_huge_pte(pgtable, dst_mm);
+
+	ret = 0;
+out_unlock:
+	spin_unlock(&src_mm->page_table_lock);
+	spin_unlock(&dst_mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+	pgtable_t pgtable;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	pgtable = mm->pmd_huge_pte;
+	if (list_empty(&pgtable->lru))
+		mm->pmd_huge_pte = NULL;
+	else {
+		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+					      struct page, lru);
+		list_del(&pgtable->lru);
+	}
+	return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long address,
+					pmd_t *pmd, pmd_t orig_pmd,
+					struct page *page,
+					unsigned long haddr)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int ret = 0, i;
+	struct page **pages;
+
+	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+			GFP_KERNEL);
+	if (unlikely(!pages)) {
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+					  vma, address);
+		if (unlikely(!pages[i])) {
+			while (--i >= 0)
+				put_page(pages[i]);
+			kfree(pages);
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		copy_user_highpage(pages[i], page + i,
+				   haddr + PAGE_SHIFT*i, vma);
+		__SetPageUptodate(pages[i]);
+		cond_resched();
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_pages;
+	VM_BUG_ON(!PageHead(page));
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = mk_pte(pages[i], vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_add_new_anon_rmap(pages[i], vma, haddr);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	kfree(pages);
+
+	mm->nr_ptes++;
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	page_remove_rmap(page);
+	spin_unlock(&mm->page_table_lock);
+
+	ret |= VM_FAULT_WRITE;
+	put_page(page);
+
+out:
+	return ret;
+
+out_free_pages:
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		put_page(pages[i]);
+	kfree(pages);
+	goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	int ret = 0;
+	struct page *page, *new_page;
+	unsigned long haddr;
+
+	VM_BUG_ON(!vma->anon_vma);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_unlock;
+
+	page = pmd_page(orig_pmd);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	haddr = address & HPAGE_PMD_MASK;
+	if (page_mapcount(page) == 1) {
+		pmd_t entry;
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache(vma, address, entry);
+		ret |= VM_FAULT_WRITE;
+		goto out_unlock;
+	}
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	if (transparent_hugepage_enabled(vma) &&
+	    !transparent_hugepage_debug_cow())
+		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+	else
+		new_page = NULL;
+
+	if (unlikely(!new_page)) {
+		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+						   pmd, orig_pmd, page, haddr);
+		put_page(page);
+		goto out;
+	}
+
+	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+	__SetPageUptodate(new_page);
+
+	spin_lock(&mm->page_table_lock);
+	put_page(page);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		put_page(new_page);
+	else {
+		pmd_t entry;
+		VM_BUG_ON(!PageHead(page));
+		entry = mk_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		page_add_new_anon_rmap(new_page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache(vma, address, entry);
+		page_remove_rmap(page);
+		put_page(page);
+		ret |= VM_FAULT_WRITE;
+	}
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+				   unsigned long addr,
+				   pmd_t *pmd,
+				   unsigned int flags)
+{
+	struct page *page = NULL;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+		goto out;
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!PageHead(page));
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+		/*
+		 * We should set the dirty bit only for FOLL_WRITE but
+		 * for now the dirty bit in the pmd is meaningless.
+		 * And if the dirty bit will become meaningful and
+		 * we'll only set it with FOLL_WRITE, an atomic
+		 * set_bit will be required on the pmd to set the
+		 * young bit, instead of the current set_pmd_at.
+		 */
+		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+	}
+	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON(!PageCompound(page));
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd)
+{
+	int ret = 0;
+
+	spin_lock(&tlb->mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&tlb->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma,
+					     pmd);
+		} else {
+			struct page *page;
+			pgtable_t pgtable;
+			pgtable = get_pmd_huge_pte(tlb->mm);
+			page = pmd_page(*pmd);
+			pmd_clear(pmd);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			VM_BUG_ON(!PageHead(page));
+			spin_unlock(&tlb->mm->page_table_lock);
+			tlb_remove_page(tlb, page);
+			pte_free(tlb->mm, pgtable);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&tlb->mm->page_table_lock);
+
+	return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+			      struct mm_struct *mm,
+			      unsigned long address,
+			      enum page_check_address_pmd_flag flag)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, *ret = NULL;
+
+	if (address & ~HPAGE_PMD_MASK)
+		goto out;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_page(*pmd) != page)
+		goto out;
+	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+		  pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd)) {
+		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+			  !pmd_trans_splitting(*pmd));
+		ret = pmd;
+	}
+out:
+	return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+				       struct vm_area_struct *vma,
+				       unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+	if (pmd) {
+		/*
+		 * We can't temporarily set the pmd to null in order
+		 * to split it, the pmd must remain marked huge at all
+		 * times or the VM won't take the pmd_trans_huge paths
+		 * and it won't wait on the anon_vma->root->lock to
+		 * serialize against split_huge_page*.
+		 */
+		pmdp_splitting_flush_notify(vma, address, pmd);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+	int i;
+	unsigned long head_index = page->index;
+	struct zone *zone = page_zone(page);
+
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irq(&zone->lru_lock);
+	compound_lock(page);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *page_tail = page + i;
+
+		/* tail_page->_count cannot change */
+		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+		BUG_ON(page_count(page) <= 0);
+		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+		/* after clearing PageTail the gup refcount can be released */
+		smp_mb();
+
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		page_tail->flags |= (page->flags &
+				     ((1L << PG_referenced) |
+				      (1L << PG_swapbacked) |
+				      (1L << PG_mlocked) |
+				      (1L << PG_uptodate)));
+		page_tail->flags |= (1L << PG_dirty);
+
+		/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();
+
+		/*
+		 * __split_huge_page_splitting() already set the
+		 * splitting bit in all pmd that could map this
+		 * hugepage, that will ensure no CPU can alter the
+		 * mapcount on the head page. The mapcount is only
+		 * accounted in the head page and it has to be
+		 * transferred to all tail pages in the below code. So
+		 * for this code to be safe, the mapcount can't change
+		 * during the split. But that doesn't mean userland can't
+		 * keep changing and reading the page contents while
+		 * we transfer the mapcount, so the pmd splitting
+		 * status is achieved setting a reserved bit in the
+		 * pmd, not by clearing the present bit.
+		*/
+		BUG_ON(page_mapcount(page_tail));
+		page_tail->_mapcount = page->_mapcount;
+
+		BUG_ON(page_tail->mapping);
+		page_tail->mapping = page->mapping;
+
+		page_tail->index = ++head_index;
+
+		BUG_ON(!PageAnon(page_tail));
+		BUG_ON(!PageUptodate(page_tail));
+		BUG_ON(!PageDirty(page_tail));
+		BUG_ON(!PageSwapBacked(page_tail));
+
+		lru_add_page_tail(zone, page, page_tail);
+	}
+
+	ClearPageCompound(page);
+	compound_unlock(page);
+	spin_unlock_irq(&zone->lru_lock);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *page_tail = page + i;
+		BUG_ON(page_count(page_tail) <= 0);
+		/*
+		 * Tail pages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(page_tail);
+	}
+
+	/*
+	 * Only the head page (now become a regular page) is required
+	 * to be pinned by the caller.
+	 */
+	BUG_ON(page_count(page) <= 0);
+}
+
+static int __split_huge_page_map(struct page *page,
+				 struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	int ret = 0, i;
+	pgtable_t pgtable;
+	unsigned long haddr;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+	if (pmd) {
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			BUG_ON(PageCompound(page+i));
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			else
+				BUG_ON(page_mapcount(page) != 1);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+			pte = pte_offset_map(&_pmd, haddr);
+			BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		/*
+		 * Up to this point the pmd is present and huge and
+		 * userland has the whole access to the hugepage
+		 * during the split (which happens in place). If we
+		 * overwrite the pmd with the not-huge version
+		 * pointing to the pte here (which of course we could
+		 * if all CPUs were bug free), userland could trigger
+		 * a small page size TLB miss on the small sized TLB
+		 * while the hugepage TLB entry is still established
+		 * in the huge TLB. Some CPUs don't like that. See
+		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
+		 * Erratum 383 on page 93. Intel should be safe but
+		 * also warns that it's only safe if the permission
+		 * and cache attributes of the two entries loaded in
+		 * the two TLBs are identical (which should be the case
+		 * here). But it is generally safer to never allow
+		 * small and huge TLB entries for the same virtual
+		 * address to be loaded simultaneously. So instead of
+		 * doing "pmd_populate(); flush_tlb_range();" we first
+		 * mark the current pmd notpresent (atomically because
+		 * here the pmd_trans_huge and pmd_trans_splitting
+		 * must remain set at all times on the pmd until the
+		 * split is complete for this pmd), then we flush the
+		 * SMP TLB and finally we write the non-huge version
+		 * of the pmd entry with pmd_populate.
+		 */
+		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		pmd_populate(mm, pmd, pgtable);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+/* must be called with anon_vma->root->lock hold */
+static void __split_huge_page(struct page *page,
+			      struct anon_vma *anon_vma)
+{
+	int mapcount, mapcount2;
+	struct anon_vma_chain *avc;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mapcount = 0;
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+	BUG_ON(mapcount != mapcount2);
+}
+
+int split_huge_page(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	int ret = 1;
+
+	BUG_ON(!PageAnon(page));
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	ret = 0;
+	if (!PageCompound(page))
+		goto out_unlock;
+
+	BUG_ON(!PageSwapBacked(page));
+	__split_huge_page(page, anon_vma);
+
+	BUG_ON(PageCompound(page));
+out_unlock:
+	page_unlock_anon_vma(anon_vma);
+out:
+	return ret;
+}
+
+void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+{
+	struct page *page;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		return;
+	}
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!page_count(page));
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	split_huge_page(page);
+
+	put_page(page);
+	BUG_ON(pmd_trans_huge(*pmd));
+}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -726,9 +726,9 @@ out_set_pte:
 	return 0;
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct 
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*src_pmd)) {
+			int err;
+			err = copy_huge_pmd(dst_mm, src_mm,
+					    dst_pmd, src_pmd, addr, vma);
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*pmd)) {
+			if (next-addr != HPAGE_PMD_SIZE)
+				split_huge_page_pmd(vma->vm_mm, pmd);
+			else if (zap_huge_pmd(tlb, vma, pmd)) {
+				(*zap_work)--;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd)) {
 			(*zap_work)--;
 			continue;
@@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (likely(pmd_trans_huge(*pmd))) {
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				page = follow_trans_huge_pmd(mm, address,
+							     pmd, flags);
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+		/* fall through */
+	}
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static inline int handle_pte_fault(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+		     struct vm_area_struct *vma, unsigned long address,
+		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
 	pte_t entry;
 	spinlock_t *ptl;
@@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, vma, pmd, address);
-	if (!pte)
+	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		if (!vma->vm_ops)
+			return do_huge_pmd_anonymous_page(mm, vma, address,
+							  pmd, flags);
+	} else {
+		pmd_t orig_pmd = *pmd;
+		barrier();
+		if (pmd_trans_huge(orig_pmd)) {
+			if (flags & FAULT_FLAG_WRITE &&
+			    !pmd_write(orig_pmd) &&
+			    !pmd_trans_splitting(orig_pmd))
+				return do_huge_pmd_wp_page(mm, vma, address,
+							   pmd, orig_pmd);
+			return 0;
+		}
+	}
+
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if an huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
 		return VM_FAULT_OOM;
+	/* if an huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		return 0;
+	/*
+	 * A regular pmd is established and it can't morph into a huge pmd
+	 * from under us anymore at this point because we hold the mmap_sem
+	 * read mode and khugepaged takes it in write mode. So now it's
+	 * safe to run pte_offset_map().
+	 */
+	pte = pte_offset_map(pmd, address);
 
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
  * Returns virtual address or -EFAULT if page's index/offset is not
  * within the range mapped the @vma.
  */
-static inline unsigned long
+inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page 
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return NULL;
+	if (pmd_trans_huge(*pmd))
+		return NULL;
 
 	pte = pte_offset_map(pmd, address);
 	/* Make a quick check before getting the lock */
@@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
 			unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int referenced = 0;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
 	/*
 	 * Don't want to elevate referenced for mlocked page that gets this far,
 	 * in order that it progresses to try_to_unmap and is moved to the
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 0;	/* break early from loop */
 		*vm_flags |= VM_LOCKED;
-		goto out_unmap;
-	}
-
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!VM_SequentialReadHint(vma)))
-			referenced++;
+		goto out;
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
-out_unmap:
+	if (unlikely(PageTransHuge(page))) {
+		pmd_t *pmd;
+
+		spin_lock(&mm->page_table_lock);
+		pmd = page_check_address_pmd(page, mm, address,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG);
+		if (pmd && !pmd_trans_splitting(*pmd) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd))
+			referenced++;
+		spin_unlock(&mm->page_table_lock);
+	} else {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		pte = page_check_address(page, mm, address, &ptl, 0);
+		if (!pte)
+			goto out;
+
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!VM_SequentialReadHint(vma)))
+				referenced++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+
 	(*mapcount)--;
-	pte_unmap_unlock(pte, ptl);
 
 	if (referenced)
 		*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
 
 EXPORT_SYMBOL(__pagevec_release);
 
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone* zone,
+		       struct page *page, struct page *page_tail)
+{
+	int active;
+	enum lru_list lru;
+	const int file = 0;
+	struct list_head *head;
+
+	VM_BUG_ON(!PageHead(page));
+	VM_BUG_ON(PageCompound(page_tail));
+	VM_BUG_ON(PageLRU(page_tail));
+	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+	SetPageLRU(page_tail);
+
+	if (page_evictable(page_tail, NULL)) {
+		if (PageActive(page)) {
+			SetPageActive(page_tail);
+			active = 1;
+			lru = LRU_ACTIVE_ANON;
+		} else {
+			active = 0;
+			lru = LRU_INACTIVE_ANON;
+		}
+		update_page_reclaim_stat(zone, page_tail, file, active);
+		if (likely(PageLRU(page)))
+			head = page->lru.prev;
+		else
+			head = &zone->lru[lru].list;
+		__add_page_to_lru_list(zone, page_tail, lru, head);
+	} else {
+		SetPageUnevictable(page_tail);
+		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+	}
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 30 of 66] transparent hugepage core
@ 2010-11-03 15:28   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Lately I've been working to make KVM use hugepages transparently
without the usual restrictions of hugetlbfs. Some of the restrictions
I'd like to see removed:

1) hugepages have to be swappable or the guest physical memory remains
   locked in RAM and can't be paged out to swap

2) if a hugepage allocation fails, regular pages should be allocated
   instead and mixed in the same vma without any failure and without
   userland noticing

3) if some task quits and more hugepages become available in the
   buddy, guest physical memory backed by regular pages should be
   relocated on hugepages automatically in regions under
   madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
   kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
   non-empty)

4) avoidance of reservation and maximization of use of hugepages whenever
   possible. Reservation (needed to avoid runtime fatal failures) may be ok for
   1 machine with 1 database with 1 database cache with 1 database cache size
   known at boot time. It's definitely not feasible with a virtualization
   hypervisor usage like RHEV-H that runs an unknown number of virtual machines
   with an unknown size of each virtual machine with an unknown amount of
   pagecache that could be potentially useful in the host for guests not using
   O_DIRECT (aka cache=off).

hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization, because
with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
case only the hypervisor uses transparent hugepages, and they decrease the
tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor
and the linux guest use this patch (though the guest will limit the additional
speedup to anonymous regions only for now...).  Even more important is that
the tlb miss handler is much slower on an NPT/EPT guest than in a regular
shadow paging or no-virtualization scenario. So maximizing the amount of
virtual memory cached by the TLB pays off significantly more with NPT/EPT
than without (even if there would be no significant speedup in the tlb-miss
runtime).

The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on regular
anonymous vmas. This is what this patch tries to achieve in the least intrusive
possible way. We want hugepages and hugetlb to be used in a way so that all
applications can benefit without changes (as usual we leverage the KVM
virtualization design: by improving the Linux VM at large, KVM gets the
performance boost too).

The most important design choice is: always fall back to 4k allocation
if the hugepage allocation fails! This is the _very_ opposite of some
large pagecache patches that failed with -EIO back then if a 64k (or
similar) allocation failed...
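
For reference, the fallback boils down to the following simplified excerpt of
the do_huge_pmd_anonymous_page() path this patch adds (anon_vma and re-check
details trimmed, see the full function in the diff):

	/* simplified excerpt: the hugepage allocation is best effort only */
	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
		page = alloc_hugepage(transparent_hugepage_defrag(vma));
		if (page)
			return __do_huge_pmd_anonymous_page(mm, vma, haddr,
							    pmd, page);
	}
	/* fallback: map the address with regular 4k ptes instead of failing */
	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
		return VM_FAULT_OOM;
	pte = pte_offset_map(pmd, address);
	return handle_pte_fault(mm, vma, address, pte, pmd, flags);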

The second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split a
hugepage into 512 regular pages, and it has to be done with an
operation that can't fail. This way the reliability of the swapping
isn't decreased (no need to allocate memory when we are short on
memory to swap) and it's trivial to plug a split_huge_page* one-liner
where needed without polluting the VM. Over time we can teach
mprotect, mremap and friends to handle pmd_trans_huge natively without
calling split_huge_page*. The fact it can't fail isn't just for swap:
if split_huge_page could return -ENOMEM (instead of the current void
return) we'd need to roll back the mprotect from the middle of it
(ideally including undoing the split_vma), which would be a big change
and in the very wrong direction (it'd likely be simpler not to call
split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short, the
very value of split_huge_page is that it can't fail.
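
As a sketch only (walk_ptes() below is a made-up example, not part of the
patch): a pte walker written for regular pages copes with huge pmds with
exactly one extra line, the same pattern the mm/memory.c hunks use, because
split_huge_page_pmd() is a no-op on regular pmds and cannot fail on huge ones:

static void walk_ptes(struct vm_area_struct *vma, pmd_t *pmd,
		      unsigned long addr, unsigned long end)
{
	pte_t *pte;

	split_huge_page_pmd(vma->vm_mm, pmd);	/* the one-liner */
	if (pmd_none_or_clear_bad(pmd))
		return;
	pte = pte_offset_map(pmd, addr);
	do {
		/* ... operate on *pte exactly as before ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap(pte - 1);
}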

The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
and incremental and it'll just be a "harmless" addition later if this
initial part is agreed upon. It should also be noted that locking-wise
replacing regular pages with hugepages is going to be very easy
compared to what I'm doing below in split_huge_page, as it will only
happen when page_count(page) matches page_mapcount(page) if we can
take the PG_lock and mmap_sem in write mode. collapse_huge_page will
be a "best effort" that (unlike split_huge_page) can fail at the
first sign of trouble and we can try again later. collapse_huge_page
will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
work similarly to madvise(MADV_MERGEABLE).

The default I like is that transparent hugepages are used at page fault time.
This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
control knob can be set to three values "always", "madvise", "never", which
mean respectively that hugepages are always used, only used inside
madvise(MADV_HUGEPAGE) regions, or never used.
/sys/kernel/mm/transparent_hugepage/defrag instead controls whether the
hugepage allocation should defrag memory aggressively "always", only inside
"madvise" regions, or "never".
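
To make the "madvise" policy concrete, here is a minimal userland sketch
(illustrative only, not part of the patch; it assumes the MADV_HUGEPAGE
definition that the mman.h changes elsewhere in this series add, with 14 as
the assumed generic value):

#include <sys/mman.h>
#include <stdlib.h>
#include <string.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed generic value from this series */
#endif

#define LEN (64UL*1024*1024)

int main(void)
{
	void *p;

	/* align to 2M so whole pmds can map the region */
	if (posix_memalign(&p, 2UL*1024*1024, LEN))
		return 1;
	/* advisory: tells the kernel this region prefers hugepages */
	madvise(p, LEN, MADV_HUGEPAGE);
	memset(p, 0, LEN);	/* faults here may now be served by hugepages */
	return 0;
}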

The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point
of view. In short there is currently no existing way to serialize the
O_DIRECT final put_page against split_huge_page_refcount so I had to
invent a new one (O_DIRECT loses knowledge of the mapping status by
the time gup_fast returns so...). And I didn't want to impact all
gup/gup_fast users for now, maybe if we change the gup interface
substantially we can avoid this locking, I admit I didn't think too
much about it because changing the gup unpinning interface would be
invasive.

If we ignored O_DIRECT we could stick to the existing compound
refcounting code, by simply adding a
get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
set we'd just BUG_ON if nobody registered itself in the current task
mmu notifier list yet). But O_DIRECT is fundamental for decent
performance of virtualized I/O on fast storage so we can't avoid it to
solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.

Swap and oom work fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young
bit on the pmd, which didn't have a range check, but I think KVM will be
fine because the whole point of hugepages is that EPT/NPT will also
use a huge pmd when they notice gup returns pages with PageCompound set,
so they won't care about a range and there's just the pmd young bit to
check in that case.

NOTE: in some cases, if the L2 cache is small, this may slow things down and
waste memory during COWs because 4M of memory are accessed in a single
fault instead of 8k (the payoff is that after COW the program can run
faster). So we might want to switch the copy_huge_page (and
clear_huge_page too) to non-temporal stores. I also extensively
researched ways to avoid this cache thrashing with a full prefault
logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
patches that fully implemented prefault) but I concluded they're not
worth it: they add huge additional complexity and they remove all tlb
benefits until the full hugepage has been faulted in, to save a little bit of
memory and some cache during app startup, but they still don't substantially
improve the cache thrashing during startup if the prefault happens in >4k
chunks.  One reason is that those 4k pte entries copied are still mapped on a
perfectly cache-colored hugepage, so the thrashing is the worst one can
generate in those copies (cow of 4k page copies aren't so well colored so
they thrash less, but again this results in software running faster after the
page fault).  Those prefault patches allowed things like a pte where post-cow
pages were local 4k regular anon pages and the not-yet-cowed pte entries were
pointing in the middle of some hugepage mapped read-only. If it doesn't pay
off substantially with today's hardware it will pay off even less in the
future with larger l2 caches, and the prefault logic would bloat the VM a lot.

For embedded use, transparent_hugepage can be disabled at runtime with sysfs
or at boot with the commandline parameter transparent_hugepage=never (or
transparent_hugepage=madvise to restrict hugepages to madvise regions), which
will ensure not a single hugepage is allocated at boot time. It is simple
enough to just disable transparent hugepages globally and let them be
allocated selectively by applications in MADV_HUGEPAGE regions (both at page
fault time, and if enabled through the kernel daemon with collapse_huge_page
too).

This patch supports only hugepages mapped in the pmd; archs that have
smaller hugepages will not fit in this patch alone. Also some archs like power
have certain tlb limits that prevent mixing different page sizes in the same
regions, so they will not fit in this framework, which requires "graceful
fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
hugetlbfs remains a perfect fit for those because its software limits happen to
match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
sizes like 1GByte that cannot realistically be found unfragmented after a
certain system uptime and that would be very expensive to defragment with
relocation, so they require reservation. hugetlbfs is the "reservation way";
the point of transparent hugepages is not to have any reservation at all and
to maximize the use of cache and hugepages at all times automatically.

Some performance results:

vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988

============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (3UL*1024*1024*1024)

int main()
{
	char *p = malloc(SIZE), *p2;
	struct timeval before, after;

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset page fault %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	memset(p, 0, SIZE);
	gettimeofday(&after, NULL);
	printf("memset second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	gettimeofday(&before, NULL);
	for (p2 = p; p2 < p+SIZE; p2 += 4096)
		*p2 = 0;
	gettimeofday(&after, NULL);
	printf("random access second tlb miss %lu\n",
	       (after.tv_sec-before.tv_sec)*1000000UL +
	       after.tv_usec-before.tv_usec);

	return 0;
}
============

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
* * *
adapt to mm_counter in -mm

From: Andrea Arcangeli <aarcange@redhat.com>

The interface changed slightly.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---
* * *
transparent hugepage bootparam

From: Andrea Arcangeli <aarcange@redhat.com>

Allow transparent_hugepage=always|never|madvise at boot.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
 	return pmd_set_flags(pmd, _PAGE_RW);
 }
 
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -108,6 +108,9 @@ struct vm_area_struct;
 				 __GFP_HARDWALL | __GFP_HIGHMEM | \
 				 __GFP_MOVABLE)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS)
+#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
+			 __GFP_NO_KSWAPD)
 
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
new file mode 100644
--- /dev/null
+++ b/include/linux/huge_mm.h
@@ -0,0 +1,126 @@
+#ifndef _LINUX_HUGE_MM_H
+#define _LINUX_HUGE_MM_H
+
+extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
+				      struct vm_area_struct *vma,
+				      unsigned long address, pmd_t *pmd,
+				      unsigned int flags);
+extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+			 struct vm_area_struct *vma);
+extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       pmd_t orig_pmd);
+extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
+extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+					  unsigned long addr,
+					  pmd_t *pmd,
+					  unsigned int flags);
+extern int zap_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd);
+
+enum transparent_hugepage_flag {
+	TRANSPARENT_HUGEPAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+#ifdef CONFIG_DEBUG_VM
+	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
+#endif
+};
+
+enum page_check_address_pmd_flag {
+	PAGE_CHECK_ADDRESS_PMD_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
+	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
+};
+extern pmd_t *page_check_address_pmd(struct page *page,
+				     struct mm_struct *mm,
+				     unsigned long address,
+				     enum page_check_address_pmd_flag flag);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define HPAGE_PMD_SHIFT HPAGE_SHIFT
+#define HPAGE_PMD_MASK HPAGE_MASK
+#define HPAGE_PMD_SIZE HPAGE_SIZE
+
+#define transparent_hugepage_enabled(__vma)				\
+	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
+	 (transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_defrag(__vma)				\
+	((transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
+	 (transparent_hugepage_flags &					\
+	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
+	  (__vma)->vm_flags & VM_HUGEPAGE))
+#ifdef CONFIG_DEBUG_VM
+#define transparent_hugepage_debug_cow()				\
+	(transparent_hugepage_flags &					\
+	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
+#else /* CONFIG_DEBUG_VM */
+#define transparent_hugepage_debug_cow() 0
+#endif /* CONFIG_DEBUG_VM */
+
+extern unsigned long transparent_hugepage_flags;
+extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+			  pmd_t *dst_pmd, pmd_t *src_pmd,
+			  struct vm_area_struct *vma,
+			  unsigned long addr, unsigned long end);
+extern int handle_pte_fault(struct mm_struct *mm,
+			    struct vm_area_struct *vma, unsigned long address,
+			    pte_t *pte, pmd_t *pmd, unsigned int flags);
+extern int split_huge_page(struct page *page);
+extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
+#define split_huge_page_pmd(__mm, __pmd)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		if (unlikely(pmd_trans_huge(*____pmd)))			\
+			__split_huge_page_pmd(__mm, ____pmd);		\
+	}  while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)				\
+	do {								\
+		pmd_t *____pmd = (__pmd);				\
+		spin_unlock_wait(&(__anon_vma)->root->lock);		\
+		/*							\
+		 * spin_unlock_wait() is just a loop in C and so the	\
+		 * CPU can reorder anything around it.			\
+		 */							\
+		smp_mb();						\
+		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
+		       pmd_trans_huge(*____pmd));			\
+	} while (0)
+#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
+#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+#if HPAGE_PMD_ORDER > MAX_ORDER
+#error "hugepages can't be allocated by the buddy allocator"
+#endif
+
+extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+static inline int PageTransHuge(struct page *page)
+{
+	VM_BUG_ON(PageTail(page));
+	return PageHead(page);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
+#define HPAGE_PMD_MASK ({ BUG(); 0; })
+#define HPAGE_PMD_SIZE ({ BUG(); 0; })
+
+#define transparent_hugepage_enabled(__vma) 0
+
+#define transparent_hugepage_flags 0UL
+static inline int split_huge_page(struct page *page)
+{
+	return 0;
+}
+#define split_huge_page_pmd(__mm, __pmd)	\
+	do { } while (0)
+#define wait_split_huge_page(__anon_vma, __pmd)	\
+	do { } while (0)
+#define PageTransHuge(page) 0
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
+#if BITS_PER_LONG > 32
+#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
+#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -240,6 +243,7 @@ struct inode;
  * files which need it (119 of them)
  */
 #include <linux/page-flags.h>
+#include <linux/huge_mm.h>
 
 /*
  * Methods to modify the page usage count.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
 }
 
 static inline void
+__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
+		       struct list_head *head)
+{
+	list_add(&page->lru, head);
+	__inc_zone_state(zone, NR_LRU_BASE + l);
+	mem_cgroup_add_lru_list(page, l);
+}
+
+static inline void
 add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
-	list_add(&page->lru, &zone->lru[l].list);
-	__inc_zone_state(zone, NR_LRU_BASE + l);
-	mem_cgroup_add_lru_list(page, l);
+	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
 }
 
 static inline void
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void __lru_cache_add(struct page *, enum lru_list lru);
 extern void lru_cache_add_lru(struct page *, enum lru_list lru);
+extern void lru_add_page_tail(struct zone* zone,
+			      struct page *page, struct page *page_tail);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
new file mode 100644
--- /dev/null
+++ b/mm/huge_memory.c
@@ -0,0 +1,899 @@
+/*
+ *  Copyright (C) 2009  Red Hat, Inc.
+ *
+ *  This work is licensed under the terms of the GNU GPL, version 2. See
+ *  the COPYING file in the top-level directory.
+ */
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+unsigned long transparent_hugepage_flags __read_mostly =
+	(1<<TRANSPARENT_HUGEPAGE_FLAG);
+
+#ifdef CONFIG_SYSFS
+static ssize_t double_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag enabled,
+				enum transparent_hugepage_flag req_madv)
+{
+	if (test_bit(enabled, &transparent_hugepage_flags)) {
+		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
+		return sprintf(buf, "[always] madvise never\n");
+	} else if (test_bit(req_madv, &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+static ssize_t double_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag enabled,
+				 enum transparent_hugepage_flag req_madv)
+{
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		set_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		set_bit(req_madv, &transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(enabled, &transparent_hugepage_flags);
+		clear_bit(req_madv, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_FLAG,
+				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+}
+static struct kobj_attribute enabled_attr =
+	__ATTR(enabled, 0644, enabled_show, enabled_store);
+
+static ssize_t single_flag_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf,
+				enum transparent_hugepage_flag flag)
+{
+	if (test_bit(flag, &transparent_hugepage_flags))
+		return sprintf(buf, "[yes] no\n");
+	else
+		return sprintf(buf, "yes [no]\n");
+}
+static ssize_t single_flag_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count,
+				 enum transparent_hugepage_flag flag)
+{
+	if (!memcmp("yes", buf,
+		    min(sizeof("yes")-1, count))) {
+		set_bit(flag, &transparent_hugepage_flags);
+	} else if (!memcmp("no", buf,
+			   min(sizeof("no")-1, count))) {
+		clear_bit(flag, &transparent_hugepage_flags);
+	} else
+		return -EINVAL;
+
+	return count;
+}
+
+/*
+ * Currently defrag only controls whether __GFP_WAIT is set. A blind
+ * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
+ * memory just to allocate one more hugepage.
+ */
+static ssize_t defrag_show(struct kobject *kobj,
+			   struct kobj_attribute *attr, char *buf)
+{
+	return double_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static ssize_t defrag_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	return double_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
+}
+static struct kobj_attribute defrag_attr =
+	__ATTR(defrag, 0644, defrag_show, defrag_store);
+
+#ifdef CONFIG_DEBUG_VM
+static ssize_t debug_cow_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static ssize_t debug_cow_store(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
+}
+static struct kobj_attribute debug_cow_attr =
+	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
+#endif /* CONFIG_DEBUG_VM */
+
+static struct attribute *hugepage_attr[] = {
+	&enabled_attr.attr,
+	&defrag_attr.attr,
+#ifdef CONFIG_DEBUG_VM
+	&debug_cow_attr.attr,
+#endif
+	NULL,
+};
+
+static struct attribute_group hugepage_attr_group = {
+	.attrs = hugepage_attr,
+	.name = "transparent_hugepage",
+};
+#endif /* CONFIG_SYSFS */
+
+static int __init hugepage_init(void)
+{
+#ifdef CONFIG_SYSFS
+	int err;
+
+	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	if (err)
+		printk(KERN_ERR "hugepage: register sysfs failed\n");
+#endif
+	return 0;
+}
+module_init(hugepage_init)
+
+static int __init setup_transparent_hugepage(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+	if (!strcmp(str, "always")) {
+		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			&transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+		ret = 1;
+	} else if (!strcmp(str, "madvise")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			  &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			&transparent_hugepage_flags);
+		ret = 1;
+	} else if (!strcmp(str, "never")) {
+		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
+			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		printk(KERN_WARNING
+		       "transparent_hugepage= cannot parse, ignored\n");
+	return ret;
+}
+__setup("transparent_hugepage=", setup_transparent_hugepage);
+
+static void prepare_pmd_huge_pte(pgtable_t pgtable,
+				 struct mm_struct *mm)
+{
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	if (!mm->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
+	mm->pmd_huge_pte = pgtable;
+}
+
+static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+	if (likely(vma->vm_flags & VM_WRITE))
+		pmd = pmd_mkwrite(pmd);
+	return pmd;
+}
+
+static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long haddr, pmd_t *pmd,
+					struct page *page)
+{
+	int ret = 0;
+	pgtable_t pgtable;
+
+	VM_BUG_ON(!PageCompound(page));
+	pgtable = pte_alloc_one(mm, haddr);
+	if (unlikely(!pgtable)) {
+		put_page(page);
+		return VM_FAULT_OOM;
+	}
+
+	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	__SetPageUptodate(page);
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_none(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		put_page(page);
+		pte_free(mm, pgtable);
+	} else {
+		pmd_t entry;
+		entry = mk_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		/*
+		 * The spinlocking to take the lru_lock inside
+		 * page_add_new_anon_rmap() acts as a full memory
+		 * barrier to be sure clear_huge_page writes become
+		 * visible after the set_pmd_at() write.
+		 */
+		page_add_new_anon_rmap(page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		prepare_pmd_huge_pte(pgtable, mm);
+		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		spin_unlock(&mm->page_table_lock);
+	}
+
+	return ret;
+}
+
+static inline struct page *alloc_hugepage(int defrag)
+{
+	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
+			   HPAGE_PMD_ORDER);
+}
+
+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			       unsigned long address, pmd_t *pmd,
+			       unsigned int flags)
+{
+	struct page *page;
+	unsigned long haddr = address & HPAGE_PMD_MASK;
+	pte_t *pte;
+
+	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
+		if (unlikely(anon_vma_prepare(vma)))
+			return VM_FAULT_OOM;
+		page = alloc_hugepage(transparent_hugepage_defrag(vma));
+		if (unlikely(!page))
+			goto out;
+
+		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
+	}
+out:
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if a huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
+		return VM_FAULT_OOM;
+	/* if a huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		return 0;
+	/*
+	 * A regular pmd is established and it can't morph into a huge pmd
+	 * from under us anymore at this point because we hold the mmap_sem
+	 * read mode and khugepaged takes it in write mode. So now it's
+	 * safe to run pte_offset_map().
+	 */
+	pte = pte_offset_map(pmd, address);
+	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		  struct vm_area_struct *vma)
+{
+	struct page *src_page;
+	pmd_t pmd;
+	pgtable_t pgtable;
+	int ret;
+
+	ret = -ENOMEM;
+	pgtable = pte_alloc_one(dst_mm, addr);
+	if (unlikely(!pgtable))
+		goto out;
+
+	spin_lock(&dst_mm->page_table_lock);
+	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
+
+	ret = -EAGAIN;
+	pmd = *src_pmd;
+	if (unlikely(!pmd_trans_huge(pmd))) {
+		pte_free(dst_mm, pgtable);
+		goto out_unlock;
+	}
+	if (unlikely(pmd_trans_splitting(pmd))) {
+		/* split huge page running from under us */
+		spin_unlock(&src_mm->page_table_lock);
+		spin_unlock(&dst_mm->page_table_lock);
+		pte_free(dst_mm, pgtable);
+
+		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+		goto out;
+	}
+	src_page = pmd_page(pmd);
+	VM_BUG_ON(!PageHead(src_page));
+	get_page(src_page);
+	page_dup_rmap(src_page);
+	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+
+	pmdp_set_wrprotect(src_mm, addr, src_pmd);
+	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	prepare_pmd_huge_pte(pgtable, dst_mm);
+
+	ret = 0;
+out_unlock:
+	spin_unlock(&src_mm->page_table_lock);
+	spin_unlock(&dst_mm->page_table_lock);
+out:
+	return ret;
+}
+
+/* no "address" argument, so this destroys page coloring on some archs */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+	pgtable_t pgtable;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	/* FIFO */
+	pgtable = mm->pmd_huge_pte;
+	if (list_empty(&pgtable->lru))
+		mm->pmd_huge_pte = NULL;
+	else {
+		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+					      struct page, lru);
+		list_del(&pgtable->lru);
+	}
+	return pgtable;
+}
+
+static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
+					struct vm_area_struct *vma,
+					unsigned long address,
+					pmd_t *pmd, pmd_t orig_pmd,
+					struct page *page,
+					unsigned long haddr)
+{
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	int ret = 0, i;
+	struct page **pages;
+
+	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
+			GFP_KERNEL);
+	if (unlikely(!pages)) {
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+					  vma, address);
+		if (unlikely(!pages[i])) {
+			while (--i >= 0)
+				put_page(pages[i]);
+			kfree(pages);
+			ret |= VM_FAULT_OOM;
+			goto out;
+		}
+	}
+
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		copy_user_highpage(pages[i], page + i,
+				   haddr + PAGE_SIZE*i, vma);
+		__SetPageUptodate(pages[i]);
+		cond_resched();
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_free_pages;
+	VM_BUG_ON(!PageHead(page));
+
+	pmdp_clear_flush_notify(vma, haddr, pmd);
+	/* leave pmd empty until pte is filled */
+
+	pgtable = get_pmd_huge_pte(mm);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+		pte_t *pte, entry;
+		entry = mk_pte(pages[i], vma->vm_page_prot);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		page_add_new_anon_rmap(pages[i], vma, haddr);
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		set_pte_at(mm, haddr, pte, entry);
+		pte_unmap(pte);
+	}
+	kfree(pages);
+
+	mm->nr_ptes++;
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+	page_remove_rmap(page);
+	spin_unlock(&mm->page_table_lock);
+
+	ret |= VM_FAULT_WRITE;
+	put_page(page);
+
+out:
+	return ret;
+
+out_free_pages:
+	spin_unlock(&mm->page_table_lock);
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		put_page(pages[i]);
+	kfree(pages);
+	goto out;
+}
+
+int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
+{
+	int ret = 0;
+	struct page *page, *new_page;
+	unsigned long haddr;
+
+	VM_BUG_ON(!vma->anon_vma);
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		goto out_unlock;
+
+	page = pmd_page(orig_pmd);
+	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+	haddr = address & HPAGE_PMD_MASK;
+	if (page_mapcount(page) == 1) {
+		pmd_t entry;
+		entry = pmd_mkyoung(orig_pmd);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
+			update_mmu_cache(vma, address, entry);
+		ret |= VM_FAULT_WRITE;
+		goto out_unlock;
+	}
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	if (transparent_hugepage_enabled(vma) &&
+	    !transparent_hugepage_debug_cow())
+		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+	else
+		new_page = NULL;
+
+	if (unlikely(!new_page)) {
+		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+						   pmd, orig_pmd, page, haddr);
+		put_page(page);
+		goto out;
+	}
+
+	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+	__SetPageUptodate(new_page);
+
+	spin_lock(&mm->page_table_lock);
+	put_page(page);
+	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+		put_page(new_page);
+	else {
+		pmd_t entry;
+		VM_BUG_ON(!PageHead(page));
+		entry = mk_pmd(new_page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkhuge(entry);
+		pmdp_clear_flush_notify(vma, haddr, pmd);
+		page_add_new_anon_rmap(new_page, vma, haddr);
+		set_pmd_at(mm, haddr, pmd, entry);
+		update_mmu_cache(vma, address, entry);
+		page_remove_rmap(page);
+		put_page(page);
+		ret |= VM_FAULT_WRITE;
+	}
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+out:
+	return ret;
+}
+
+struct page *follow_trans_huge_pmd(struct mm_struct *mm,
+				   unsigned long addr,
+				   pmd_t *pmd,
+				   unsigned int flags)
+{
+	struct page *page = NULL;
+
+	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
+
+	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+		goto out;
+
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!PageHead(page));
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+		/*
+		 * We should set the dirty bit only for FOLL_WRITE but
+		 * for now the dirty bit in the pmd is meaningless.
+		 * And if the dirty bit ever becomes meaningful and
+		 * we only set it with FOLL_WRITE, an atomic
+		 * set_bit will be required on the pmd to set the
+		 * young bit, instead of the current set_pmd_at.
+		 */
+		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
+		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
+	}
+	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+	VM_BUG_ON(!PageCompound(page));
+	if (flags & FOLL_GET)
+		get_page(page);
+
+out:
+	return page;
+}
+
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd)
+{
+	int ret = 0;
+
+	spin_lock(&tlb->mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&tlb->mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma,
+					     pmd);
+		} else {
+			struct page *page;
+			pgtable_t pgtable;
+			pgtable = get_pmd_huge_pte(tlb->mm);
+			page = pmd_page(*pmd);
+			pmd_clear(pmd);
+			page_remove_rmap(page);
+			VM_BUG_ON(page_mapcount(page) < 0);
+			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			VM_BUG_ON(!PageHead(page));
+			spin_unlock(&tlb->mm->page_table_lock);
+			tlb_remove_page(tlb, page);
+			pte_free(tlb->mm, pgtable);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&tlb->mm->page_table_lock);
+
+	return ret;
+}
+
+pmd_t *page_check_address_pmd(struct page *page,
+			      struct mm_struct *mm,
+			      unsigned long address,
+			      enum page_check_address_pmd_flag flag)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, *ret = NULL;
+
+	if (address & ~HPAGE_PMD_MASK)
+		goto out;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd))
+		goto out;
+	if (pmd_page(*pmd) != page)
+		goto out;
+	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+		  pmd_trans_splitting(*pmd));
+	if (pmd_trans_huge(*pmd)) {
+		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
+			  !pmd_trans_splitting(*pmd));
+		ret = pmd;
+	}
+out:
+	return ret;
+}
+
+static int __split_huge_page_splitting(struct page *page,
+				       struct vm_area_struct *vma,
+				       unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
+	if (pmd) {
+		/*
+		 * We can't temporarily set the pmd to null in order
+		 * to split it, the pmd must remain marked huge at all
+		 * times or the VM won't take the pmd_trans_huge paths
+		 * and it won't wait on the anon_vma->root->lock to
+		 * serialize against split_huge_page*.
+		 */
+		pmdp_splitting_flush_notify(vma, address, pmd);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+static void __split_huge_page_refcount(struct page *page)
+{
+	int i;
+	unsigned long head_index = page->index;
+	struct zone *zone = page_zone(page);
+
+	/* prevent PageLRU to go away from under us, and freeze lru stats */
+	spin_lock_irq(&zone->lru_lock);
+	compound_lock(page);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *page_tail = page + i;
+
+		/* tail_page->_count cannot change */
+		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
+		BUG_ON(page_count(page) <= 0);
+		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
+		BUG_ON(atomic_read(&page_tail->_count) <= 0);
+
+		/* after clearing PageTail the gup refcount can be released */
+		smp_mb();
+
+		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		page_tail->flags |= (page->flags &
+				     ((1L << PG_referenced) |
+				      (1L << PG_swapbacked) |
+				      (1L << PG_mlocked) |
+				      (1L << PG_uptodate)));
+		page_tail->flags |= (1L << PG_dirty);
+
+		/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();
+
+		/*
+		 * __split_huge_page_splitting() already set the
+		 * splitting bit in all pmd that could map this
+		 * hugepage, that will ensure no CPU can alter the
+		 * mapcount on the head page. The mapcount is only
+		 * accounted in the head page and it has to be
+		 * transferred to all tail pages in the below code. So
+		 * for this code to be safe, during the split the mapcount
+		 * can't change. But that doesn't mean userland can't
+		 * keep changing and reading the page contents while
+		 * we transfer the mapcount, so the pmd splitting
+		 * status is achieved by setting a reserved bit in the
+		 * pmd, not by clearing the present bit.
+		*/
+		BUG_ON(page_mapcount(page_tail));
+		page_tail->_mapcount = page->_mapcount;
+
+		BUG_ON(page_tail->mapping);
+		page_tail->mapping = page->mapping;
+
+		page_tail->index = ++head_index;
+
+		BUG_ON(!PageAnon(page_tail));
+		BUG_ON(!PageUptodate(page_tail));
+		BUG_ON(!PageDirty(page_tail));
+		BUG_ON(!PageSwapBacked(page_tail));
+
+		lru_add_page_tail(zone, page, page_tail);
+	}
+
+	ClearPageCompound(page);
+	compound_unlock(page);
+	spin_unlock_irq(&zone->lru_lock);
+
+	for (i = 1; i < HPAGE_PMD_NR; i++) {
+		struct page *page_tail = page + i;
+		BUG_ON(page_count(page_tail) <= 0);
+		/*
+		 * Tail pages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		put_page(page_tail);
+	}
+
+	/*
+	 * Only the head page (now become a regular page) is required
+	 * to be pinned by the caller.
+	 */
+	BUG_ON(page_count(page) <= 0);
+}
+
+static int __split_huge_page_map(struct page *page,
+				 struct vm_area_struct *vma,
+				 unsigned long address)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd, _pmd;
+	int ret = 0, i;
+	pgtable_t pgtable;
+	unsigned long haddr;
+
+	spin_lock(&mm->page_table_lock);
+	pmd = page_check_address_pmd(page, mm, address,
+				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
+	if (pmd) {
+		pgtable = get_pmd_huge_pte(mm);
+		pmd_populate(mm, &_pmd, pgtable);
+
+		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
+		     i++, haddr += PAGE_SIZE) {
+			pte_t *pte, entry;
+			BUG_ON(PageCompound(page+i));
+			entry = mk_pte(page + i, vma->vm_page_prot);
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			if (!pmd_write(*pmd))
+				entry = pte_wrprotect(entry);
+			else
+				BUG_ON(page_mapcount(page) != 1);
+			if (!pmd_young(*pmd))
+				entry = pte_mkold(entry);
+			pte = pte_offset_map(&_pmd, haddr);
+			BUG_ON(!pte_none(*pte));
+			set_pte_at(mm, haddr, pte, entry);
+			pte_unmap(pte);
+		}
+
+		mm->nr_ptes++;
+		smp_wmb(); /* make pte visible before pmd */
+		/*
+		 * Up to this point the pmd is present and huge and
+		 * userland has full access to the hugepage
+		 * during the split (which happens in place). If we
+		 * overwrite the pmd with the not-huge version
+		 * pointing to the pte here (which of course we could
+		 * if all CPUs were bug free), userland could trigger
+		 * a small page size TLB miss on the small sized TLB
+		 * while the hugepage TLB entry is still established
+		 * in the huge TLB. Some CPU doesn't like that. See
+		 * in the huge TLB. Some CPUs don't like that. See
+		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
+		 * Erratum 383 on page 93. Intel should be safe but
+		 * also warns that it's only safe if the permission
+		 * and cache attributes of the two entries loaded in
+		 * the two TLBs are identical (which should be the case
+		 * small and huge TLB entries for the same virtual
+		 * address to be loaded simultaneously. So instead of
+		 * doing "pmd_populate(); flush_tlb_range();" we first
+		 * mark the current pmd notpresent (atomically because
+		 * here the pmd_trans_huge and pmd_trans_splitting
+		 * must remain set at all times on the pmd until the
+		 * split is complete for this pmd), then we flush the
+		 * SMP TLB and finally we write the non-huge version
+		 * of the pmd entry with pmd_populate.
+		 */
+		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		pmd_populate(mm, pmd, pgtable);
+		ret = 1;
+	}
+	spin_unlock(&mm->page_table_lock);
+
+	return ret;
+}
+
+/* must be called with anon_vma->root->lock held */
+static void __split_huge_page(struct page *page,
+			      struct anon_vma *anon_vma)
+{
+	int mapcount, mapcount2;
+	struct anon_vma_chain *avc;
+
+	BUG_ON(!PageHead(page));
+	BUG_ON(PageTail(page));
+
+	mapcount = 0;
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount += __split_huge_page_splitting(page, vma, addr);
+	}
+	BUG_ON(mapcount != page_mapcount(page));
+
+	__split_huge_page_refcount(page);
+
+	mapcount2 = 0;
+	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
+		struct vm_area_struct *vma = avc->vma;
+		unsigned long addr = vma_address(page, vma);
+		if (addr == -EFAULT)
+			continue;
+		mapcount2 += __split_huge_page_map(page, vma, addr);
+	}
+	BUG_ON(mapcount != mapcount2);
+}
+
+int split_huge_page(struct page *page)
+{
+	struct anon_vma *anon_vma;
+	int ret = 1;
+
+	BUG_ON(!PageAnon(page));
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		goto out;
+	ret = 0;
+	if (!PageCompound(page))
+		goto out_unlock;
+
+	BUG_ON(!PageSwapBacked(page));
+	__split_huge_page(page, anon_vma);
+
+	BUG_ON(PageCompound(page));
+out_unlock:
+	page_unlock_anon_vma(anon_vma);
+out:
+	return ret;
+}
+
+void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+{
+	struct page *page;
+
+	spin_lock(&mm->page_table_lock);
+	if (unlikely(!pmd_trans_huge(*pmd))) {
+		spin_unlock(&mm->page_table_lock);
+		return;
+	}
+	page = pmd_page(*pmd);
+	VM_BUG_ON(!page_count(page));
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	split_huge_page(page);
+
+	put_page(page);
+	BUG_ON(pmd_trans_huge(*pmd));
+}
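
As an aside for testers, not part of the patch itself: once hugepage_init()
registers the attribute group on mm_kobj, the knobs above land under
/sys/kernel/mm/transparent_hugepage/. A minimal userspace sketch that reads
the current `enabled' policy and flips it to madvise; the path is only
derived from the group and attribute names above, and writing needs root:

/* thp_sysfs.c - sketch only; the sysfs path is derived from the
 * "transparent_hugepage" group registered on /sys/kernel/mm above. */
#include <stdio.h>

static const char *path = "/sys/kernel/mm/transparent_hugepage/enabled";

int main(void)
{
	char buf[64] = "";
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("current: %s", buf);	/* e.g. "[always] madvise never" */
	fclose(f);

	f = fopen(path, "w");			/* needs root */
	if (!f) {
		perror(path);
		return 1;
	}
	/* double_flag_store() matches on the prefix, so "madvise" is enough */
	fputs("madvise", f);
	return fclose(f) ? 1 : 0;
}
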
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -726,9 +726,9 @@ out_set_pte:
 	return 0;
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		unsigned long addr, unsigned long end)
+int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
+		   unsigned long addr, unsigned long end)
 {
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
@@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct 
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*src_pmd)) {
+			int err;
+			err = copy_huge_pmd(dst_mm, src_mm,
+					    dst_pmd, src_pmd, addr, vma);
+			if (err == -ENOMEM)
+				return -ENOMEM;
+			if (!err)
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+		if (pmd_trans_huge(*pmd)) {
+			if (next-addr != HPAGE_PMD_SIZE)
+				split_huge_page_pmd(vma->vm_mm, pmd);
+			else if (zap_huge_pmd(tlb, vma, pmd)) {
+				(*zap_work)--;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd)) {
 			(*zap_work)--;
 			continue;
@@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
 	pmd = pmd_offset(pud, address);
 	if (pmd_none(*pmd))
 		goto no_page_table;
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
 		BUG_ON(flags & FOLL_GET);
 		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
 		goto out;
 	}
+	if (pmd_trans_huge(*pmd)) {
+		spin_lock(&mm->page_table_lock);
+		if (likely(pmd_trans_huge(*pmd))) {
+			if (unlikely(pmd_trans_splitting(*pmd))) {
+				spin_unlock(&mm->page_table_lock);
+				wait_split_huge_page(vma->anon_vma, pmd);
+			} else {
+				page = follow_trans_huge_pmd(mm, address,
+							     pmd, flags);
+				spin_unlock(&mm->page_table_lock);
+				goto out;
+			}
+		} else
+			spin_unlock(&mm->page_table_lock);
+		/* fall through */
+	}
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
@@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
  * but allow concurrent faults), and pte mapped but not yet locked.
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
-static inline int handle_pte_fault(struct mm_struct *mm,
-		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+int handle_pte_fault(struct mm_struct *mm,
+		     struct vm_area_struct *vma, unsigned long address,
+		     pte_t *pte, pmd_t *pmd, unsigned int flags)
 {
 	pte_t entry;
 	spinlock_t *ptl;
@@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
 	pmd = pmd_alloc(mm, pud, address);
 	if (!pmd)
 		return VM_FAULT_OOM;
-	pte = pte_alloc_map(mm, vma, pmd, address);
-	if (!pte)
+	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+		if (!vma->vm_ops)
+			return do_huge_pmd_anonymous_page(mm, vma, address,
+							  pmd, flags);
+	} else {
+		pmd_t orig_pmd = *pmd;
+		barrier();
+		if (pmd_trans_huge(orig_pmd)) {
+			if (flags & FAULT_FLAG_WRITE &&
+			    !pmd_write(orig_pmd) &&
+			    !pmd_trans_splitting(orig_pmd))
+				return do_huge_pmd_wp_page(mm, vma, address,
+							   pmd, orig_pmd);
+			return 0;
+		}
+	}
+
+	/*
+	 * Use __pte_alloc instead of pte_alloc_map, because we can't
+	 * run pte_offset_map on the pmd, if a huge pmd could
+	 * materialize from under us from a different thread.
+	 */
+	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
 		return VM_FAULT_OOM;
+	/* if a huge pmd materialized from under us just retry later */
+	if (unlikely(pmd_trans_huge(*pmd)))
+		return 0;
+	/*
+	 * A regular pmd is established and it can't morph into a huge pmd
+	 * from under us anymore at this point because we hold the mmap_sem
+	 * read mode and khugepaged takes it in write mode. So now it's
+	 * safe to run pte_offset_map().
+	 */
+	pte = pte_offset_map(pmd, address);
 
 	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
  * Returns virtual address or -EFAULT if page's index/offset is not
  * within the range mapped the @vma.
  */
-static inline unsigned long
+inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
 	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page 
 	pmd = pmd_offset(pud, address);
 	if (!pmd_present(*pmd))
 		return NULL;
+	if (pmd_trans_huge(*pmd))
+		return NULL;
 
 	pte = pte_offset_map(pmd, address);
 	/* Make a quick check before getting the lock */
@@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
 			unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	pte_t *pte;
-	spinlock_t *ptl;
 	int referenced = 0;
 
-	pte = page_check_address(page, mm, address, &ptl, 0);
-	if (!pte)
-		goto out;
-
 	/*
 	 * Don't want to elevate referenced for mlocked page that gets this far,
 	 * in order that it progresses to try_to_unmap and is moved to the
 	 * unevictable list.
 	 */
 	if (vma->vm_flags & VM_LOCKED) {
-		*mapcount = 1;	/* break early from loop */
+		*mapcount = 0;	/* break early from loop */
 		*vm_flags |= VM_LOCKED;
-		goto out_unmap;
-	}
-
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!VM_SequentialReadHint(vma)))
-			referenced++;
+		goto out;
 	}
 
 	/* Pretend the page is referenced if the task has the
@@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
 			rwsem_is_locked(&mm->mmap_sem))
 		referenced++;
 
-out_unmap:
+	if (unlikely(PageTransHuge(page))) {
+		pmd_t *pmd;
+
+		spin_lock(&mm->page_table_lock);
+		pmd = page_check_address_pmd(page, mm, address,
+					     PAGE_CHECK_ADDRESS_PMD_FLAG);
+		if (pmd && !pmd_trans_splitting(*pmd) &&
+		    pmdp_clear_flush_young_notify(vma, address, pmd))
+			referenced++;
+		spin_unlock(&mm->page_table_lock);
+	} else {
+		pte_t *pte;
+		spinlock_t *ptl;
+
+		pte = page_check_address(page, mm, address, &ptl, 0);
+		if (!pte)
+			goto out;
+
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!VM_SequentialReadHint(vma)))
+				referenced++;
+		}
+		pte_unmap_unlock(pte, ptl);
+	}
+
 	(*mapcount)--;
-	pte_unmap_unlock(pte, ptl);
 
 	if (referenced)
 		*vm_flags |= vma->vm_flags;
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
 
 EXPORT_SYMBOL(__pagevec_release);
 
+/* used by __split_huge_page_refcount() */
+void lru_add_page_tail(struct zone* zone,
+		       struct page *page, struct page *page_tail)
+{
+	int active;
+	enum lru_list lru;
+	const int file = 0;
+	struct list_head *head;
+
+	VM_BUG_ON(!PageHead(page));
+	VM_BUG_ON(PageCompound(page_tail));
+	VM_BUG_ON(PageLRU(page_tail));
+	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
+
+	SetPageLRU(page_tail);
+
+	if (page_evictable(page_tail, NULL)) {
+		if (PageActive(page)) {
+			SetPageActive(page_tail);
+			active = 1;
+			lru = LRU_ACTIVE_ANON;
+		} else {
+			active = 0;
+			lru = LRU_INACTIVE_ANON;
+		}
+		update_page_reclaim_stat(zone, page_tail, file, active);
+		if (likely(PageLRU(page)))
+			head = page->lru.prev;
+		else
+			head = &zone->lru[lru].list;
+		__add_page_to_lru_list(zone, page_tail, lru, head);
+	} else {
+		SetPageUnevictable(page_tail);
+		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
+	}
+}
+
 /*
  * Add the passed pages to the LRU, then drop the caller's refcount
  * on them.  Reinitialises the caller's pagevec.
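
To see the new anonymous huge-pmd fault path in action, here is a minimal
userspace sketch, not part of the patch: it assumes x86_64's 2M HPAGE_PMD_SIZE
and the `enabled' knob set to always. It carves out a 2M-aligned anonymous
range and touches it, so the first fault passes the haddr >= vma->vm_start &&
haddr + HPAGE_PMD_SIZE <= vma->vm_end check in do_huge_pmd_anonymous_page()
and can be served by a single hugepage. With the accounting patches later in
the series applied, the region shows up as AnonHugePages.

/* thp_touch.c - sketch only: assumes a 2M HPAGE_PMD_SIZE (x86_64) and
 * /sys/kernel/mm/transparent_hugepage/enabled set to "always". */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
	/* over-allocate so a 2M-aligned, 2M-sized range fits inside */
	size_t len = 2 * HPAGE_SIZE;
	char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *aligned;

	if (raw == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	aligned = (char *)(((unsigned long)raw + HPAGE_SIZE - 1) &
			   ~(HPAGE_SIZE - 1));

	/* the first write faults the whole pmd: the kernel can install one
	 * huge pmd instead of 512 ptes */
	memset(aligned, 0xaa, HPAGE_SIZE);

	printf("touched 2M at %p; inspect /proc/%d/smaps\n",
	       (void *)aligned, (int)getpid());
	getchar();	/* keep the mapping alive while you look */
	return 0;
}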


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 31 of 66] split_huge_page anon_vma ordering dependency
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

This documents how split_huge_page is safe against new vma insertions into
the anon_vma that may have already released the anon_vma->lock but not
established pmds yet when split_huge_page starts.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -840,6 +840,19 @@ static void __split_huge_page(struct pag
 			continue;
 		mapcount += __split_huge_page_splitting(page, vma, addr);
 	}
+	/*
+	 * It is critical that new vmas are added to the tail of the
+	 * anon_vma list. This guarantees that if copy_huge_pmd() runs
+	 * and establishes a child pmd before
+	 * __split_huge_page_splitting() freezes the parent pmd (so if
+	 * we fail to prevent copy_huge_pmd() from running until the
+	 * whole __split_huge_page() is complete), we will still see
+	 * the newly established pmd of the child later during the
+	 * walk, to be able to set it as pmd_trans_splitting too.
+	 */
+	if (mapcount != page_mapcount(page))
+		printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+		       mapcount, page_mapcount(page));
 	BUG_ON(mapcount != page_mapcount(page));
 
 	__split_huge_page_refcount(page);
@@ -852,6 +865,9 @@ static void __split_huge_page(struct pag
 			continue;
 		mapcount2 += __split_huge_page_map(page, vma, addr);
 	}
+	if (mapcount != mapcount2)
+		printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+		       mapcount, mapcount2, page_mapcount(page));
 	BUG_ON(mapcount != mapcount2);
 }
 
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -177,6 +177,10 @@ static void anon_vma_chain_link(struct v
 	list_add(&avc->same_vma, &vma->anon_vma_chain);
 
 	anon_vma_lock(anon_vma);
+	/*
+	 * It's critical to add new vmas to the tail of the anon_vma,
+	 * see comment in huge_memory.c:__split_huge_page().
+	 */
 	list_add_tail(&avc->same_anon_vma, &anon_vma->head);
 	anon_vma_unlock(anon_vma);
 }
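
The ordering argument is easy to model in plain C. A toy sketch, with invented
names and nothing kernel specific: a walker that iterates head to tail still
reaches nodes appended at the tail while the walk is in progress, which is the
property list_add_tail() buys __split_huge_page() here; a node inserted at the
head mid-walk would be missed instead.

/* tail_order.c - toy model of the "new vmas go to the tail" argument */
#include <stdio.h>
#include <stdlib.h>

struct node {
	const char *name;
	struct node *next;
};

static struct node *head, *tail;

static void add_tail(const char *name)
{
	struct node *n = calloc(1, sizeof(*n));

	n->name = name;
	if (tail)
		tail->next = n;
	else
		head = n;
	tail = n;
}

int main(void)
{
	struct node *n;
	int inserted = 0;

	add_tail("parent vma");
	add_tail("sibling vma");

	/* the split_huge_page "walk": whatever is appended at the tail while
	 * the walk runs is still visited before the walk terminates */
	for (n = head; n; n = n->next) {
		printf("walk visits: %s\n", n->name);
		if (!inserted) {
			add_tail("vma added during the walk");
			inserted = 1;
		}
	}
	return 0;
}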

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 32 of 66] verify pmd_trans_huge isn't leaking
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

pmd_trans_huge must not leak into certain vmas, such as the mmio special pfn
mapping or file-backed mappings.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1428,6 +1428,7 @@ int __get_user_pages(struct task_struct 
 			pmd = pmd_offset(pud, pg);
 			if (pmd_none(*pmd))
 				return i ? : -EFAULT;
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			pte = pte_offset_map(pmd, pg);
 			if (pte_none(*pte)) {
 				pte_unmap(pte);
@@ -1640,8 +1641,10 @@ pte_t *__get_locked_pte(struct mm_struct
 	pud_t * pud = pud_alloc(mm, pgd, addr);
 	if (pud) {
 		pmd_t * pmd = pmd_alloc(mm, pud, addr);
-		if (pmd)
+		if (pmd) {
+			VM_BUG_ON(pmd_trans_huge(*pmd));
 			return pte_alloc_map_lock(mm, pmd, addr, ptl);
+		}
 	}
 	return NULL;
 }
@@ -1860,6 +1863,7 @@ static inline int remap_pmd_range(struct
 	pmd = pmd_alloc(mm, pud, addr);
 	if (!pmd)
 		return -ENOMEM;
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
 		if (remap_pte_range(mm, pmd, addr, next,
@@ -3428,6 +3432,7 @@ static int __follow_pte(struct mm_struct
 		goto out;
 
 	pmd = pmd_offset(pud, address);
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 33 of 66] madvise(MADV_HUGEPAGE)
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add the MADV_HUGEPAGE madvise hint to mark regions that should preferably be
backed by hugepages. Return -EINVAL if the vma is not anonymous, or if the
feature isn't built into the kernel. Never silently return success.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -99,6 +99,7 @@ extern void __split_huge_page_pmd(struct
 #endif
 
 extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
+extern int hugepage_madvise(unsigned long *vm_flags);
 static inline int PageTransHuge(struct page *page)
 {
 	VM_BUG_ON(PageTail(page));
@@ -121,6 +122,11 @@ static inline int split_huge_page(struct
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
 #define PageTransHuge(page) 0
+static inline int hugepage_madvise(unsigned long *vm_flags)
+{
+	BUG_ON(0);
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -894,6 +894,22 @@ out:
 	return ret;
 }
 
+int hugepage_madvise(unsigned long *vm_flags)
+{
+	/*
+	 * Be somewhat over-protective like KSM for now!
+	 */
+	if (*vm_flags & (VM_HUGEPAGE | VM_SHARED  | VM_MAYSHARE   |
+			 VM_PFNMAP   | VM_IO      | VM_DONTEXPAND |
+			 VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
+			 VM_MIXEDMAP | VM_SAO))
+		return -EINVAL;
+
+	*vm_flags |= VM_HUGEPAGE;
+
+	return 0;
+}
+
 void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 {
 	struct page *page;
diff --git a/mm/madvise.c b/mm/madvise.c
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a
 		if (error)
 			goto out;
 		break;
+	case MADV_HUGEPAGE:
+		error = hugepage_madvise(&new_flags);
+		if (error)
+			goto out;
+		break;
 	}
 
 	if (new_flags == vma->vm_flags) {
@@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	case MADV_HUGEPAGE:
+#endif
 		return 1;
 
 	default:
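
For completeness, a sketch of the intended userland usage, not part of the
patch. The fallback define assumes the MADV_HUGEPAGE value the series adds to
the generic mman headers; treat it as an assumption if your headers differ.

/* madv_huge.c - hint a large anonymous arena; the fallback MADV_HUGEPAGE
 * value below is an assumption for old <sys/mman.h> headers. */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif

int main(void)
{
	size_t len = 64UL * 1024 * 1024;	/* 64M anonymous arena */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* EINVAL means THP is not built in or the vma is not plain anonymous
	 * memory; the region keeps working, just without hugepages */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	return 0;
}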

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 34 of 66] add PageTransCompound
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Remove branches when CONFIG_TRANSPARENT_HUGEPAGE=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -105,6 +105,10 @@ static inline int PageTransHuge(struct p
 	VM_BUG_ON(PageTail(page));
 	return PageHead(page);
 }
+static inline int PageTransCompound(struct page *page)
+{
+	return PageCompound(page);
+}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUG(); 0; })
@@ -122,6 +126,7 @@ static inline int split_huge_page(struct
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
 #define PageTransHuge(page) 0
+#define PageTransCompound(page) 0
 static inline int hugepage_madvise(unsigned long *vm_flags)
 {
 	BUG_ON(0);
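
The point of the =n stub is that a constant 0 lets the compiler delete the
THP-only branch entirely. A standalone mock of the pattern, with invented
names, just to show the shape:

/* branch_stub.c - mock of the compile-out pattern; flip HAVE_THP to 1 to
 * keep the "compound" branch, leave it 0 to let the compiler drop it. */
#include <stdio.h>

#define HAVE_THP 0

#if HAVE_THP
static int page_trans_compound(int is_compound) { return is_compound; }
#else
#define page_trans_compound(page) 0	/* constant: callers' branches go dead */
#endif

static void reclaim_one(int is_compound)
{
	if (page_trans_compound(is_compound))
		puts("compound handling");	/* dead code when HAVE_THP is 0 */
	else
		puts("regular 4k handling");
}

int main(void)
{
	reclaim_one(1);
	return 0;
}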

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 35 of 66] pmd_trans_huge migrate bugcheck
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

No pmd_trans_huge should ever materialize in areas covered by migration ptes,
because we split the hugepage before migration ptes are instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1475,6 +1475,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_SPLIT	0x20	/* don't return transhuge pages, split them */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1305,6 +1305,10 @@ struct page *follow_page(struct vm_area_
 		goto out;
 	}
 	if (pmd_trans_huge(*pmd)) {
+		if (flags & FOLL_SPLIT) {
+			split_huge_page_pmd(mm, pmd);
+			goto split_fallthrough;
+		}
 		spin_lock(&mm->page_table_lock);
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
@@ -1320,6 +1324,7 @@ struct page *follow_page(struct vm_area_
 			spin_unlock(&mm->page_table_lock);
 		/* fall through */
 	}
+split_fallthrough:
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -111,6 +111,8 @@ static int remove_migration_pte(struct p
 			goto out;
 
 		pmd = pmd_offset(pud, addr);
+		if (pmd_trans_huge(*pmd))
+			goto out;
 		if (!pmd_present(*pmd))
 			goto out;
 
@@ -630,6 +632,9 @@ static int unmap_and_move(new_page_t get
 		/* page was freed from under us. So we are done. */
 		goto move_newpage;
 	}
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page)))
+			goto move_newpage;
 
 	/* prepare cgroup just returns 0 or -ENOMEM */
 	rc = -EAGAIN;
@@ -1040,7 +1045,7 @@ static int do_move_page_to_node_array(st
 		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET);
+		page = follow_page(vma, pp->addr, FOLL_GET|FOLL_SPLIT);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))
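
do_move_page_to_node_array() is what backs the move_pages(2) syscall, so
migrating a THP-backed address from userland now splits the hugepage first and
then migrates the resulting 4k page. A sketch of how that path gets exercised,
assuming libnuma's <numaif.h>, a second NUMA node and a kernel with this
series applied; build with -lnuma:

/* move_one_page.c - exercises follow_page(FOLL_GET|FOLL_SPLIT) via
 * move_pages(2); needs a second NUMA node for the move to succeed. */
#include <stdio.h>
#include <string.h>
#include <numaif.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL * 1024 * 1024;		/* may end up THP-backed */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int node = 1, status = -1;

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, len);			/* fault the memory in */

	/* if p sits in a transparent hugepage, the kernel splits it before
	 * migrating the single 4k page at p */
	if (move_pages(0 /* self */, 1, &p, &node, &status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("page now on node %d\n", status);
	return 0;
}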

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 35 of 66] pmd_trans_huge migrate bugcheck
@ 2010-11-03 15:28   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

No pmd_trans_huge should ever materialize in migration ptes areas, because
we split the hugepage before migration ptes are instantiated.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1475,6 +1475,7 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET	0x04	/* do get_page on page */
 #define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
 #define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_SPLIT	0x20	/* don't return transhuge pages, split them */
 
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1305,6 +1305,10 @@ struct page *follow_page(struct vm_area_
 		goto out;
 	}
 	if (pmd_trans_huge(*pmd)) {
+		if (flags & FOLL_SPLIT) {
+			split_huge_page_pmd(mm, pmd);
+			goto split_fallthrough;
+		}
 		spin_lock(&mm->page_table_lock);
 		if (likely(pmd_trans_huge(*pmd))) {
 			if (unlikely(pmd_trans_splitting(*pmd))) {
@@ -1320,6 +1324,7 @@ struct page *follow_page(struct vm_area_
 			spin_unlock(&mm->page_table_lock);
 		/* fall through */
 	}
+split_fallthrough:
 	if (unlikely(pmd_bad(*pmd)))
 		goto no_page_table;
 
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -111,6 +111,8 @@ static int remove_migration_pte(struct p
 			goto out;
 
 		pmd = pmd_offset(pud, addr);
+		if (pmd_trans_huge(*pmd))
+			goto out;
 		if (!pmd_present(*pmd))
 			goto out;
 
@@ -630,6 +632,9 @@ static int unmap_and_move(new_page_t get
 		/* page was freed from under us. So we are done. */
 		goto move_newpage;
 	}
+	if (unlikely(PageTransHuge(page)))
+		if (unlikely(split_huge_page(page)))
+			goto move_newpage;
 
 	/* prepare cgroup just returns 0 or -ENOMEM */
 	rc = -EAGAIN;
@@ -1040,7 +1045,7 @@ static int do_move_page_to_node_array(st
 		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
 			goto set_status;
 
-		page = follow_page(vma, pp->addr, FOLL_GET);
+		page = follow_page(vma, pp->addr, FOLL_GET|FOLL_SPLIT);
 
 		err = PTR_ERR(page);
 		if (IS_ERR(page))

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 36 of 66] memcg compound
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Teach memcg to charge/uncharge compound pages.
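
For illustration only (not part of the patch), a minimal userspace sketch of
the size arithmetic the diff below relies on: the byte charge of a transparent
hugepage is PAGE_SIZE shifted left by its compound order, and cancel_charge
converts that back to a page count. The constants are assumptions for x86 with
4k base pages and a 2M hugepage; the names stand in for the kernel's.

/* hypothetical stand-ins for PAGE_SHIFT, PAGE_SIZE and compound_order() */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define HPAGE_PMD_ORDER	9	/* 2M / 4k = 512 base pages */

int main(void)
{
	unsigned long page_size = PAGE_SIZE;
	int order = HPAGE_PMD_ORDER;	/* compound_order(page); 0 for a normal page */

	/* mirrors: if (PageTransHuge(page)) page_size <<= compound_order(page); */
	page_size <<= order;

	printf("charge %lu bytes, cancel as %lu pages\n",
	       page_size, page_size >> PAGE_SHIFT);	/* 2097152 bytes, 512 pages */
	return 0;
}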

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1019,6 +1019,10 @@ mem_cgroup_get_reclaim_stat_from_page(st
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup_per_zone *mz;
+	int page_size = PAGE_SIZE;
+
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1879,12 +1883,14 @@ static int __mem_cgroup_do_charge(struct
  * oom-killer can be invoked.
  */
 static int __mem_cgroup_try_charge(struct mm_struct *mm,
-		gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
+				   gfp_t gfp_mask,
+				   struct mem_cgroup **memcg, bool oom,
+				   int page_size)
 {
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem = NULL;
 	int ret;
-	int csize = CHARGE_SIZE;
+	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
 
 	/*
 	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
@@ -1909,7 +1915,7 @@ again:
 		VM_BUG_ON(css_is_removed(&mem->css));
 		if (mem_cgroup_is_root(mem))
 			goto done;
-		if (consume_stock(mem))
+		if (page_size == PAGE_SIZE && consume_stock(mem))
 			goto done;
 		css_get(&mem->css);
 	} else {
@@ -1933,7 +1939,7 @@ again:
 			rcu_read_unlock();
 			goto done;
 		}
-		if (consume_stock(mem)) {
+		if (page_size == PAGE_SIZE && consume_stock(mem)) {
 			/*
 			 * It seems dagerous to access memcg without css_get().
 			 * But considering how consume_stok works, it's not
@@ -1974,7 +1980,7 @@ again:
 		case CHARGE_OK:
 			break;
 		case CHARGE_RETRY: /* not in OOM situation but retry */
-			csize = PAGE_SIZE;
+			csize = page_size;
 			css_put(&mem->css);
 			mem = NULL;
 			goto again;
@@ -1995,8 +2001,8 @@ again:
 		}
 	} while (ret != CHARGE_OK);
 
-	if (csize > PAGE_SIZE)
-		refill_stock(mem, csize - PAGE_SIZE);
+	if (csize > page_size)
+		refill_stock(mem, csize - page_size);
 	css_put(&mem->css);
 done:
 	*memcg = mem;
@@ -2024,9 +2030,10 @@ static void __mem_cgroup_cancel_charge(s
 	}
 }
 
-static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
+static void mem_cgroup_cancel_charge(struct mem_cgroup *mem,
+				     int page_size)
 {
-	__mem_cgroup_cancel_charge(mem, 1);
+	__mem_cgroup_cancel_charge(mem, page_size >> PAGE_SHIFT);
 }
 
 /*
@@ -2082,8 +2089,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
  */
 
 static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
-				     struct page_cgroup *pc,
-				     enum charge_type ctype)
+				       struct page_cgroup *pc,
+				       enum charge_type ctype,
+				       int page_size)
 {
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
@@ -2092,7 +2100,7 @@ static void __mem_cgroup_commit_charge(s
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem);
+		mem_cgroup_cancel_charge(mem, page_size);
 		return;
 	}
 
@@ -2166,7 +2174,7 @@ static void __mem_cgroup_move_account(st
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
-		mem_cgroup_cancel_charge(from);
+		mem_cgroup_cancel_charge(from, PAGE_SIZE);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
@@ -2227,13 +2235,14 @@ static int mem_cgroup_move_parent(struct
 		goto put;
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
+				      PAGE_SIZE);
 	if (ret || !parent)
 		goto put_back;
 
 	ret = mem_cgroup_move_account(pc, child, parent, true);
 	if (ret)
-		mem_cgroup_cancel_charge(parent);
+		mem_cgroup_cancel_charge(parent, PAGE_SIZE);
 put_back:
 	putback_lru_page(page);
 put:
@@ -2254,6 +2263,10 @@ static int mem_cgroup_charge_common(stru
 	struct mem_cgroup *mem = NULL;
 	struct page_cgroup *pc;
 	int ret;
+	int page_size = PAGE_SIZE;
+
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -2261,11 +2274,11 @@ static int mem_cgroup_charge_common(stru
 		return 0;
 	prefetchw(pc);
 
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
 	if (ret || !mem)
 		return ret;
 
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
 	return 0;
 }
 
@@ -2274,8 +2287,6 @@ int mem_cgroup_newpage_charge(struct pag
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	if (PageCompound(page))
-		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -2381,13 +2392,13 @@ int mem_cgroup_try_charge_swapin(struct 
 	if (!mem)
 		goto charge_cur_mm;
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
+	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE);
 	css_put(&mem->css);
 	return ret;
 charge_cur_mm:
 	if (unlikely(!mm))
 		mm = &init_mm;
-	return __mem_cgroup_try_charge(mm, mask, ptr, true);
+	return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE);
 }
 
 static void
@@ -2403,7 +2414,7 @@ __mem_cgroup_commit_charge_swapin(struct
 	cgroup_exclude_rmdir(&ptr->css);
 	pc = lookup_page_cgroup(page);
 	mem_cgroup_lru_del_before_commit_swapcache(page);
-	__mem_cgroup_commit_charge(ptr, pc, ctype);
+	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
 	mem_cgroup_lru_add_after_commit_swapcache(page);
 	/*
 	 * Now swap is on-memory. This means this page may be
@@ -2452,11 +2463,12 @@ void mem_cgroup_cancel_charge_swapin(str
 		return;
 	if (!mem)
 		return;
-	mem_cgroup_cancel_charge(mem);
+	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
 }
 
 static void
-__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
+__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
+	      int page_size)
 {
 	struct memcg_batch_info *batch = NULL;
 	bool uncharge_memsw = true;
@@ -2491,14 +2503,14 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (batch->memcg != mem)
 		goto direct_uncharge;
 	/* remember freed charge and uncharge it later */
-	batch->bytes += PAGE_SIZE;
+	batch->bytes += page_size;
 	if (uncharge_memsw)
-		batch->memsw_bytes += PAGE_SIZE;
+		batch->memsw_bytes += page_size;
 	return;
 direct_uncharge:
-	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	res_counter_uncharge(&mem->res, page_size);
 	if (uncharge_memsw)
-		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+		res_counter_uncharge(&mem->memsw, page_size);
 	if (unlikely(batch->memcg != mem))
 		memcg_oom_recover(mem);
 	return;
@@ -2512,6 +2524,7 @@ __mem_cgroup_uncharge_common(struct page
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
+	int page_size = PAGE_SIZE;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2519,6 +2532,9 @@ __mem_cgroup_uncharge_common(struct page
 	if (PageSwapCache(page))
 		return NULL;
 
+	if (PageTransHuge(page))
+		page_size <<= compound_order(page);
+
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -2572,7 +2588,7 @@ __mem_cgroup_uncharge_common(struct page
 		mem_cgroup_get(mem);
 	}
 	if (!mem_cgroup_is_root(mem))
-		__do_uncharge(mem, ctype);
+		__do_uncharge(mem, ctype, page_size);
 
 	return mem;
 
@@ -2767,6 +2783,7 @@ int mem_cgroup_prepare_migration(struct 
 	enum charge_type ctype;
 	int ret = 0;
 
+	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return 0;
 
@@ -2816,7 +2833,7 @@ int mem_cgroup_prepare_migration(struct 
 		return 0;
 
 	*ptr = mem;
-	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
+	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false, PAGE_SIZE);
 	css_put(&mem->css);/* drop extra refcnt */
 	if (ret || *ptr == NULL) {
 		if (PageAnon(page)) {
@@ -2843,7 +2860,7 @@ int mem_cgroup_prepare_migration(struct 
 		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	else
 		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-	__mem_cgroup_commit_charge(mem, pc, ctype);
+	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
 	return ret;
 }
 
@@ -4452,7 +4469,8 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
+		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
+					      PAGE_SIZE);
 		if (ret || !mem)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return -ENOMEM;
@@ -4614,6 +4632,7 @@ static int mem_cgroup_count_precharge_pt
 	pte_t *pte;
 	spinlock_t *ptl;
 
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
@@ -4765,6 +4784,7 @@ static int mem_cgroup_move_charge_pte_ra
 	spinlock_t *ptl;
 
 retry:
+	VM_BUG_ON(pmd_trans_huge(*pmd));
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 37 of 66] transhuge-memcg: commit tail pages at charge
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

With this patch, when a transparent hugepage is charged, not only the head page
but also all the tail pages are committed; in other words, pc->mem_cgroup and
pc->flags are set for the tail pages as well.

Without this patch:

- Tail pages are not linked to any memcg's LRU at splitting. This causes many
  problems; for example, the charged memcg's directory can never be rmdir'ed
  because it doesn't have enough pages to scan to make the usage decrease to 0.
- The "rss" field in memory.stat would be incorrect. Moreover, usage_in_bytes
  in the root cgroup has been calculated from the stats rather than from the
  res_counter since 2.6.32, so it would be incorrect as well.
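
For illustration only (not part of the patch), a hedged userspace sketch of the
commit loop: once the head page_cgroup is locked and found unused, the head and
all the tail page_cgroups are initialized in one pass. Names and sizes here are
stand-ins, assuming a 2M hugepage made of 512 base pages.

#include <stdio.h>

#define PAGE_SHIFT	12
#define HPAGE_NR	512	/* base pages per 2M hugepage */

struct demo_pc {
	void *mem_cgroup;	/* stand-in for pc->mem_cgroup */
	int used;		/* stand-in for the PageCgroupUsed flag */
};

static struct demo_pc pcs[HPAGE_NR];

/* mirrors __mem_cgroup_commit_charge(): one loop over head + tail entries */
static void demo_commit(void *memcg, struct demo_pc *pc, long page_size)
{
	long i, count = page_size >> PAGE_SHIFT;

	for (i = 0; i < count; i++) {
		pc[i].mem_cgroup = memcg;
		pc[i].used = 1;	/* tail entries are private here, no locking shown */
	}
}

int main(void)
{
	int memcg;	/* dummy cgroup identity */

	demo_commit(&memcg, pcs, (long)HPAGE_NR << PAGE_SHIFT);
	printf("committed %d page_cgroups to one memcg\n", HPAGE_NR);
	return 0;
}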

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2087,23 +2087,10 @@ struct mem_cgroup *try_get_mem_cgroup_fr
  * commit a charge got by __mem_cgroup_try_charge() and makes page_cgroup to be
  * USED state. If already USED, uncharge and return.
  */
-
-static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
-				       struct page_cgroup *pc,
-				       enum charge_type ctype,
-				       int page_size)
+static void ____mem_cgroup_commit_charge(struct mem_cgroup *mem,
+					 struct page_cgroup *pc,
+					 enum charge_type ctype)
 {
-	/* try_charge() can return NULL to *memcg, taking care of it. */
-	if (!mem)
-		return;
-
-	lock_page_cgroup(pc);
-	if (unlikely(PageCgroupUsed(pc))) {
-		unlock_page_cgroup(pc);
-		mem_cgroup_cancel_charge(mem, page_size);
-		return;
-	}
-
 	pc->mem_cgroup = mem;
 	/*
 	 * We access a page_cgroup asynchronously without lock_page_cgroup().
@@ -2128,6 +2115,33 @@ static void __mem_cgroup_commit_charge(s
 	}
 
 	mem_cgroup_charge_statistics(mem, pc, true);
+}
+
+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+				       struct page_cgroup *pc,
+				       enum charge_type ctype,
+				       int page_size)
+{
+	int i;
+	int count = page_size >> PAGE_SHIFT;
+
+	/* try_charge() can return NULL to *memcg, taking care of it. */
+	if (!mem)
+		return;
+
+	lock_page_cgroup(pc);
+	if (unlikely(PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		mem_cgroup_cancel_charge(mem, page_size);
+		return;
+	}
+
+	/*
+	 * we don't need page_cgroup_lock about tail pages, becase they are not
+	 * accessed by any other context at this point.
+	 */
+	for (i = 0; i < count; i++)
+		____mem_cgroup_commit_charge(mem, pc + i, ctype);
 
 	unlock_page_cgroup(pc);
 	/*
@@ -2522,6 +2536,8 @@ direct_uncharge:
 static struct mem_cgroup *
 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
+	int i;
+	int count;
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	int page_size = PAGE_SIZE;
@@ -2535,6 +2551,7 @@ __mem_cgroup_uncharge_common(struct page
 	if (PageTransHuge(page))
 		page_size <<= compound_order(page);
 
+	count = page_size >> PAGE_SHIFT;
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -2567,7 +2584,8 @@ __mem_cgroup_uncharge_common(struct page
 		break;
 	}
 
-	mem_cgroup_charge_statistics(mem, pc, false);
+	for (i = 0; i < count; i++)
+		mem_cgroup_charge_statistics(mem, pc + i, false);
 
 	ClearPageCgroupUsed(pc);
 	/*

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 38 of 66] memcontrol: try charging huge pages from stock
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

The stock unit is just bytes, so there is no reason to take only normal
pages from it.
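
For illustration only (not part of the patch), a hedged userspace model of the
byte-based stock consumption this change enables; the struct and names are
stand-ins for the kernel's per-cpu memcg stock.

#include <stdbool.h>
#include <stdio.h>

struct demo_stock {
	void *cached;		/* memcg the stock was charged for */
	unsigned long charge;	/* pre-charged bytes */
};

/* mirrors consume_stock(mem, val): succeed only if the stock covers val bytes */
static bool demo_consume_stock(struct demo_stock *stock, void *mem,
			       unsigned long val)
{
	if (mem == stock->cached && stock->charge >= val) {
		stock->charge -= val;
		return true;
	}
	return false;	/* caller falls back to res_counter charging */
}

int main(void)
{
	int memcg;	/* dummy cgroup identity */
	struct demo_stock stock = { .cached = &memcg, .charge = 4UL << 20 };

	printf("2M hugepage from a 4M stock: %d\n",
	       demo_consume_stock(&stock, &memcg, 2UL << 20));	/* 1 */
	printf("another 4M from the remaining 2M: %d\n",
	       demo_consume_stock(&stock, &memcg, 4UL << 20));	/* 0 */
	return 0;
}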

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1658,19 +1658,19 @@ static DEFINE_PER_CPU(struct memcg_stock
 static atomic_t memcg_drain_count;
 
 /*
- * Try to consume stocked charge on this cpu. If success, PAGE_SIZE is consumed
+ * Try to consume stocked charge on this cpu. If success, @val is consumed
  * from local stock and true is returned. If the stock is 0 or charges from a
  * cgroup which is not current target, returns false. This stock will be
  * refilled.
  */
-static bool consume_stock(struct mem_cgroup *mem)
+static bool consume_stock(struct mem_cgroup *mem, int val)
 {
 	struct memcg_stock_pcp *stock;
 	bool ret = true;
 
 	stock = &get_cpu_var(memcg_stock);
-	if (mem == stock->cached && stock->charge)
-		stock->charge -= PAGE_SIZE;
+	if (mem == stock->cached && stock->charge >= val)
+		stock->charge -= val;
 	else /* need to call res_counter_charge */
 		ret = false;
 	put_cpu_var(memcg_stock);
@@ -1915,7 +1915,7 @@ again:
 		VM_BUG_ON(css_is_removed(&mem->css));
 		if (mem_cgroup_is_root(mem))
 			goto done;
-		if (page_size == PAGE_SIZE && consume_stock(mem))
+		if (consume_stock(mem, page_size))
 			goto done;
 		css_get(&mem->css);
 	} else {
@@ -1939,7 +1939,7 @@ again:
 			rcu_read_unlock();
 			goto done;
 		}
-		if (page_size == PAGE_SIZE && consume_stock(mem)) {
+		if (consume_stock(mem, page_size)) {
 			/*
 			 * It seems dagerous to access memcg without css_get().
 			 * But considering how consume_stok works, it's not

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 39 of 66] memcg huge memory
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add memcg charge/uncharge to hugepage faults in huge_memory.c.
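
For illustration only (not part of the patch), a hedged sketch of the unwind
pattern the hunks below follow: the new page is charged to the memcg before it
is mapped, and every failure path taken after the charge must uncharge the page
before dropping the reference. All names are hypothetical stand-ins.

#include <stdbool.h>
#include <stdio.h>

static bool demo_charge(void)   { return true; }		/* the charge step */
static void demo_uncharge(void) { printf("  uncharge\n"); }
static void demo_put_page(void) { printf("  put_page\n"); }
static bool demo_map(bool fail) { return !fail; }		/* e.g. pmd still matches */

static int demo_fault(bool fail_late)
{
	if (!demo_charge())
		return -1;		/* nothing charged, nothing to unwind */

	if (!demo_map(fail_late)) {
		demo_uncharge();	/* unwind the charge first... */
		demo_put_page();	/* ...then drop the page reference */
		return -1;
	}
	return 0;
}

int main(void)
{
	printf("fault succeeds: %d\n", demo_fault(false));
	printf("fault failing after the charge:\n");
	printf("  -> %d\n", demo_fault(true));
	return 0;
}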

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -233,6 +233,7 @@ static int __do_huge_pmd_anonymous_page(
 	VM_BUG_ON(!PageCompound(page));
 	pgtable = pte_alloc_one(mm, haddr);
 	if (unlikely(!pgtable)) {
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		return VM_FAULT_OOM;
 	}
@@ -243,6 +244,7 @@ static int __do_huge_pmd_anonymous_page(
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_none(*pmd))) {
 		spin_unlock(&mm->page_table_lock);
+		mem_cgroup_uncharge_page(page);
 		put_page(page);
 		pte_free(mm, pgtable);
 	} else {
@@ -286,6 +288,10 @@ int do_huge_pmd_anonymous_page(struct mm
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
+		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+			put_page(page);
+			goto out;
+		}
 
 		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
 	}
@@ -402,9 +408,15 @@ static int do_huge_pmd_wp_page_fallback(
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
 					  vma, address);
-		if (unlikely(!pages[i])) {
-			while (--i >= 0)
+		if (unlikely(!pages[i] ||
+			     mem_cgroup_newpage_charge(pages[i], mm,
+						       GFP_KERNEL))) {
+			if (pages[i])
 				put_page(pages[i]);
+			while (--i >= 0) {
+				mem_cgroup_uncharge_page(pages[i]);
+				put_page(pages[i]);
+			}
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;
@@ -455,8 +467,10 @@ out:
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
-	for (i = 0; i < HPAGE_PMD_NR; i++)
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		mem_cgroup_uncharge_page(pages[i]);
 		put_page(pages[i]);
+	}
 	kfree(pages);
 	goto out;
 }
@@ -501,14 +515,22 @@ int do_huge_pmd_wp_page(struct mm_struct
 		goto out;
 	}
 
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
+		put_page(new_page);
+		put_page(page);
+		ret |= VM_FAULT_OOM;
+		goto out;
+	}
+
 	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
 	spin_lock(&mm->page_table_lock);
 	put_page(page);
-	if (unlikely(!pmd_same(*pmd, orig_pmd)))
+	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-	else {
+	} else {
 		pmd_t entry;
 		VM_BUG_ON(!PageHead(page));
 		entry = mk_pmd(new_page, vma->vm_page_prot);

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 40 of 66] transparent hugepage vmstat
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add hugepage stat information to /proc/vmstat and /proc/meminfo.
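
As a usage example (not part of the patch), a small reader for the counters the
diff below exports; the field names are taken from the patch itself. Note the
vmstat counter is in units of hugepages, while AnonHugePages is reported in kB.

#include <stdio.h>
#include <string.h>

static void grep_file(const char *path, const char *key)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, key, strlen(key)))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	grep_file("/proc/meminfo", "AnonHugePages:");
	grep_file("/proc/vmstat", "nr_anon_transparent_hugepages");
	return 0;
}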

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,6 +101,9 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		"HardwareCorrupted: %5lu kB\n"
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		"AnonHugePages:  %8lu kB\n"
+#endif
 		,
 		K(i.totalram),
 		K(i.freeram),
@@ -128,7 +131,12 @@ static int meminfo_proc_show(struct seq_
 		K(i.freeswap),
 		K(global_page_state(NR_FILE_DIRTY)),
 		K(global_page_state(NR_WRITEBACK)),
-		K(global_page_state(NR_ANON_PAGES)),
+		K(global_page_state(NR_ANON_PAGES)
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		  + global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
+		  HPAGE_PMD_NR
+#endif
+		  ),
 		K(global_page_state(NR_FILE_MAPPED)),
 		K(global_page_state(NR_SHMEM)),
 		K(global_page_state(NR_SLAB_RECLAIMABLE) +
@@ -151,6 +159,10 @@ static int meminfo_proc_show(struct seq_
 #ifdef CONFIG_MEMORY_FAILURE
 		,atomic_long_read(&mce_bad_pages) << (PAGE_SHIFT - 10)
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
+		   HPAGE_PMD_NR)
+#endif
 		);
 
 	hugetlb_report_meminfo(m);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -114,6 +114,7 @@ enum zone_stat_item {
 	NUMA_LOCAL,		/* allocation from local node */
 	NUMA_OTHER,		/* allocation from other node */
 #endif
+	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_VM_ZONE_STAT_ITEMS };
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -747,6 +747,9 @@ static void __split_huge_page_refcount(s
 		lru_add_page_tail(zone, page, page_tail);
 	}
 
+	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+
 	ClearPageCompound(page);
 	compound_unlock(page);
 	spin_unlock_irq(&zone->lru_lock);
diff --git a/mm/rmap.c b/mm/rmap.c
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -882,8 +882,13 @@ void do_page_add_anon_rmap(struct page *
 	struct vm_area_struct *vma, unsigned long address, int exclusive)
 {
 	int first = atomic_inc_and_test(&page->_mapcount);
-	if (first)
-		__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (first) {
+		if (!PageTransHuge(page))
+			__inc_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__inc_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
+	}
 	if (unlikely(PageKsm(page)))
 		return;
 
@@ -911,7 +916,10 @@ void page_add_new_anon_rmap(struct page 
 	VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	SetPageSwapBacked(page);
 	atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
-	__inc_zone_page_state(page, NR_ANON_PAGES);
+	if (!PageTransHuge(page))
+		__inc_zone_page_state(page, NR_ANON_PAGES);
+	else
+		__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__page_set_anon_rmap(page, vma, address, 1);
 	if (page_evictable(page, vma))
 		lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -964,7 +972,11 @@ void page_remove_rmap(struct page *page)
 		return;
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
-		__dec_zone_page_state(page, NR_ANON_PAGES);
+		if (!PageTransHuge(page))
+			__dec_zone_page_state(page, NR_ANON_PAGES);
+		else
+			__dec_zone_page_state(page,
+					      NR_ANON_TRANSPARENT_HUGEPAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_update_file_mapped(page, -1);
diff --git a/mm/vmstat.c b/mm/vmstat.c
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -761,6 +761,7 @@ static const char * const vmstat_text[] 
 	"numa_local",
 	"numa_other",
 #endif
+	"nr_anon_transparent_hugepages",
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 	"pgpgin",

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 41 of 66] khugepaged
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add khugepaged to relocate fragmented pages into hugepages when new hugepages
become available. (This is independent of the defrag logic that will have to
make new hugepages available.)

The fundamental reason why khugepaged is unavoidable is that some
memory can be fragmented and not everything can be relocated. So when
a virtual machine quits and releases gigabytes of hugepages, we want
to use those freely available hugepages to create huge pmds in the
other virtual machines that may be running on fragmented memory, to
maximize CPU efficiency at all times. The scan is slow, so it takes
nearly zero cpu time, except when it copies data (in which case we
definitely want to pay for that cpu time), so it seems a good
tradeoff.

In addition to hugepages being released by other processes freeing memory, we
strongly suspect that the performance impact of potentially defragmenting
hugepages during or before each page fault could lead to more performance
inconsistency than allocating small pages at first and having them collapsed
into large pages later... if they prove themselves to be long lived mappings
(the khugepaged scan is slow, so short lived mappings have a low probability
of running into khugepaged compared to long lived mappings).
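
As a usage example (not part of the patch), the kind of mapping khugepaged is
meant to collapse: a large anonymous region, hinted with madvise(MADV_HUGEPAGE)
so it qualifies even when transparent hugepages are restricted to madvise'd
regions. MADV_HUGEPAGE comes from earlier patches in this series; the fallback
define below is an assumption for build environments whose headers lack it.

#define _GNU_SOURCE		/* for MAP_ANONYMOUS on older libc headers */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* value used on most architectures */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* 64M of anonymous memory */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (madvise(p, len, MADV_HUGEPAGE))	/* hint: may register the mm with khugepaged */
		perror("madvise(MADV_HUGEPAGE)");

	memset(p, 1, len);	/* touch it; khugepaged may collapse it later */
	return 0;
}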

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -25,6 +25,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
new file mode 100644
--- /dev/null
+++ b/include/linux/khugepaged.h
@@ -0,0 +1,66 @@
+#ifndef _LINUX_KHUGEPAGED_H
+#define _LINUX_KHUGEPAGED_H
+
+#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_exit(struct mm_struct *mm);
+extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma);
+
+#define khugepaged_enabled()					       \
+	(transparent_hugepage_flags &				       \
+	 ((1<<TRANSPARENT_HUGEPAGE_FLAG) |		       \
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
+#define khugepaged_always()				\
+	(transparent_hugepage_flags &			\
+	 (1<<TRANSPARENT_HUGEPAGE_FLAG))
+#define khugepaged_req_madv()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
+#define khugepaged_defrag()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
+
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+		return __khugepaged_enter(mm);
+	return 0;
+}
+
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+		__khugepaged_exit(mm);
+}
+
+static inline int khugepaged_enter(struct vm_area_struct *vma)
+{
+	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
+		if (khugepaged_always() ||
+		    (khugepaged_req_madv() &&
+		     vma->vm_flags & VM_HUGEPAGE))
+			if (__khugepaged_enter(vma->vm_mm))
+				return -ENOMEM;
+	return 0;
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	return 0;
+}
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+}
+static inline int khugepaged_enter(struct vm_area_struct *vma)
+{
+	return 0;
+}
+static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -431,6 +431,7 @@ extern int get_dumpable(struct mm_struct
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_VM_HUGEPAGE		17	/* set when VM_HUGEPAGE is set on vma */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
+#include <linux/khugepaged.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -328,6 +329,9 @@ static int dup_mmap(struct mm_struct *mm
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
 		goto out;
+	retval = khugepaged_fork(mm, oldmm);
+	if (retval)
+		goto out;
 
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
@@ -544,6 +548,7 @@ void mmput(struct mm_struct *mm)
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
 		ksm_exit(mm);
+		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -12,14 +12,111 @@
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+/*
+ * By default transparent hugepage support is enabled for all mappings
+ * and khugepaged scans all mappings. Defrag is only invoked by
+ * khugepaged hugepage allocations and by page faults inside
+ * MADV_HUGEPAGE regions to avoid the risk of slowing down short lived
+ * allocations.
+ */
 unsigned long transparent_hugepage_flags __read_mostly =
-	(1<<TRANSPARENT_HUGEPAGE_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+
+/* default scan 8*512 pte (or vmas) every 30 second */
+static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
+static unsigned int khugepaged_pages_collapsed;
+static unsigned int khugepaged_full_scans;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+/*
+ * By default, collapse a hugepage if at least one pte is mapped the
+ * way it would have been mapped if the vma had been large enough at
+ * page-fault time.
+ */
+static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
+
+static int khugepaged(void *none);
+static int mm_slots_hash_init(void);
+static int khugepaged_slab_init(void);
+static void khugepaged_slab_free(void);
+
+#define MM_SLOTS_HASH_HEADS 1024
+static struct hlist_head *mm_slots_hash __read_mostly;
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+	struct hlist_node hash;
+	struct list_head mm_node;
+	struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+	struct list_head mm_head;
+	struct mm_slot *mm_slot;
+	unsigned long address;
+} khugepaged_scan = {
+	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+static int start_khugepaged(void)
+{
+	int err = 0;
+	if (khugepaged_enabled()) {
+		int wakeup;
+		if (unlikely(!mm_slot_cache || !mm_slots_hash)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_thread)
+			khugepaged_thread = kthread_run(khugepaged, NULL,
+							"khugepaged");
+		if (unlikely(IS_ERR(khugepaged_thread))) {
+			printk(KERN_ERR
+			       "khugepaged: kthread_run(khugepaged) failed\n");
+			err = PTR_ERR(khugepaged_thread);
+			khugepaged_thread = NULL;
+		}
+		wakeup = !list_empty(&khugepaged_scan.mm_head);
+		mutex_unlock(&khugepaged_mutex);
+		if (wakeup)
+			wake_up_interruptible(&khugepaged_wait);
+	} else
+		/* wakeup to exit */
+		wake_up_interruptible(&khugepaged_wait);
+out:
+	return err;
+}
 
 #ifdef CONFIG_SYSFS
+
 static ssize_t double_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag enabled,
@@ -68,9 +165,19 @@ static ssize_t enabled_store(struct kobj
 			     struct kobj_attribute *attr,
 			     const char *buf, size_t count)
 {
-	return double_flag_store(kobj, attr, buf, count,
-				 TRANSPARENT_HUGEPAGE_FLAG,
-				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+	ssize_t ret;
+
+	ret = double_flag_store(kobj, attr, buf, count,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+
+	if (ret > 0) {
+		int err = start_khugepaged();
+		if (err)
+			ret = err;
+	}
+
+	return ret;
 }
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
@@ -153,20 +260,212 @@ static struct attribute *hugepage_attr[]
 
 static struct attribute_group hugepage_attr_group = {
 	.attrs = hugepage_attr,
-	.name = "transparent_hugepage",
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_scan_sleep_millisecs = msecs;
+	wake_up_interruptible(&khugepaged_wait);
+
+	return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+	__ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+	       scan_sleep_millisecs_store);
+
+static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
+}
+
+static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_alloc_sleep_millisecs = msecs;
+	wake_up_interruptible(&khugepaged_wait);
+
+	return count;
+}
+static struct kobj_attribute alloc_sleep_millisecs_attr =
+	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
+	       alloc_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long pages;
+
+	err = strict_strtoul(buf, 10, &pages);
+	if (err || !pages || pages > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_pages_to_scan = pages;
+
+	return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+	__ATTR(pages_to_scan, 0644, pages_to_scan_show,
+	       pages_to_scan_store);
+
+static ssize_t pages_collapsed_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
+}
+static struct kobj_attribute pages_collapsed_attr =
+	__ATTR_RO(pages_collapsed);
+
+static ssize_t full_scans_show(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_full_scans);
+}
+static struct kobj_attribute full_scans_attr =
+	__ATTR_RO(full_scans);
+
+static ssize_t khugepaged_defrag_show(struct kobject *kobj,
+				      struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static ssize_t khugepaged_defrag_store(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static struct kobj_attribute khugepaged_defrag_attr =
+	__ATTR(defrag, 0644, khugepaged_defrag_show,
+	       khugepaged_defrag_store);
+
+/*
+ * max_ptes_none controls whether khugepaged may collapse hugepages
+ * over unmapped ptes, potentially increasing the memory footprint of
+ * the vmas. When max_ptes_none is 0, khugepaged will not reduce the
+ * available free memory in the system as it runs. Increasing
+ * max_ptes_none allows khugepaged to consume more free memory during
+ * its scan (up to HPAGE_PMD_NR-1 unmapped ptes per collapsed hugepage).
+ */
+static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
+					     struct kobj_attribute *attr,
+					     char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
+}
+static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
+					      struct kobj_attribute *attr,
+					      const char *buf, size_t count)
+{
+	int err;
+	unsigned long max_ptes_none;
+
+	err = strict_strtoul(buf, 10, &max_ptes_none);
+	if (err || max_ptes_none > HPAGE_PMD_NR-1)
+		return -EINVAL;
+
+	khugepaged_max_ptes_none = max_ptes_none;
+
+	return count;
+}
+static struct kobj_attribute khugepaged_max_ptes_none_attr =
+	__ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
+	       khugepaged_max_ptes_none_store);
+
+static struct attribute *khugepaged_attr[] = {
+	&khugepaged_defrag_attr.attr,
+	&khugepaged_max_ptes_none_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_collapsed_attr.attr,
+	&full_scans_attr.attr,
+	&scan_sleep_millisecs_attr.attr,
+	&alloc_sleep_millisecs_attr.attr,
+	NULL,
+};
+
+static struct attribute_group khugepaged_attr_group = {
+	.attrs = khugepaged_attr,
+	.name = "khugepaged",
 };
 #endif /* CONFIG_SYSFS */
 
 static int __init hugepage_init(void)
 {
+	int err;
 #ifdef CONFIG_SYSFS
-	int err;
+	static struct kobject *hugepage_kobj;
 
-	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	err = -ENOMEM;
+	hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
+	if (unlikely(!hugepage_kobj)) {
+		printk(KERN_ERR "hugepage: failed kobject create\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &hugepage_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed register hugeage group\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &khugepaged_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed register hugeage group\n");
+		goto out;
+	}
+#endif
+
+	err = khugepaged_slab_init();
 	if (err)
-		printk(KERN_ERR "hugepage: register sysfs failed\n");
-#endif
-	return 0;
+		goto out;
+
+	err = mm_slots_hash_init();
+	if (err) {
+		khugepaged_slab_free();
+		goto out;
+	}
+
+	start_khugepaged();
+
+out:
+	return err;
 }
 module_init(hugepage_init)
 
@@ -285,6 +584,8 @@ int do_huge_pmd_anonymous_page(struct mm
 	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
+		if (unlikely(khugepaged_enter(vma)))
+			return VM_FAULT_OOM;
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
@@ -935,6 +1236,758 @@ int hugepage_madvise(unsigned long *vm_f
 	return 0;
 }
 
+static int __init khugepaged_slab_init(void)
+{
+	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+					  sizeof(struct mm_slot),
+					  __alignof__(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __init khugepaged_slab_free(void)
+{
+	kmem_cache_destroy(mm_slot_cache);
+	mm_slot_cache = NULL;
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+	if (!mm_slot_cache)	/* initialization failed */
+		return NULL;
+	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+	kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static int __init mm_slots_hash_init(void)
+{
+	mm_slots_hash = kzalloc(MM_SLOTS_HASH_HEADS * sizeof(struct hlist_head),
+				GFP_KERNEL);
+	if (!mm_slots_hash)
+		return -ENOMEM;
+	return 0;
+}
+
+#if 0
+static void __init mm_slots_hash_free(void)
+{
+	kfree(mm_slots_hash);
+	mm_slots_hash = NULL;
+}
+#endif
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	struct hlist_head *bucket;
+	struct hlist_node *node;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	hlist_for_each_entry(mm_slot, node, bucket, hash) {
+		if (mm == mm_slot->mm)
+			return mm_slot;
+	}
+	return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+				    struct mm_slot *mm_slot)
+{
+	struct hlist_head *bucket;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	mm_slot->mm = mm;
+	hlist_add_head(&mm_slot->hash, bucket);
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int wakeup;
+
+	mm_slot = alloc_mm_slot();
+	if (!mm_slot)
+		return -ENOMEM;
+
+	/* __khugepaged_exit() must not run from under us */
+	VM_BUG_ON(khugepaged_test_exit(mm));
+	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
+		free_mm_slot(mm_slot);
+		return 0;
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	insert_to_mm_slots_hash(mm, mm_slot);
+	/*
+	 * Insert just behind the scanning cursor, to let the area settle
+	 * down a little.
+	 */
+	wakeup = list_empty(&khugepaged_scan.mm_head);
+	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+	spin_unlock(&khugepaged_mm_lock);
+
+	atomic_inc(&mm->mm_count);
+	if (wakeup)
+		wake_up_interruptible(&khugepaged_wait);
+
+	return 0;
+}
+
+int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+{
+	unsigned long hstart, hend;
+	if (!vma->anon_vma)
+		/*
+		 * Not yet faulted in so we will register later in the
+		 * page fault if needed.
+		 */
+		return 0;
+	if (vma->vm_file || vma->vm_ops)
+		/* khugepaged not yet working on file or special mappings */
+		return 0;
+	VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = vma->vm_end & HPAGE_PMD_MASK;
+	if (hstart < hend)
+		return khugepaged_enter(vma);
+	return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int free = 0;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		free = 1;
+	}
+
+	if (free) {
+		spin_unlock(&khugepaged_mm_lock);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		spin_unlock(&khugepaged_mm_lock);
+		/*
+		 * This is required to serialize against
+		 * khugepaged_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Wait here until khugepaged
+		 * has finished working on the pagetables under the
+		 * mmap_sem; after we return, all pagetables will be
+		 * destroyed.
+		 */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	} else
+		spin_unlock(&khugepaged_mm_lock);
+}
+
+static void release_pte_page(struct page *page)
+{
+	/* 0 stands for page_is_file_cache(page) == false */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	unlock_page(page);
+	putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+	while (--_pte >= pte) {
+		pte_t pteval = *_pte;
+		if (!pte_none(pteval))
+			release_pte_page(pte_page(pteval));
+	}
+}
+
+static void release_all_pte_pages(pte_t *pte)
+{
+	release_pte_pages(pte, pte + HPAGE_PMD_NR);
+}
+
+static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
+					unsigned long address,
+					pte_t *pte)
+{
+	struct page *page;
+	pte_t *_pte;
+	int referenced = 0, isolated = 0, none = 0;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (pte_none(pteval)) {
+			if (++none <= khugepaged_max_ptes_none)
+				continue;
+			else {
+				release_pte_pages(pte, _pte);
+				goto out;
+			}
+		}
+		if (!pte_present(pteval) || !pte_write(pteval)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		page = vm_normal_page(vma, address, pteval);
+		if (unlikely(!page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		VM_BUG_ON(PageCompound(page));
+		BUG_ON(!PageAnon(page));
+		VM_BUG_ON(!PageSwapBacked(page));
+
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * We can do it before isolate_lru_page because the
+		 * page can't be freed from under us. NOTE: PG_lock
+		 * is needed to serialize against split_huge_page
+		 * when invoked from the VM.
+		 */
+		if (!trylock_page(page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * Isolate the page to avoid collapsing an hugepage
+		 * currently in use by the VM.
+		 */
+		if (isolate_lru_page(page)) {
+			unlock_page(page);
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/* 0 stands for page_is_file_cache(page) == false */
+		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		VM_BUG_ON(!PageLocked(page));
+		VM_BUG_ON(PageLRU(page));
+
+		/* Don't collapse the page if none of the mapped ptes are young */
+		if (pte_young(pteval))
+			referenced = 1;
+	}
+	if (unlikely(!referenced))
+		release_all_pte_pages(pte);
+	else
+		isolated = 1;
+out:
+	return isolated;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+				      struct vm_area_struct *vma,
+				      unsigned long address,
+				      spinlock_t *ptl)
+{
+	pte_t *_pte;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+		pte_t pteval = *_pte;
+		struct page *src_page;
+
+		if (pte_none(pteval)) {
+			clear_user_highpage(page, address);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+		} else {
+			src_page = pte_page(pteval);
+			copy_user_highpage(page, src_page, address, vma);
+			VM_BUG_ON(page_mapcount(src_page) != 1);
+			VM_BUG_ON(page_count(src_page) != 2);
+			release_pte_page(src_page);
+			/*
+			 * ptl mostly unnecessary, but preempt has to
+			 * be disabled to update the per-cpu stats
+			 * inside page_remove_rmap().
+			 */
+			spin_lock(ptl);
+			/*
+			 * paravirt calls inside pte_clear here are
+			 * superfluous.
+			 */
+			pte_clear(vma->vm_mm, address, _pte);
+			page_remove_rmap(src_page);
+			spin_unlock(ptl);
+			free_page_and_swap_cache(src_page);
+		}
+
+		address += PAGE_SIZE;
+		page++;
+	}
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	struct vm_area_struct *vma;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, _pmd;
+	pte_t *pte;
+	pgtable_t pgtable;
+	struct page *new_page;
+	spinlock_t *ptl;
+	int isolated;
+	unsigned long hstart, hend;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(!*hpage);
+
+	/*
+	 * Prevent all access to pagetables with the exception of
+	 * gup_fast later handled by the ptep_clear_flush and the VM
+	 * handled by the anon_vma lock + PG_lock.
+	 */
+	down_write(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto out;
+
+	vma = find_vma(mm, address);
+	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = vma->vm_end & HPAGE_PMD_MASK;
+	if (address < hstart || address + HPAGE_PMD_SIZE > hend)
+		goto out;
+
+	if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always())
+		goto out;
+
+	/* VM_PFNMAP vmas may have vm_ops null but vm_file set */
+	if (!vma->anon_vma || vma->vm_ops || vma->vm_file)
+		goto out;
+	VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	/* pmd can't go away or become huge under us */
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	new_page = *hpage;
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
+		goto out;
+
+	anon_vma_lock(vma->anon_vma);
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+
+	spin_lock(&mm->page_table_lock); /* probably unnecessary */
+	/*
+	 * After this gup_fast can't run anymore. This also removes
+	 * any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address
+	 * to avoid the risk of CPU bugs in that area.
+	 */
+	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	spin_lock(ptl);
+	isolated = __collapse_huge_page_isolate(vma, address, pte);
+	spin_unlock(ptl);
+	pte_unmap(pte);
+
+	if (unlikely(!isolated)) {
+		spin_lock(&mm->page_table_lock);
+		BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, address, pmd, _pmd);
+		spin_unlock(&mm->page_table_lock);
+		anon_vma_unlock(vma->anon_vma);
+		mem_cgroup_uncharge_page(new_page);
+		goto out;
+	}
+
+	/*
+	 * All pages are isolated and locked so anon_vma rmap
+	 * can't run anymore.
+	 */
+	anon_vma_unlock(vma->anon_vma);
+
+	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	__SetPageUptodate(new_page);
+	pgtable = pmd_pgtable(_pmd);
+	VM_BUG_ON(page_count(pgtable) != 1);
+	VM_BUG_ON(page_mapcount(pgtable) != 0);
+
+	_pmd = mk_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+	_pmd = pmd_mkhuge(_pmd);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to prevent the copy_huge_page writes from
+	 * becoming visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	BUG_ON(!pmd_none(*pmd));
+	page_add_new_anon_rmap(new_page, vma, address);
+	set_pmd_at(mm, address, pmd, _pmd);
+	update_mmu_cache(vma, address, entry);
+	prepare_pmd_huge_pte(pgtable, mm);
+	mm->nr_ptes--;
+	spin_unlock(&mm->page_table_lock);
+
+	*hpage = NULL;
+	khugepaged_pages_collapsed++;
+out:
+	up_write(&mm->mmap_sem);
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	int ret = 0, referenced = 0, none = 0;
+	struct page *page;
+	unsigned long _address;
+	spinlock_t *ptl;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (pte_none(pteval)) {
+			if (++none <= khugepaged_max_ptes_none)
+				continue;
+			else
+				goto out_unmap;
+		}
+		if (!pte_present(pteval) || !pte_write(pteval))
+			goto out_unmap;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			goto out_unmap;
+		VM_BUG_ON(PageCompound(page));
+		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
+			goto out_unmap;
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1)
+			goto out_unmap;
+		if (pte_young(pteval))
+			referenced = 1;
+	}
+	if (referenced)
+		ret = 1;
+out_unmap:
+	pte_unmap_unlock(pte, ptl);
+	if (ret) {
+		up_read(&mm->mmap_sem);
+		collapse_huge_page(mm, address, hpage);
+	}
+out:
+	return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+
+		/*
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		 */
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+					    struct page **hpage)
+{
+	struct mm_slot *mm_slot;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	VM_BUG_ON(!pages);
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_scan.mm_slot)
+		mm_slot = khugepaged_scan.mm_slot;
+	else {
+		mm_slot = list_entry(khugepaged_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		khugepaged_scan.address = 0;
+		khugepaged_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&khugepaged_mm_lock);
+
+	mm = mm_slot->mm;
+	down_read(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, khugepaged_scan.address);
+
+	progress++;
+	for (; vma; vma = vma->vm_next) {
+		unsigned long hstart, hend;
+
+		cond_resched();
+		if (unlikely(khugepaged_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!(vma->vm_flags & VM_HUGEPAGE) &&
+		    !khugepaged_always()) {
+			progress++;
+			continue;
+		}
+
+		/* VM_PFNMAP vmas may have vm_ops null but vm_file set */
+		if (!vma->anon_vma || vma->vm_ops || vma->vm_file) {
+			khugepaged_scan.address = vma->vm_end;
+			progress++;
+			continue;
+		}
+		VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+
+		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+		hend = vma->vm_end & HPAGE_PMD_MASK;
+		if (hstart >= hend) {
+			progress++;
+			continue;
+		}
+		if (khugepaged_scan.address < hstart)
+			khugepaged_scan.address = hstart;
+		if (khugepaged_scan.address > hend) {
+			khugepaged_scan.address = hend + HPAGE_PMD_SIZE;
+			progress++;
+			continue;
+		}
+		BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+		while (khugepaged_scan.address < hend) {
+			int ret;
+			cond_resched();
+			if (unlikely(khugepaged_test_exit(mm)))
+				goto breakouterloop;
+
+			VM_BUG_ON(khugepaged_scan.address < hstart ||
+				  khugepaged_scan.address + HPAGE_PMD_SIZE >
+				  hend);
+			ret = khugepaged_scan_pmd(mm, vma,
+						  khugepaged_scan.address,
+						  hpage);
+			/* move to next address */
+			khugepaged_scan.address += HPAGE_PMD_SIZE;
+			progress += HPAGE_PMD_NR;
+			if (ret)
+				/* we released mmap_sem so break loop */
+				goto breakouterloop_mmap_sem;
+			if (progress >= pages)
+				goto breakouterloop;
+		}
+	}
+breakouterloop:
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+	spin_lock(&khugepaged_mm_lock);
+	BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (khugepaged_test_exit(mm) || !vma) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * khugepaged runs here, khugepaged_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+			khugepaged_scan.mm_slot = list_entry(
+				mm_slot->mm_node.next,
+				struct mm_slot, mm_node);
+			khugepaged_scan.address = 0;
+		} else {
+			khugepaged_scan.mm_slot = NULL;
+			khugepaged_full_scans++;
+		}
+
+		collect_mm_slot(mm_slot);
+	}
+
+	return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) &&
+		khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) ||
+		!khugepaged_enabled();
+}
+
+static void khugepaged_do_scan(struct page **hpage)
+{
+	unsigned int progress = 0, pass_through_head = 0;
+	unsigned int pages = khugepaged_pages_to_scan;
+
+	barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+	while (progress < pages) {
+		cond_resched();
+
+		if (!*hpage) {
+			*hpage = alloc_hugepage(khugepaged_defrag());
+			if (unlikely(!*hpage))
+				break;
+		}
+
+		spin_lock(&khugepaged_mm_lock);
+		if (!khugepaged_scan.mm_slot)
+			pass_through_head++;
+		if (khugepaged_has_work() &&
+		    pass_through_head < 2)
+			progress += khugepaged_scan_mm_slot(pages - progress,
+							    hpage);
+		else
+			progress = pages;
+		spin_unlock(&khugepaged_mm_lock);
+	}
+}
+
+static struct page *khugepaged_alloc_hugepage(void)
+{
+	struct page *hpage;
+
+	do {
+		hpage = alloc_hugepage(khugepaged_defrag());
+		if (!hpage) {
+			DEFINE_WAIT(wait);
+			add_wait_queue(&khugepaged_wait, &wait);
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_alloc_sleep_millisecs));
+			remove_wait_queue(&khugepaged_wait, &wait);
+		}
+	} while (unlikely(!hpage) &&
+		 likely(khugepaged_enabled()));
+	return hpage;
+}
+
+static void khugepaged_loop(void)
+{
+	struct page *hpage;
+
+	while (likely(khugepaged_enabled())) {
+		hpage = khugepaged_alloc_hugepage();
+		if (unlikely(!hpage))
+			break;
+
+		khugepaged_do_scan(&hpage);
+		if (hpage)
+			put_page(hpage);
+		if (khugepaged_has_work()) {
+			DEFINE_WAIT(wait);
+			if (!khugepaged_scan_sleep_millisecs)
+				continue;
+			add_wait_queue(&khugepaged_wait, &wait);
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_scan_sleep_millisecs));
+			remove_wait_queue(&khugepaged_wait, &wait);
+		} else if (khugepaged_enabled())
+			wait_event_interruptible(khugepaged_wait,
+						 khugepaged_wait_event());
+	}
+}
+
+static int khugepaged(void *none)
+{
+	struct mm_slot *mm_slot;
+
+	set_user_nice(current, 19);
+
+	/* serialize with start_khugepaged() */
+	mutex_lock(&khugepaged_mutex);
+
+	for (;;) {
+		mutex_unlock(&khugepaged_mutex);
+		BUG_ON(khugepaged_thread != current);
+		khugepaged_loop();
+		BUG_ON(khugepaged_thread != current);
+
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_enabled())
+			break;
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = khugepaged_scan.mm_slot;
+	khugepaged_scan.mm_slot = NULL;
+	if (mm_slot)
+		collect_mm_slot(mm_slot);
+	spin_unlock(&khugepaged_mm_lock);
+
+	khugepaged_thread = NULL;
+	mutex_unlock(&khugepaged_mutex);
+
+	return 0;
+}
+
 void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 {
 	struct page *page;

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 41 of 66] khugepaged
@ 2010-11-03 15:28   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add khugepaged to relocate fragmented pages into hugepages when new hugepages
become available. (This is independent of the defrag logic that will have to
make new hugepages available.)

The fundamental reason why khugepaged is unavoidable is that some
memory can be fragmented and not everything can be relocated. So when
a virtual machine quits and releases gigabytes of hugepages, we want
to use those freely available hugepages to create huge-pmds in the
other virtual machines that may be running on fragmented memory, to
maximize CPU efficiency at all times. The scan is slow and takes
nearly zero cpu time, except when it copies data (in which case we
definitely want to pay that cpu time), so it seems a good tradeoff.

In addition to hugepages released by other processes freeing memory, we
strongly suspect that the performance impact of potentially defragmenting
hugepages during or before each page fault could lead to more performance
inconsistency than allocating small pages at first and having them collapsed
into large pages later... if they prove themselves to be long-lived mappings
(the khugepaged scan is slow, so short-lived mappings have a low probability
of running into khugepaged compared to long-lived mappings).
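
As a rough userspace illustration of how a mapping becomes a khugepaged
candidate (a minimal sketch, not part of the patch, assuming the MADV_HUGEPAGE
madvise hint from earlier in this series is exported to the userspace
headers): the application maps anonymous memory, hints it with MADV_HUGEPAGE
and keeps it mapped; the huge pmd fault path then calls khugepaged_enter() and
the region can be collapsed into huge-pmds by a later scan.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN (16UL * 1024 * 1024)	/* a few pmd-sized (2M on x86-64) ranges */

int main(void)
{
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
#ifdef MADV_HUGEPAGE
	/*
	 * Hint that this vma is a THP candidate; in "madvise" mode this
	 * is what makes the fault path register the mm with khugepaged.
	 */
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
#endif
	/* Touch the memory with 4k faults so khugepaged has ptes to collapse */
	memset(p, 0x5a, LEN);

	pause();	/* keep the mapping long-lived while khugepaged scans */
	return 0;
}

With transparent_hugepage/enabled set to "always" the madvise() call is not
even needed: the huge pmd fault path calls khugepaged_enter() for anonymous
vmas regardless of the hint. The scan rate and aggressiveness can then be
tuned at runtime through /sys/kernel/mm/transparent_hugepage/khugepaged/
(pages_to_scan, scan_sleep_millisecs, alloc_sleep_millisecs, max_ptes_none).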

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -25,6 +25,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
new file mode 100644
--- /dev/null
+++ b/include/linux/khugepaged.h
@@ -0,0 +1,66 @@
+#ifndef _LINUX_KHUGEPAGED_H
+#define _LINUX_KHUGEPAGED_H
+
+#include <linux/sched.h> /* MMF_VM_HUGEPAGE */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern int __khugepaged_enter(struct mm_struct *mm);
+extern void __khugepaged_exit(struct mm_struct *mm);
+extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma);
+
+#define khugepaged_enabled()					       \
+	(transparent_hugepage_flags &				       \
+	 ((1<<TRANSPARENT_HUGEPAGE_FLAG) |		       \
+	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)))
+#define khugepaged_always()				\
+	(transparent_hugepage_flags &			\
+	 (1<<TRANSPARENT_HUGEPAGE_FLAG))
+#define khugepaged_req_madv()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))
+#define khugepaged_defrag()					\
+	(transparent_hugepage_flags &				\
+	 (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
+
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+		return __khugepaged_enter(mm);
+	return 0;
+}
+
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
+		__khugepaged_exit(mm);
+}
+
+static inline int khugepaged_enter(struct vm_area_struct *vma)
+{
+	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags))
+		if (khugepaged_always() ||
+		    (khugepaged_req_madv() &&
+		     vma->vm_flags & VM_HUGEPAGE))
+			if (__khugepaged_enter(vma->vm_mm))
+				return -ENOMEM;
+	return 0;
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
+{
+	return 0;
+}
+static inline void khugepaged_exit(struct mm_struct *mm)
+{
+}
+static inline int khugepaged_enter(struct vm_area_struct *vma)
+{
+	return 0;
+}
+static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+#endif /* _LINUX_KHUGEPAGED_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -431,6 +431,7 @@ extern int get_dumpable(struct mm_struct
 #endif
 					/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE	16	/* KSM may merge identical pages */
+#define MMF_VM_HUGEPAGE		17	/* set when VM_HUGEPAGE is set on vma */
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
+#include <linux/khugepaged.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -328,6 +329,9 @@ static int dup_mmap(struct mm_struct *mm
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
 		goto out;
+	retval = khugepaged_fork(mm, oldmm);
+	if (retval)
+		goto out;
 
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
@@ -544,6 +548,7 @@ void mmput(struct mm_struct *mm)
 	if (atomic_dec_and_test(&mm->mm_users)) {
 		exit_aio(mm);
 		ksm_exit(mm);
+		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -12,14 +12,111 @@
 #include <linux/mmu_notifier.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
+#include <linux/mm_inline.h>
+#include <linux/kthread.h>
+#include <linux/khugepaged.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+/*
+ * By default transparent hugepage support is enabled for all mappings
+ * and khugepaged scans all mappings. Defrag is only invoked by
+ * khugepaged hugepage allocations and by page faults inside
+ * MADV_HUGEPAGE regions to avoid the risk of slowing down short lived
+ * allocations.
+ */
 unsigned long transparent_hugepage_flags __read_mostly =
-	(1<<TRANSPARENT_HUGEPAGE_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+
+/* default: scan 8*512 ptes (or vmas) every 10 seconds */
+static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
+static unsigned int khugepaged_pages_collapsed;
+static unsigned int khugepaged_full_scans;
+static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+/* during fragmentation poll the hugepage allocator once every minute */
+static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
+static struct task_struct *khugepaged_thread __read_mostly;
+static DEFINE_MUTEX(khugepaged_mutex);
+static DEFINE_SPINLOCK(khugepaged_mm_lock);
+static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
+/*
+ * By default, collapse a hugepage if at least one pte is mapped the
+ * way it would have been mapped if the vma had been large enough at
+ * page-fault time.
+ */
+static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
+
+static int khugepaged(void *none);
+static int mm_slots_hash_init(void);
+static int khugepaged_slab_init(void);
+static void khugepaged_slab_free(void);
+
+#define MM_SLOTS_HASH_HEADS 1024
+static struct hlist_head *mm_slots_hash __read_mostly;
+static struct kmem_cache *mm_slot_cache __read_mostly;
+
+/**
+ * struct mm_slot - hash lookup from mm to mm_slot
+ * @hash: hash collision list
+ * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
+ * @mm: the mm that this information is valid for
+ */
+struct mm_slot {
+	struct hlist_node hash;
+	struct list_head mm_node;
+	struct mm_struct *mm;
+};
+
+/**
+ * struct khugepaged_scan - cursor for scanning
+ * @mm_head: the head of the mm list to scan
+ * @mm_slot: the current mm_slot we are scanning
+ * @address: the next address inside that to be scanned
+ *
+ * There is only the one khugepaged_scan instance of this cursor structure.
+ */
+struct khugepaged_scan {
+	struct list_head mm_head;
+	struct mm_slot *mm_slot;
+	unsigned long address;
+} khugepaged_scan = {
+	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
+};
+
+static int start_khugepaged(void)
+{
+	int err = 0;
+	if (khugepaged_enabled()) {
+		int wakeup;
+		if (unlikely(!mm_slot_cache || !mm_slots_hash)) {
+			err = -ENOMEM;
+			goto out;
+		}
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_thread)
+			khugepaged_thread = kthread_run(khugepaged, NULL,
+							"khugepaged");
+		if (unlikely(IS_ERR(khugepaged_thread))) {
+			printk(KERN_ERR
+			       "khugepaged: kthread_run(khugepaged) failed\n");
+			err = PTR_ERR(khugepaged_thread);
+			khugepaged_thread = NULL;
+		}
+		wakeup = !list_empty(&khugepaged_scan.mm_head);
+		mutex_unlock(&khugepaged_mutex);
+		if (wakeup)
+			wake_up_interruptible(&khugepaged_wait);
+	} else
+		/* wakeup to exit */
+		wake_up_interruptible(&khugepaged_wait);
+out:
+	return err;
+}
 
 #ifdef CONFIG_SYSFS
+
 static ssize_t double_flag_show(struct kobject *kobj,
 				struct kobj_attribute *attr, char *buf,
 				enum transparent_hugepage_flag enabled,
@@ -68,9 +165,19 @@ static ssize_t enabled_store(struct kobj
 			     struct kobj_attribute *attr,
 			     const char *buf, size_t count)
 {
-	return double_flag_store(kobj, attr, buf, count,
-				 TRANSPARENT_HUGEPAGE_FLAG,
-				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+	ssize_t ret;
+
+	ret = double_flag_store(kobj, attr, buf, count,
+				TRANSPARENT_HUGEPAGE_FLAG,
+				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
+
+	if (ret > 0) {
+		int err = start_khugepaged();
+		if (err)
+			ret = err;
+	}
+
+	return ret;
 }
 static struct kobj_attribute enabled_attr =
 	__ATTR(enabled, 0644, enabled_show, enabled_store);
@@ -153,20 +260,212 @@ static struct attribute *hugepage_attr[]
 
 static struct attribute_group hugepage_attr_group = {
 	.attrs = hugepage_attr,
-	.name = "transparent_hugepage",
+};
+
+static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
+}
+
+static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_scan_sleep_millisecs = msecs;
+	wake_up_interruptible(&khugepaged_wait);
+
+	return count;
+}
+static struct kobj_attribute scan_sleep_millisecs_attr =
+	__ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
+	       scan_sleep_millisecs_store);
+
+static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
+					  struct kobj_attribute *attr,
+					  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
+}
+
+static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = strict_strtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_alloc_sleep_millisecs = msecs;
+	wake_up_interruptible(&khugepaged_wait);
+
+	return count;
+}
+static struct kobj_attribute alloc_sleep_millisecs_attr =
+	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
+	       alloc_sleep_millisecs_store);
+
+static ssize_t pages_to_scan_show(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
+}
+static ssize_t pages_to_scan_store(struct kobject *kobj,
+				   struct kobj_attribute *attr,
+				   const char *buf, size_t count)
+{
+	int err;
+	unsigned long pages;
+
+	err = strict_strtoul(buf, 10, &pages);
+	if (err || !pages || pages > UINT_MAX)
+		return -EINVAL;
+
+	khugepaged_pages_to_scan = pages;
+
+	return count;
+}
+static struct kobj_attribute pages_to_scan_attr =
+	__ATTR(pages_to_scan, 0644, pages_to_scan_show,
+	       pages_to_scan_store);
+
+static ssize_t pages_collapsed_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
+}
+static struct kobj_attribute pages_collapsed_attr =
+	__ATTR_RO(pages_collapsed);
+
+static ssize_t full_scans_show(struct kobject *kobj,
+			       struct kobj_attribute *attr,
+			       char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_full_scans);
+}
+static struct kobj_attribute full_scans_attr =
+	__ATTR_RO(full_scans);
+
+static ssize_t khugepaged_defrag_show(struct kobject *kobj,
+				      struct kobj_attribute *attr, char *buf)
+{
+	return single_flag_show(kobj, attr, buf,
+				TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static ssize_t khugepaged_defrag_store(struct kobject *kobj,
+				       struct kobj_attribute *attr,
+				       const char *buf, size_t count)
+{
+	return single_flag_store(kobj, attr, buf, count,
+				 TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+}
+static struct kobj_attribute khugepaged_defrag_attr =
+	__ATTR(defrag, 0644, khugepaged_defrag_show,
+	       khugepaged_defrag_store);
+
+/*
+ * max_ptes_none controls whether khugepaged may collapse hugepages
+ * over unmapped ptes, potentially increasing the memory footprint of
+ * the vmas. When max_ptes_none is 0, khugepaged will not reduce the
+ * available free memory in the system as it runs. Increasing
+ * max_ptes_none allows khugepaged to consume more free memory during
+ * its scan (up to HPAGE_PMD_NR-1 unmapped ptes per collapsed hugepage).
+ */
+static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
+					     struct kobj_attribute *attr,
+					     char *buf)
+{
+	return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
+}
+static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
+					      struct kobj_attribute *attr,
+					      const char *buf, size_t count)
+{
+	int err;
+	unsigned long max_ptes_none;
+
+	err = strict_strtoul(buf, 10, &max_ptes_none);
+	if (err || max_ptes_none > HPAGE_PMD_NR-1)
+		return -EINVAL;
+
+	khugepaged_max_ptes_none = max_ptes_none;
+
+	return count;
+}
+static struct kobj_attribute khugepaged_max_ptes_none_attr =
+	__ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
+	       khugepaged_max_ptes_none_store);
+
+static struct attribute *khugepaged_attr[] = {
+	&khugepaged_defrag_attr.attr,
+	&khugepaged_max_ptes_none_attr.attr,
+	&pages_to_scan_attr.attr,
+	&pages_collapsed_attr.attr,
+	&full_scans_attr.attr,
+	&scan_sleep_millisecs_attr.attr,
+	&alloc_sleep_millisecs_attr.attr,
+	NULL,
+};
+
+static struct attribute_group khugepaged_attr_group = {
+	.attrs = khugepaged_attr,
+	.name = "khugepaged",
 };
 #endif /* CONFIG_SYSFS */
 
 static int __init hugepage_init(void)
 {
+	int err;
 #ifdef CONFIG_SYSFS
-	int err;
+	static struct kobject *hugepage_kobj;
 
-	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
+	err = -ENOMEM;
+	hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
+	if (unlikely(!hugepage_kobj)) {
+		printk(KERN_ERR "hugepage: failed kobject create\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &hugepage_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed register hugeage group\n");
+		goto out;
+	}
+
+	err = sysfs_create_group(hugepage_kobj, &khugepaged_attr_group);
+	if (err) {
+		printk(KERN_ERR "hugepage: failed register hugeage group\n");
+		goto out;
+	}
+#endif
+
+	err = khugepaged_slab_init();
 	if (err)
-		printk(KERN_ERR "hugepage: register sysfs failed\n");
-#endif
-	return 0;
+		goto out;
+
+	err = mm_slots_hash_init();
+	if (err) {
+		khugepaged_slab_free();
+		goto out;
+	}
+
+	start_khugepaged();
+
+out:
+	return err;
 }
 module_init(hugepage_init)
 
@@ -285,6 +584,8 @@ int do_huge_pmd_anonymous_page(struct mm
 	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
 		if (unlikely(anon_vma_prepare(vma)))
 			return VM_FAULT_OOM;
+		if (unlikely(khugepaged_enter(vma)))
+			return VM_FAULT_OOM;
 		page = alloc_hugepage(transparent_hugepage_defrag(vma));
 		if (unlikely(!page))
 			goto out;
@@ -935,6 +1236,758 @@ int hugepage_madvise(unsigned long *vm_f
 	return 0;
 }
 
+static int __init khugepaged_slab_init(void)
+{
+	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
+					  sizeof(struct mm_slot),
+					  __alignof__(struct mm_slot), 0, NULL);
+	if (!mm_slot_cache)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void __init khugepaged_slab_free(void)
+{
+	kmem_cache_destroy(mm_slot_cache);
+	mm_slot_cache = NULL;
+}
+
+static inline struct mm_slot *alloc_mm_slot(void)
+{
+	if (!mm_slot_cache)	/* initialization failed */
+		return NULL;
+	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
+}
+
+static inline void free_mm_slot(struct mm_slot *mm_slot)
+{
+	kmem_cache_free(mm_slot_cache, mm_slot);
+}
+
+static int __init mm_slots_hash_init(void)
+{
+	mm_slots_hash = kzalloc(MM_SLOTS_HASH_HEADS * sizeof(struct hlist_head),
+				GFP_KERNEL);
+	if (!mm_slots_hash)
+		return -ENOMEM;
+	return 0;
+}
+
+#if 0
+static void __init mm_slots_hash_free(void)
+{
+	kfree(mm_slots_hash);
+	mm_slots_hash = NULL;
+}
+#endif
+
+static struct mm_slot *get_mm_slot(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	struct hlist_head *bucket;
+	struct hlist_node *node;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	hlist_for_each_entry(mm_slot, node, bucket, hash) {
+		if (mm == mm_slot->mm)
+			return mm_slot;
+	}
+	return NULL;
+}
+
+static void insert_to_mm_slots_hash(struct mm_struct *mm,
+				    struct mm_slot *mm_slot)
+{
+	struct hlist_head *bucket;
+
+	bucket = &mm_slots_hash[((unsigned long)mm / sizeof(struct mm_struct))
+				% MM_SLOTS_HASH_HEADS];
+	mm_slot->mm = mm;
+	hlist_add_head(&mm_slot->hash, bucket);
+}
+
+static inline int khugepaged_test_exit(struct mm_struct *mm)
+{
+	return atomic_read(&mm->mm_users) == 0;
+}
+
+int __khugepaged_enter(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int wakeup;
+
+	mm_slot = alloc_mm_slot();
+	if (!mm_slot)
+		return -ENOMEM;
+
+	/* __khugepaged_exit() must not run from under us */
+	VM_BUG_ON(khugepaged_test_exit(mm));
+	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
+		free_mm_slot(mm_slot);
+		return 0;
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	insert_to_mm_slots_hash(mm, mm_slot);
+	/*
+	 * Insert just behind the scanning cursor, to let the area settle
+	 * down a little.
+	 */
+	wakeup = list_empty(&khugepaged_scan.mm_head);
+	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
+	spin_unlock(&khugepaged_mm_lock);
+
+	atomic_inc(&mm->mm_count);
+	if (wakeup)
+		wake_up_interruptible(&khugepaged_wait);
+
+	return 0;
+}
+
+int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
+{
+	unsigned long hstart, hend;
+	if (!vma->anon_vma)
+		/*
+		 * Not yet faulted in so we will register later in the
+		 * page fault if needed.
+		 */
+		return 0;
+	if (vma->vm_file || vma->vm_ops)
+		/* khugepaged not yet working on file or special mappings */
+		return 0;
+	VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = vma->vm_end & HPAGE_PMD_MASK;
+	if (hstart < hend)
+		return khugepaged_enter(vma);
+	return 0;
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+	int free = 0;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+		free = 1;
+	}
+
+	if (free) {
+		spin_unlock(&khugepaged_mm_lock);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	} else if (mm_slot) {
+		spin_unlock(&khugepaged_mm_lock);
+		/*
+		 * This is required to serialize against
+		 * khugepaged_test_exit() (which is guaranteed to run
+		 * under mmap sem read mode). Wait here until khugepaged
+		 * has finished working on the pagetables under the
+		 * mmap_sem; after we return, all pagetables will be
+		 * destroyed.
+		 */
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	} else
+		spin_unlock(&khugepaged_mm_lock);
+}
+
+static void release_pte_page(struct page *page)
+{
+	/* 0 stands for page_is_file_cache(page) == false */
+	dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+	unlock_page(page);
+	putback_lru_page(page);
+}
+
+static void release_pte_pages(pte_t *pte, pte_t *_pte)
+{
+	while (--_pte >= pte) {
+		pte_t pteval = *_pte;
+		if (!pte_none(pteval))
+			release_pte_page(pte_page(pteval));
+	}
+}
+
+static void release_all_pte_pages(pte_t *pte)
+{
+	release_pte_pages(pte, pte + HPAGE_PMD_NR);
+}
+
+static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
+					unsigned long address,
+					pte_t *pte)
+{
+	struct page *page;
+	pte_t *_pte;
+	int referenced = 0, isolated = 0, none = 0;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (pte_none(pteval)) {
+			if (++none <= khugepaged_max_ptes_none)
+				continue;
+			else {
+				release_pte_pages(pte, _pte);
+				goto out;
+			}
+		}
+		if (!pte_present(pteval) || !pte_write(pteval)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		page = vm_normal_page(vma, address, pteval);
+		if (unlikely(!page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		VM_BUG_ON(PageCompound(page));
+		BUG_ON(!PageAnon(page));
+		VM_BUG_ON(!PageSwapBacked(page));
+
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * We can do it before isolate_lru_page because the
+		 * page can't be freed from under us. NOTE: PG_lock
+		 * is needed to serialize against split_huge_page
+		 * when invoked from the VM.
+		 */
+		if (!trylock_page(page)) {
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/*
+		 * Isolate the page to avoid collapsing an hugepage
+		 * currently in use by the VM.
+		 */
+		if (isolate_lru_page(page)) {
+			unlock_page(page);
+			release_pte_pages(pte, _pte);
+			goto out;
+		}
+		/* 0 stands for page_is_file_cache(page) == false */
+		inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+		VM_BUG_ON(!PageLocked(page));
+		VM_BUG_ON(PageLRU(page));
+
+		/* Don't collapse the page if none of the mapped ptes are young */
+		if (pte_young(pteval))
+			referenced = 1;
+	}
+	if (unlikely(!referenced))
+		release_all_pte_pages(pte);
+	else
+		isolated = 1;
+out:
+	return isolated;
+}
+
+static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+				      struct vm_area_struct *vma,
+				      unsigned long address,
+				      spinlock_t *ptl)
+{
+	pte_t *_pte;
+	for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
+		pte_t pteval = *_pte;
+		struct page *src_page;
+
+		if (pte_none(pteval)) {
+			clear_user_highpage(page, address);
+			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+		} else {
+			src_page = pte_page(pteval);
+			copy_user_highpage(page, src_page, address, vma);
+			VM_BUG_ON(page_mapcount(src_page) != 1);
+			VM_BUG_ON(page_count(src_page) != 2);
+			release_pte_page(src_page);
+			/*
+			 * ptl mostly unnecessary, but preempt has to
+			 * be disabled to update the per-cpu stats
+			 * inside page_remove_rmap().
+			 */
+			spin_lock(ptl);
+			/*
+			 * paravirt calls inside pte_clear here are
+			 * superfluous.
+			 */
+			pte_clear(vma->vm_mm, address, _pte);
+			page_remove_rmap(src_page);
+			spin_unlock(ptl);
+			free_page_and_swap_cache(src_page);
+		}
+
+		address += PAGE_SIZE;
+		page++;
+	}
+}
+
+static void collapse_huge_page(struct mm_struct *mm,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	struct vm_area_struct *vma;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd, _pmd;
+	pte_t *pte;
+	pgtable_t pgtable;
+	struct page *new_page;
+	spinlock_t *ptl;
+	int isolated;
+	unsigned long hstart, hend;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(!*hpage);
+
+	/*
+	 * Prevent all access to pagetables with the exception of
+	 * gup_fast later handled by the ptep_clear_flush and the VM
+	 * handled by the anon_vma lock + PG_lock.
+	 */
+	down_write(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto out;
+
+	vma = find_vma(mm, address);
+	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = vma->vm_end & HPAGE_PMD_MASK;
+	if (address < hstart || address + HPAGE_PMD_SIZE > hend)
+		goto out;
+
+	if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always())
+		goto out;
+
+	/* VM_PFNMAP vmas may have vm_ops null but vm_file set */
+	if (!vma->anon_vma || vma->vm_ops || vma->vm_file)
+		goto out;
+	VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	/* pmd can't go away or become huge under us */
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	new_page = *hpage;
+	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
+		goto out;
+
+	anon_vma_lock(vma->anon_vma);
+
+	pte = pte_offset_map(pmd, address);
+	ptl = pte_lockptr(mm, pmd);
+
+	spin_lock(&mm->page_table_lock); /* probably unnecessary */
+	/*
+	 * After this gup_fast can't run anymore. This also removes
+	 * any huge TLB entry from the CPU so we won't allow
+	 * huge and small TLB entries for the same virtual address
+	 * to avoid the risk of CPU bugs in that area.
+	 */
+	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	spin_unlock(&mm->page_table_lock);
+
+	spin_lock(ptl);
+	isolated = __collapse_huge_page_isolate(vma, address, pte);
+	spin_unlock(ptl);
+	pte_unmap(pte);
+
+	if (unlikely(!isolated)) {
+		spin_lock(&mm->page_table_lock);
+		BUG_ON(!pmd_none(*pmd));
+		set_pmd_at(mm, address, pmd, _pmd);
+		spin_unlock(&mm->page_table_lock);
+		anon_vma_unlock(vma->anon_vma);
+		mem_cgroup_uncharge_page(new_page);
+		goto out;
+	}
+
+	/*
+	 * All pages are isolated and locked so anon_vma rmap
+	 * can't run anymore.
+	 */
+	anon_vma_unlock(vma->anon_vma);
+
+	__collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+	__SetPageUptodate(new_page);
+	pgtable = pmd_pgtable(_pmd);
+	VM_BUG_ON(page_count(pgtable) != 1);
+	VM_BUG_ON(page_mapcount(pgtable) != 0);
+
+	_pmd = mk_pmd(new_page, vma->vm_page_prot);
+	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+	_pmd = pmd_mkhuge(_pmd);
+
+	/*
+	 * spin_lock() below is not the equivalent of smp_wmb(), so
+	 * this is needed to prevent the copy_huge_page writes from
+	 * becoming visible after the set_pmd_at() write.
+	 */
+	smp_wmb();
+
+	spin_lock(&mm->page_table_lock);
+	BUG_ON(!pmd_none(*pmd));
+	page_add_new_anon_rmap(new_page, vma, address);
+	set_pmd_at(mm, address, pmd, _pmd);
+	update_mmu_cache(vma, address, entry);
+	prepare_pmd_huge_pte(pgtable, mm);
+	mm->nr_ptes--;
+	spin_unlock(&mm->page_table_lock);
+
+	*hpage = NULL;
+	khugepaged_pages_collapsed++;
+out:
+	up_write(&mm->mmap_sem);
+}
+
+static int khugepaged_scan_pmd(struct mm_struct *mm,
+			       struct vm_area_struct *vma,
+			       unsigned long address,
+			       struct page **hpage)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte, *_pte;
+	int ret = 0, referenced = 0, none = 0;
+	struct page *page;
+	unsigned long _address;
+	spinlock_t *ptl;
+
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
+		goto out;
+
+	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
+	     _pte++, _address += PAGE_SIZE) {
+		pte_t pteval = *_pte;
+		if (pte_none(pteval)) {
+			if (++none <= khugepaged_max_ptes_none)
+				continue;
+			else
+				goto out_unmap;
+		}
+		if (!pte_present(pteval) || !pte_write(pteval))
+			goto out_unmap;
+		page = vm_normal_page(vma, _address, pteval);
+		if (unlikely(!page))
+			goto out_unmap;
+		VM_BUG_ON(PageCompound(page));
+		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
+			goto out_unmap;
+		/* cannot use mapcount: can't collapse if there's a gup pin */
+		if (page_count(page) != 1)
+			goto out_unmap;
+		if (pte_young(pteval))
+			referenced = 1;
+	}
+	if (referenced)
+		ret = 1;
+out_unmap:
+	pte_unmap_unlock(pte, ptl);
+	if (ret) {
+		up_read(&mm->mmap_sem);
+		collapse_huge_page(mm, address, hpage);
+	}
+out:
+	return ret;
+}
+
+static void collect_mm_slot(struct mm_slot *mm_slot)
+{
+	struct mm_struct *mm = mm_slot->mm;
+
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
+		hlist_del(&mm_slot->hash);
+		list_del(&mm_slot->mm_node);
+
+		/*
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+		 */
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
+					    struct page **hpage)
+{
+	struct mm_slot *mm_slot;
+	struct mm_struct *mm;
+	struct vm_area_struct *vma;
+	int progress = 0;
+
+	VM_BUG_ON(!pages);
+	VM_BUG_ON(!spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_scan.mm_slot)
+		mm_slot = khugepaged_scan.mm_slot;
+	else {
+		mm_slot = list_entry(khugepaged_scan.mm_head.next,
+				     struct mm_slot, mm_node);
+		khugepaged_scan.address = 0;
+		khugepaged_scan.mm_slot = mm_slot;
+	}
+	spin_unlock(&khugepaged_mm_lock);
+
+	mm = mm_slot->mm;
+	down_read(&mm->mmap_sem);
+	if (unlikely(khugepaged_test_exit(mm)))
+		vma = NULL;
+	else
+		vma = find_vma(mm, khugepaged_scan.address);
+
+	progress++;
+	for (; vma; vma = vma->vm_next) {
+		unsigned long hstart, hend;
+
+		cond_resched();
+		if (unlikely(khugepaged_test_exit(mm))) {
+			progress++;
+			break;
+		}
+
+		if (!(vma->vm_flags & VM_HUGEPAGE) &&
+		    !khugepaged_always()) {
+			progress++;
+			continue;
+		}
+
+		/* VM_PFNMAP vmas may have vm_ops null but vm_file set */
+		if (!vma->anon_vma || vma->vm_ops || vma->vm_file) {
+			khugepaged_scan.address = vma->vm_end;
+			progress++;
+			continue;
+		}
+		VM_BUG_ON(is_linear_pfn_mapping(vma) || is_pfn_mapping(vma));
+
+		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+		hend = vma->vm_end & HPAGE_PMD_MASK;
+		if (hstart >= hend) {
+			progress++;
+			continue;
+		}
+		if (khugepaged_scan.address < hstart)
+			khugepaged_scan.address = hstart;
+		if (khugepaged_scan.address > hend) {
+			khugepaged_scan.address = hend + HPAGE_PMD_SIZE;
+			progress++;
+			continue;
+		}
+		BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
+
+		while (khugepaged_scan.address < hend) {
+			int ret;
+			cond_resched();
+			if (unlikely(khugepaged_test_exit(mm)))
+				goto breakouterloop;
+
+			VM_BUG_ON(khugepaged_scan.address < hstart ||
+				  khugepaged_scan.address + HPAGE_PMD_SIZE >
+				  hend);
+			ret = khugepaged_scan_pmd(mm, vma,
+						  khugepaged_scan.address,
+						  hpage);
+			/* move to next address */
+			khugepaged_scan.address += HPAGE_PMD_SIZE;
+			progress += HPAGE_PMD_NR;
+			if (ret)
+				/* we released mmap_sem so break loop */
+				goto breakouterloop_mmap_sem;
+			if (progress >= pages)
+				goto breakouterloop;
+		}
+	}
+breakouterloop:
+	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+breakouterloop_mmap_sem:
+
+	spin_lock(&khugepaged_mm_lock);
+	BUG_ON(khugepaged_scan.mm_slot != mm_slot);
+	/*
+	 * Release the current mm_slot if this mm is about to die, or
+	 * if we scanned all vmas of this mm.
+	 */
+	if (khugepaged_test_exit(mm) || !vma) {
+		/*
+		 * Make sure that if mm_users is reaching zero while
+		 * khugepaged runs here, khugepaged_exit will find
+		 * mm_slot not pointing to the exiting mm.
+		 */
+		if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
+			khugepaged_scan.mm_slot = list_entry(
+				mm_slot->mm_node.next,
+				struct mm_slot, mm_node);
+			khugepaged_scan.address = 0;
+		} else {
+			khugepaged_scan.mm_slot = NULL;
+			khugepaged_full_scans++;
+		}
+
+		collect_mm_slot(mm_slot);
+	}
+
+	return progress;
+}
+
+static int khugepaged_has_work(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) &&
+		khugepaged_enabled();
+}
+
+static int khugepaged_wait_event(void)
+{
+	return !list_empty(&khugepaged_scan.mm_head) ||
+		!khugepaged_enabled();
+}
+
+static void khugepaged_do_scan(struct page **hpage)
+{
+	unsigned int progress = 0, pass_through_head = 0;
+	unsigned int pages = khugepaged_pages_to_scan;
+
+	barrier(); /* write khugepaged_pages_to_scan to local stack */
+
+	while (progress < pages) {
+		cond_resched();
+
+		if (!*hpage) {
+			*hpage = alloc_hugepage(khugepaged_defrag());
+			if (unlikely(!*hpage))
+				break;
+		}
+
+		spin_lock(&khugepaged_mm_lock);
+		if (!khugepaged_scan.mm_slot)
+			pass_through_head++;
+		if (khugepaged_has_work() &&
+		    pass_through_head < 2)
+			progress += khugepaged_scan_mm_slot(pages - progress,
+							    hpage);
+		else
+			progress = pages;
+		spin_unlock(&khugepaged_mm_lock);
+	}
+}
+
+static struct page *khugepaged_alloc_hugepage(void)
+{
+	struct page *hpage;
+
+	do {
+		hpage = alloc_hugepage(khugepaged_defrag());
+		if (!hpage) {
+			DEFINE_WAIT(wait);
+			add_wait_queue(&khugepaged_wait, &wait);
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_alloc_sleep_millisecs));
+			remove_wait_queue(&khugepaged_wait, &wait);
+		}
+	} while (unlikely(!hpage) &&
+		 likely(khugepaged_enabled()));
+	return hpage;
+}
+
+static void khugepaged_loop(void)
+{
+	struct page *hpage;
+
+	while (likely(khugepaged_enabled())) {
+		hpage = khugepaged_alloc_hugepage();
+		if (unlikely(!hpage))
+			break;
+
+		khugepaged_do_scan(&hpage);
+		if (hpage)
+			put_page(hpage);
+		if (khugepaged_has_work()) {
+			DEFINE_WAIT(wait);
+			if (!khugepaged_scan_sleep_millisecs)
+				continue;
+			add_wait_queue(&khugepaged_wait, &wait);
+			schedule_timeout_interruptible(
+				msecs_to_jiffies(
+					khugepaged_scan_sleep_millisecs));
+			remove_wait_queue(&khugepaged_wait, &wait);
+		} else if (khugepaged_enabled())
+			wait_event_interruptible(khugepaged_wait,
+						 khugepaged_wait_event());
+	}
+}
+
+static int khugepaged(void *none)
+{
+	struct mm_slot *mm_slot;
+
+	set_user_nice(current, 19);
+
+	/* serialize with start_khugepaged() */
+	mutex_lock(&khugepaged_mutex);
+
+	for (;;) {
+		mutex_unlock(&khugepaged_mutex);
+		BUG_ON(khugepaged_thread != current);
+		khugepaged_loop();
+		BUG_ON(khugepaged_thread != current);
+
+		mutex_lock(&khugepaged_mutex);
+		if (!khugepaged_enabled())
+			break;
+	}
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = khugepaged_scan.mm_slot;
+	khugepaged_scan.mm_slot = NULL;
+	if (mm_slot)
+		collect_mm_slot(mm_slot);
+	spin_unlock(&khugepaged_mm_lock);
+
+	khugepaged_thread = NULL;
+	mutex_unlock(&khugepaged_mutex);
+
+	return 0;
+}
+
 void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
 {
 	struct page *page;


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 42 of 66] khugepaged vma merge
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Register the mm with khugepaged when a vma is merged or grows (e.g. on stack
expansion), so that the newly enlarged range becomes a candidate for hugepage
collapse.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
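
Note: this diff only adds the call sites; khugepaged_enter_vma_merge() itself
comes from the khugepaged patch earlier in the series and is not shown here.
As a rough sketch of what such a helper is expected to do (the exact checks
and the khugepaged_enter() call are assumptions based on checks visible
elsewhere in the series, not a quote of the real implementation):

	static inline int khugepaged_enter_vma_merge(struct vm_area_struct *vma)
	{
		unsigned long hstart, hend;

		/* Only private anonymous memory can be collapsed. */
		if (!vma->anon_vma || vma->vm_ops || vma->vm_file)
			return 0;
		/* Respect madvise(MADV_HUGEPAGE) vs the "always" sysfs mode. */
		if (!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always())
			return 0;
		/* Register only if the grown vma can hold a whole hugepage. */
		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
		hend = vma->vm_end & HPAGE_PMD_MASK;
		if (hstart < hend)
			return khugepaged_enter(vma);
		return 0;
	}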

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/perf_event.h>
 #include <linux/audit.h>
+#include <linux/khugepaged.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -815,6 +816,7 @@ struct vm_area_struct *vma_merge(struct 
 				end, prev->vm_pgoff, NULL);
 		if (err)
 			return NULL;
+		khugepaged_enter_vma_merge(prev);
 		return prev;
 	}
 
@@ -833,6 +835,7 @@ struct vm_area_struct *vma_merge(struct 
 				next->vm_pgoff - pglen, NULL);
 		if (err)
 			return NULL;
+		khugepaged_enter_vma_merge(area);
 		return area;
 	}
 
@@ -1761,6 +1764,7 @@ int expand_upwards(struct vm_area_struct
 		}
 	}
 	vma_unlock_anon_vma(vma);
+	khugepaged_enter_vma_merge(vma);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -1808,6 +1812,7 @@ static int expand_downwards(struct vm_ar
 		}
 	}
 	vma_unlock_anon_vma(vma);
+	khugepaged_enter_vma_merge(vma);
 	return error;
 }
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 43 of 66] don't leave orphaned swap cache after ksm merging
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

When a swapcache page is replaced by a KSM page, don't leave orphaned swap
cache behind.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
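
Note: the fix below relies on free_page_and_swap_cache() dropping the stale
swap cache entry together with the reference that put_page() used to drop.
Roughly, and only as a sketch of the assumed mm/swap_state.c semantics rather
than a verbatim copy:

	void free_page_and_swap_cache(struct page *page)
	{
		/*
		 * If the page sits in swap cache and is no longer mapped,
		 * try to detach it from the swap cache first, so it does
		 * not linger there until reclaim stumbles over it...
		 */
		free_swap_cache(page);
		/* ...then drop our reference, as the old put_page() did. */
		put_page(page);
	}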

diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -800,7 +800,7 @@ static int replace_page(struct vm_area_s
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
 	page_remove_rmap(page);
-	put_page(page);
+	free_page_and_swap_cache(page);
 
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
@@ -846,7 +846,18 @@ static int try_to_merge_one_page(struct 
 	 * ptes are necessarily already write-protected.  But in either
 	 * case, we need to lock and check page_count is not raised.
 	 */
-	if (write_protect_page(vma, page, &orig_pte) == 0) {
+	err = write_protect_page(vma, page, &orig_pte);
+
+	/*
+	 * After this mapping is wrprotected we don't need further
+	 * checks for PageSwapCache vs page_count unlock_page(page)
+	 * and we rely only on the pte_same() check run under PT lock
+	 * to ensure the pte didn't change since when we wrprotected
+	 * it under PG_lock.
+	 */
+	unlock_page(page);
+
+	if (!err) {
 		if (!kpage) {
 			/*
 			 * While we hold page lock, upgrade page from
@@ -855,22 +866,22 @@ static int try_to_merge_one_page(struct 
 			 */
 			set_page_stable_node(page, NULL);
 			mark_page_accessed(page);
-			err = 0;
 		} else if (pages_identical(page, kpage))
 			err = replace_page(vma, page, kpage, orig_pte);
-	}
+	} else
+		err = -EFAULT;
 
 	if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
+		lock_page(page);	/* for LRU manipulation */
 		munlock_vma_page(page);
+		unlock_page(page);
 		if (!PageMlocked(kpage)) {
-			unlock_page(page);
 			lock_page(kpage);
 			mlock_vma_page(kpage);
-			page = kpage;		/* for final unlock */
+			unlock_page(kpage);
 		}
 	}
 
-	unlock_page(page);
 out:
 	return err;
 }

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 44 of 66] skip transhuge pages in ksm for now
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Skip transhuge pages in ksm for now.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
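
Note on the second hunk below: when the scan hits a transhuge page it skips
the whole huge page instead of walking it 4k at a time. As a worked example,
assuming x86 with HPAGE_PMD_SIZE = 2MB (addresses purely illustrative):

	/*
	 *	ksm_scan.address  = 0x00a81000;		somewhere inside the THP
	 *	ksm_scan.address &= HPAGE_PMD_MASK;	-> 0x00a00000 (huge page start)
	 *	ksm_scan.address += HPAGE_PMD_SIZE;	-> 0x00c00000 (next huge page)
	 *
	 * so none of the remaining 4k subpages of the compound page are scanned.
	 */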

diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -430,7 +430,7 @@ static struct page *get_mergeable_page(s
 	page = follow_page(vma, addr, FOLL_GET);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
-	if (PageAnon(page)) {
+	if (PageAnon(page) && !PageTransCompound(page)) {
 		flush_anon_page(vma, page, addr);
 		flush_dcache_page(page);
 	} else {
@@ -1288,7 +1288,19 @@ next_mm:
 			if (ksm_test_exit(mm))
 				break;
 			*page = follow_page(vma, ksm_scan.address, FOLL_GET);
-			if (!IS_ERR_OR_NULL(*page) && PageAnon(*page)) {
+			if (IS_ERR_OR_NULL(*page)) {
+				ksm_scan.address += PAGE_SIZE;
+				cond_resched();
+				continue;
+			}
+			if (PageTransCompound(*page)) {
+				put_page(*page);
+				ksm_scan.address &= HPAGE_PMD_MASK;
+				ksm_scan.address += HPAGE_PMD_SIZE;
+				cond_resched();
+				continue;
+			}
+			if (PageAnon(*page)) {
 				flush_anon_page(vma, *page, ksm_scan.address);
 				flush_dcache_page(*page);
 				rmap_item = get_next_rmap_item(slot,
@@ -1302,8 +1314,7 @@ next_mm:
 				up_read(&mm->mmap_sem);
 				return rmap_item;
 			}
-			if (!IS_ERR_OR_NULL(*page))
-				put_page(*page);
+			put_page(*page);
 			ksm_scan.address += PAGE_SIZE;
 			cond_resched();
 		}

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 45 of 66] remove PG_buddy
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

PG_buddy can be converted to _mapcount == -2, which frees a page flag so that
PG_compound_lock can be added to page->flags without overflowing on
CONFIG_X86_PAE=y plus CONFIG_X86_PAT=y (where the sparse section bits already
eat into the available flag space). This also requires moving the memory
hotplug bootmem information from _mapcount to lru.next, to avoid any risk of
clashing with the new -2 value. We can't use lru.next to replace PG_buddy
itself, but memory hotplug can use lru.next even more easily than the
mapcount.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
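
Note: after this patch the buddy state is carried in the _mapcount value
space. The summary and the two helpers below are only an illustration of how
mm/page_alloc.c is expected to use the new interface (mark_page_free() and
mark_page_allocated() are illustrative names, not functions added by this
patch):

	/*
	 * Assumed _mapcount encoding after this change:
	 *	>= 0 : page mapped into (value + 1) page tables
	 *	  -1 : page not mapped (the initialized state)
	 *	  -2 : page is free and owned by the buddy allocator
	 */
	static inline void mark_page_free(struct page *page, unsigned int order)
	{
		set_page_private(page, order);	/* remember the buddy order */
		__SetPageBuddy(page);		/* _mapcount: -1 -> -2 */
	}

	static inline void mark_page_allocated(struct page *page)
	{
		__ClearPageBuddy(page);		/* _mapcount: -2 -> -1 */
		set_page_private(page, 0);	/* forget the order */
	}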

diff --git a/fs/proc/page.c b/fs/proc/page.c
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -116,15 +116,17 @@ u64 stable_page_flags(struct page *page)
 	if (PageHuge(page))
 		u |= 1 << KPF_HUGE;
 
+	/*
+	 * Caveats on high order pages: page->_count will only be set
+	 * -1 on the head page; SLUB/SLQB do the same for PG_slab;
+	 * SLOB won't set PG_slab at all on compound pages.
+	 */
+	if (PageBuddy(page))
+		u |= 1 << KPF_BUDDY;
+
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
 
-	/*
-	 * Caveats on high order pages:
-	 * PG_buddy will only be set on the head page; SLUB/SLQB do the same
-	 * for PG_slab; SLOB won't set PG_slab at all on compound pages.
-	 */
 	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
-	u |= kpf_copy_bit(k, KPF_BUDDY,		PG_buddy);
 
 	u |= kpf_copy_bit(k, KPF_ERROR,		PG_error);
 	u |= kpf_copy_bit(k, KPF_DIRTY,		PG_dirty);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -13,12 +13,16 @@ struct mem_section;
 #ifdef CONFIG_MEMORY_HOTPLUG
 
 /*
- * Types for free bootmem.
- * The normal smallest mapcount is -1. Here is smaller value than it.
+ * Types for free bootmem stored in page->lru.next. These have to be in
+ * some random range in unsigned long space for debugging purposes.
  */
-#define SECTION_INFO		(-1 - 1)
-#define MIX_SECTION_INFO	(-1 - 2)
-#define NODE_INFO		(-1 - 3)
+enum {
+	MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
+	SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
+	MIX_SECTION_INFO,
+	NODE_INFO,
+	MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
+};
 
 /*
  * pgdat resizing functions
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -383,6 +383,27 @@ static inline void init_page_count(struc
 	atomic_set(&page->_count, 1);
 }
 
+/*
+ * PageBuddy() indicates that the page is free and in the buddy system
+ * (see mm/page_alloc.c).
+ */
+static inline int PageBuddy(struct page *page)
+{
+	return atomic_read(&page->_mapcount) == -2;
+}
+
+static inline void __SetPageBuddy(struct page *page)
+{
+	VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
+	atomic_set(&page->_mapcount, -2);
+}
+
+static inline void __ClearPageBuddy(struct page *page)
+{
+	VM_BUG_ON(!PageBuddy(page));
+	atomic_set(&page->_mapcount, -1);
+}
+
 void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -48,9 +48,6 @@
  * struct page (these bits with information) are always mapped into kernel
  * address space...
  *
- * PG_buddy is set to indicate that the page is free and in the buddy system
- * (see mm/page_alloc.c).
- *
  * PG_hwpoison indicates that a page got corrupted in hardware and contains
  * data with incorrect ECC bits that triggered a machine check. Accessing is
  * not safe since it may cause another machine check. Don't touch!
@@ -96,7 +93,6 @@ enum pageflags {
 	PG_swapcache,		/* Swap page: swp_entry_t in private */
 	PG_mappedtodisk,	/* Has blocks allocated on-disk */
 	PG_reclaim,		/* To be reclaimed asap */
-	PG_buddy,		/* Page is free, on buddy lists */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 	PG_unevictable,		/* Page is "unevictable"  */
 #ifdef CONFIG_MMU
@@ -233,7 +229,6 @@ PAGEFLAG(OwnerPriv1, owner_priv_1) TESTC
  * risky: they bypass page accounting.
  */
 TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
-__PAGEFLAG(Buddy, buddy)
 PAGEFLAG(MappedToDisk, mappedtodisk)
 
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
@@ -428,7 +423,7 @@ static inline void ClearPageCompound(str
 #define PAGE_FLAGS_CHECK_AT_FREE \
 	(1 << PG_lru	 | 1 << PG_locked    | \
 	 1 << PG_private | 1 << PG_private_2 | \
-	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
+	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
 	 __PG_COMPOUND_LOCK)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -65,9 +65,10 @@ static void release_memory_resource(stru
 
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void get_page_bootmem(unsigned long info,  struct page *page, int type)
+static void get_page_bootmem(unsigned long info,  struct page *page,
+			     unsigned long type)
 {
-	atomic_set(&page->_mapcount, type);
+	page->lru.next = (struct list_head *) type;
 	SetPagePrivate(page);
 	set_page_private(page, info);
 	atomic_inc(&page->_count);
@@ -77,15 +78,16 @@ static void get_page_bootmem(unsigned lo
  * so use __ref to tell modpost not to generate a warning */
 void __ref put_page_bootmem(struct page *page)
 {
-	int type;
+	unsigned long type;
 
-	type = atomic_read(&page->_mapcount);
-	BUG_ON(type >= -1);
+	type = (unsigned long) page->lru.next;
+	BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+	       type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
 
 	if (atomic_dec_return(&page->_count) == 1) {
 		ClearPagePrivate(page);
 		set_page_private(page, 0);
-		reset_page_mapcount(page);
+		INIT_LIST_HEAD(&page->lru);
 		__free_pages_bootmem(page, 0);
 	}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -444,8 +444,8 @@ __find_combined_index(unsigned long page
  * (c) a page and its buddy have the same order &&
  * (d) a page and its buddy are in the same zone.
  *
- * For recording whether a page is in the buddy system, we use PG_buddy.
- * Setting, clearing, and testing PG_buddy is serialized by zone->lock.
+ * For recording whether a page is in the buddy system, we set ->_mapcount -2.
+ * Setting, clearing, and testing _mapcount -2 is serialized by zone->lock.
  *
  * For recording page's order, we use page_private(page).
  */
@@ -478,7 +478,7 @@ static inline int page_is_buddy(struct p
  * as necessary, plus some accounting needed to play nicely with other
  * parts of the VM system.
  * At each level, we keep a list of pages, which are heads of continuous
- * free pages of length of (1 << order) and marked with PG_buddy. Page's
+ * free pages of length of (1 << order) and marked with _mapcount -2. Page's
  * order is recorded in page_private(page) field.
  * So when we are allocating or freeing one, we can derive the state of the
  * other.  That is, if we allocate a small block, and both were   
@@ -5520,7 +5520,6 @@ static struct trace_print_flags pageflag
 	{1UL << PG_swapcache,		"swapcache"	},
 	{1UL << PG_mappedtodisk,	"mappedtodisk"	},
 	{1UL << PG_reclaim,		"reclaim"	},
-	{1UL << PG_buddy,		"buddy"		},
 	{1UL << PG_swapbacked,		"swapbacked"	},
 	{1UL << PG_unevictable,		"unevictable"	},
 #ifdef CONFIG_MMU
diff --git a/mm/sparse.c b/mm/sparse.c
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -671,10 +671,10 @@ static void __kfree_section_memmap(struc
 static void free_map_bootmem(struct page *page, unsigned long nr_pages)
 {
 	unsigned long maps_section_nr, removing_section_nr, i;
-	int magic;
+	unsigned long magic;
 
 	for (i = 0; i < nr_pages; i++, page++) {
-		magic = atomic_read(&page->_mapcount);
+		magic = (unsigned long) page->lru.next;
 
 		BUG_ON(magic == NODE_INFO);
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 46 of 66] add x86 32bit support
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

Add support for transparent hugepages to 32-bit x86.

VM_HUGEPAGE shares the same VM_ bitflag value as VM_MAPPED_COPY; this is safe
because mm/nommu.c (the only user of VM_MAPPED_COPY) will never support
transparent hugepages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
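
Note: on PAE the pmd is two 32-bit words, so the native_pmdp_get_and_clear()
added below cannot clear the entry with a single store. The annotated sketch
here spells out the assumed reasoning behind clearing the low word first; it
restates the code from the diff and is not an alternative implementation:

	static inline pmd_t pae_pmdp_get_and_clear_sketch(pmd_t *pmdp)
	{
		union split_pmd res, *orig = (union split_pmd *)pmdp;

		/*
		 * Present, accessed and dirty all live in the low 32 bits.
		 * Fetching and zeroing them with one locked xchg returns a
		 * consistent set of flags and makes the entry non-present,
		 * so the CPU cannot set A/D bits behind our back afterwards.
		 */
		res.pmd_low = xchg(&orig->pmd_low, 0);
		/*
		 * The high word only holds upper pfn bits, which the CPU
		 * never modifies, so plain accesses are enough here.
		 */
		res.pmd_high = orig->pmd_high;
		orig->pmd_high = 0;

		return res.pmd;
	}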

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -46,6 +46,15 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
+#ifdef CONFIG_SMP
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+	return __pmd(xchg((pmdval_t *)xp, 0));
+}
+#else
+#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#endif
+
 /*
  * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken,
  * split up the 29 bits of offset into this range:
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -104,6 +104,29 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
+#ifdef CONFIG_SMP
+union split_pmd {
+	struct {
+		u32 pmd_low;
+		u32 pmd_high;
+	};
+	pmd_t pmd;
+};
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
+{
+	union split_pmd res, *orig = (union split_pmd *)pmdp;
+
+	/* xchg acts as a barrier before setting of the high bits */
+	res.pmd_low = xchg(&orig->pmd_low, 0);
+	res.pmd_high = orig->pmd_high;
+	orig->pmd_high = 0;
+
+	return res.pmd;
+}
+#else
+#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#endif
+
 /*
  * Bits 0, 6 and 7 are taken in the low part of the pte,
  * put the 32 bits of offset into the high part.
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -97,6 +97,11 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
@@ -145,6 +150,18 @@ static inline int pmd_large(pmd_t pte)
 		(_PAGE_PSE | _PAGE_PRESENT);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
 {
 	pteval_t v = native_pte_val(pte);
@@ -219,6 +236,55 @@ static inline pte_t pte_mkspecial(pte_t 
 	return pte_set_flags(pte, _PAGE_SPECIAL);
 }
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return __pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return __pmd(v & ~clear);
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 /*
  * Mask out unsupported bits in a present pgprot.  Non-present pgprots
  * can use those bits for other purposes, so leave them be.
@@ -527,6 +593,14 @@ static inline pte_t native_local_ptep_ge
 	return res;
 }
 
+static inline pmd_t native_local_pmdp_get_and_clear(pmd_t *pmdp)
+{
+	pmd_t res = *pmdp;
+
+	native_pmd_clear(pmdp);
+	return res;
+}
+
 static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
 				     pte_t *ptep , pte_t pte)
 {
@@ -616,6 +690,49 @@ static inline void ptep_set_wrprotect(st
 
 #define flush_tlb_fix_spurious_fault(vma, address)
 
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
+
+#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
+
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	pmd_update(mm, addr, pmdp);
+}
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -182,115 +182,6 @@ extern void cleanup_highmap(void);
 
 #define __HAVE_ARCH_PTE_SAME
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
-static inline int pmd_trans_huge(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_PSE;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
-#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
-
-#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
-extern int pmdp_set_access_flags(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp,
-				 pmd_t entry, int dirty);
-
-#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
-				     unsigned long addr, pmd_t *pmdp);
-
-#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
-extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
-				  unsigned long address, pmd_t *pmdp);
-
-
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
-#define __HAVE_ARCH_PMD_WRITE
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
-#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
-static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmdp)
-{
-	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
-	pmd_update(mm, addr, pmdp);
-	return pmd;
-}
-
-#define __HAVE_ARCH_PMDP_SET_WRPROTECT
-static inline void pmdp_set_wrprotect(struct mm_struct *mm,
-				      unsigned long addr, pmd_t *pmdp)
-{
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
-	pmd_update(mm, addr, pmdp);
-}
-
-static inline int pmd_young(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_ACCESSED;
-}
-
-static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v | set);
-}
-
-static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v & ~clear);
-}
-
-static inline pmd_t pmd_mkold(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_wrprotect(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_RW);
-}
-
-static inline pmd_t pmd_mkdirty(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_DIRTY);
-}
-
-static inline pmd_t pmd_mkhuge(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_PSE);
-}
-
-static inline pmd_t pmd_mkyoung(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_RW);
-}
-
-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
-}
-
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -362,7 +362,7 @@ int pmdp_test_and_clear_young(struct vm_
 
 	if (pmd_young(*pmdp))
 		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
-					 (unsigned long *) &pmdp->pmd);
+					 (unsigned long *)pmdp);
 
 	if (ret)
 		pmd_update(vma->vm_mm, addr, pmdp);
@@ -404,7 +404,7 @@ void pmdp_splitting_flush(struct vm_area
 	int set;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)&pmdp->pmd);
+				(unsigned long *)pmdp);
 	if (set) {
 		pmd_update(vma->vm_mm, address, pmdp);
 		/* need tlb flush only to serialize against gup-fast */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -102,7 +102,11 @@ extern unsigned int kobjsize(const void 
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
+#else
+#define VM_HUGEPAGE	0x01000000	/* MADV_HUGEPAGE marked this vma */
+#endif
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
 
@@ -111,9 +115,6 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
-#if BITS_PER_LONG > 32
-#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
-#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -304,7 +304,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support" if EMBEDDED
-	depends on X86_64 && MMU
+	depends on X86 && MMU
 	default y
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 46 of 66] add x86 32bit support
@ 2010-11-03 15:28   ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

Add support for transparent hugepages to 32-bit x86.

VM_HUGEPAGE shares the same VM_ bitflag value as VM_MAPPED_COPY; this is safe
because mm/nommu.c (the only user of VM_MAPPED_COPY) will never support
transparent hugepages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -46,6 +46,15 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
+#ifdef CONFIG_SMP
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
+{
+	return __pmd(xchg((pmdval_t *)xp, 0));
+}
+#else
+#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#endif
+
 /*
  * Bits _PAGE_BIT_PRESENT, _PAGE_BIT_FILE and _PAGE_BIT_PROTNONE are taken,
  * split up the 29 bits of offset into this range:
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -104,6 +104,29 @@ static inline pte_t native_ptep_get_and_
 #define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
 #endif
 
+#ifdef CONFIG_SMP
+union split_pmd {
+	struct {
+		u32 pmd_low;
+		u32 pmd_high;
+	};
+	pmd_t pmd;
+};
+static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
+{
+	union split_pmd res, *orig = (union split_pmd *)pmdp;
+
+	/* xchg acts as a barrier before setting of the high bits */
+	res.pmd_low = xchg(&orig->pmd_low, 0);
+	res.pmd_high = orig->pmd_high;
+	orig->pmd_high = 0;
+
+	return res.pmd;
+}
+#else
+#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
+#endif
+
 /*
  * Bits 0, 6 and 7 are taken in the low part of the pte,
  * put the 32 bits of offset into the high part.
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -97,6 +97,11 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_young(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_ACCESSED;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
@@ -145,6 +150,18 @@ static inline int pmd_large(pmd_t pte)
 		(_PAGE_PSE | _PAGE_PRESENT);
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_SPLITTING;
+}
+
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return pmd_val(pmd) & _PAGE_PSE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
 {
 	pteval_t v = native_pte_val(pte);
@@ -219,6 +236,55 @@ static inline pte_t pte_mkspecial(pte_t 
 	return pte_set_flags(pte, _PAGE_SPECIAL);
 }
 
+static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return __pmd(v | set);
+}
+
+static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
+{
+	pmdval_t v = native_pmd_val(pmd);
+
+	return __pmd(v & ~clear);
+}
+
+static inline pmd_t pmd_mkold(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_wrprotect(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mkdirty(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_DIRTY);
+}
+
+static inline pmd_t pmd_mkhuge(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_PSE);
+}
+
+static inline pmd_t pmd_mkyoung(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mkwrite(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_RW);
+}
+
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+
 /*
  * Mask out unsupported bits in a present pgprot.  Non-present pgprots
  * can use those bits for other purposes, so leave them be.
@@ -527,6 +593,14 @@ static inline pte_t native_local_ptep_ge
 	return res;
 }
 
+static inline pmd_t native_local_pmdp_get_and_clear(pmd_t *pmdp)
+{
+	pmd_t res = *pmdp;
+
+	native_pmd_clear(pmdp);
+	return res;
+}
+
 static inline void native_set_pte_at(struct mm_struct *mm, unsigned long addr,
 				     pte_t *ptep , pte_t pte)
 {
@@ -616,6 +690,49 @@ static inline void ptep_set_wrprotect(st
 
 #define flush_tlb_fix_spurious_fault(vma, address)
 
+#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
+
+#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
+
+#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+				     unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
+				  unsigned long address, pmd_t *pmdp);
+
+
+#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long addr, pmd_t *pmdp);
+
+#define __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_RW;
+}
+
+#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
+	pmd_update(mm, addr, pmdp);
+	return pmd;
+}
+
+#define __HAVE_ARCH_PMDP_SET_WRPROTECT
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long addr, pmd_t *pmdp)
+{
+	clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
+	pmd_update(mm, addr, pmdp);
+}
+
 /*
  * clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
  *
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -182,115 +182,6 @@ extern void cleanup_highmap(void);
 
 #define __HAVE_ARCH_PTE_SAME
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pmd_trans_splitting(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_SPLITTING;
-}
-
-static inline int pmd_trans_huge(pmd_t pmd)
-{
-	return pmd_val(pmd) & _PAGE_PSE;
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
-#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
-
-#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
-extern int pmdp_set_access_flags(struct vm_area_struct *vma,
-				 unsigned long address, pmd_t *pmdp,
-				 pmd_t entry, int dirty);
-
-#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
-				     unsigned long addr, pmd_t *pmdp);
-
-#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
-extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
-				  unsigned long address, pmd_t *pmdp);
-
-
-#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
-extern void pmdp_splitting_flush(struct vm_area_struct *vma,
-				 unsigned long addr, pmd_t *pmdp);
-
-#define __HAVE_ARCH_PMD_WRITE
-static inline int pmd_write(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_RW;
-}
-
-#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
-static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
-				       pmd_t *pmdp)
-{
-	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
-	pmd_update(mm, addr, pmdp);
-	return pmd;
-}
-
-#define __HAVE_ARCH_PMDP_SET_WRPROTECT
-static inline void pmdp_set_wrprotect(struct mm_struct *mm,
-				      unsigned long addr, pmd_t *pmdp)
-{
-	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
-	pmd_update(mm, addr, pmdp);
-}
-
-static inline int pmd_young(pmd_t pmd)
-{
-	return pmd_flags(pmd) & _PAGE_ACCESSED;
-}
-
-static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v | set);
-}
-
-static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
-{
-	pmdval_t v = native_pmd_val(pmd);
-
-	return native_make_pmd(v & ~clear);
-}
-
-static inline pmd_t pmd_mkold(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_wrprotect(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_RW);
-}
-
-static inline pmd_t pmd_mkdirty(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_DIRTY);
-}
-
-static inline pmd_t pmd_mkhuge(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_PSE);
-}
-
-static inline pmd_t pmd_mkyoung(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_ACCESSED);
-}
-
-static inline pmd_t pmd_mkwrite(pmd_t pmd)
-{
-	return pmd_set_flags(pmd, _PAGE_RW);
-}
-
-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
-{
-	return pmd_clear_flags(pmd, _PAGE_PRESENT);
-}
-
 #endif /* !__ASSEMBLY__ */
 
 #endif /* _ASM_X86_PGTABLE_64_H */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -362,7 +362,7 @@ int pmdp_test_and_clear_young(struct vm_
 
 	if (pmd_young(*pmdp))
 		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
-					 (unsigned long *) &pmdp->pmd);
+					 (unsigned long *)pmdp);
 
 	if (ret)
 		pmd_update(vma->vm_mm, addr, pmdp);
@@ -404,7 +404,7 @@ void pmdp_splitting_flush(struct vm_area
 	int set;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
-				(unsigned long *)&pmdp->pmd);
+				(unsigned long *)pmdp);
 	if (set) {
 		pmd_update(vma->vm_mm, address, pmdp);
 		/* need tlb flush only to serialize against gup-fast */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -102,7 +102,11 @@ extern unsigned int kobjsize(const void 
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
 #define VM_NONLINEAR	0x00800000	/* Is non-linear (remap_file_pages) */
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define VM_MAPPED_COPY	0x01000000	/* T if mapped copy of data (nommu mmap) */
+#else
+#define VM_HUGEPAGE	0x01000000	/* MADV_HUGEPAGE marked this vma */
+#endif
 #define VM_INSERTPAGE	0x02000000	/* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP	0x04000000	/* Always include in core dumps */
 
@@ -111,9 +115,6 @@ extern unsigned int kobjsize(const void 
 #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
 #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
 #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
-#if BITS_PER_LONG > 32
-#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
-#endif
 
 #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -304,7 +304,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support" if EMBEDDED
-	depends on X86_64 && MMU
+	depends on X86 && MMU
 	default y
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and


^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 47 of 66] mincore transparent hugepage support
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

Handle transparent huge page pmd entries natively instead of splitting
them into subpages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
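
For illustration, a minimal userspace sketch (not part of the patch) of what
the fast path means for callers: when the range is backed by a huge pmd,
mincore_huge_pmd() fills the whole vector under the page_table_lock instead
of walking 512 ptes. Whether the anonymous mapping below actually ends up
THP-backed depends on alignment and on THP being enabled, so treat it as the
hypothetical best case.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 2UL << 20;         /* one pmd worth of address space */
        unsigned char vec[512];         /* one byte per 4k subpage */
        unsigned int resident = 0, i;
        char *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0, len);              /* fault the range in */

        if (mincore(p, len, vec))
                return 1;
        for (i = 0; i < sizeof(vec); i++)
                resident += vec[i] & 1;
        /* with a huge pmd backing the range, all 512 subpages are reported */
        printf("%u of %zu subpages resident\n", resident, sizeof(vec));
        return 0;
}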

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pm
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd);
+extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			unsigned long addr, unsigned long end,
+			unsigned char *vec);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -919,6 +919,31 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 	return ret;
 }
 
+int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, unsigned long end,
+		unsigned char *vec)
+{
+	int ret = 0;
+
+	spin_lock(&vma->vm_mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		ret = !pmd_trans_splitting(*pmd);
+		spin_unlock(&vma->vm_mm->page_table_lock);
+		if (unlikely(!ret))
+			wait_split_huge_page(vma->anon_vma, pmd);
+		else {
+			/*
+			 * All logical pages in the range are present
+			 * if backed by a huge page.
+			 */
+			memset(vec, 1, (end - addr) >> PAGE_SHIFT);
+		}
+	} else
+		spin_unlock(&vma->vm_mm->page_table_lock);
+
+	return ret;
+}
+
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
diff --git a/mm/mincore.c b/mm/mincore.c
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -154,7 +154,13 @@ static void mincore_pmd_range(struct vm_
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(vma->vm_mm, pmd);
+		if (pmd_trans_huge(*pmd)) {
+			if (mincore_huge_pmd(vma, pmd, addr, next, vec)) {
+				vec += (next - addr) >> PAGE_SHIFT;
+				continue;
+			}
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd))
 			mincore_unmapped_range(vma, addr, next, vec);
 		else

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 48 of 66] add pmd_modify
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

Add pmd_modify() for use with mprotect() on huge pmds.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
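
A sketch of the intended use (the caller here is hypothetical; the real one
arrives with the mprotect patch later in the series): pmd_modify() swaps the
protection bits while _HPAGE_CHG_MASK preserves the pfn, the accessed/dirty
bits and, crucially, _PAGE_PSE, so the result is still a huge pmd.

/*
 * Illustrative only: read-modify-write of a huge pmd with pmd_modify().
 * Locking and TLB flushing are omitted; see the later mprotect patch for
 * the real sequence.
 */
static void example_change_prot(struct mm_struct *mm, unsigned long addr,
                                pmd_t *pmdp, pgprot_t newprot)
{
        pmd_t entry = pmdp_get_and_clear(mm, addr, pmdp);

        entry = pmd_modify(entry, newprot);     /* keeps pfn + _PAGE_PSE */
        set_pmd_at(mm, addr, pmdp, entry);
        VM_BUG_ON(!pmd_trans_huge(*pmdp));      /* still a huge pmd */
}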

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -325,6 +325,16 @@ static inline pte_t pte_modify(pte_t pte
 	return __pte(val);
 }
 
+static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+{
+	pmdval_t val = pmd_val(pmd);
+
+	val &= _HPAGE_CHG_MASK;
+	val |= massage_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+
+	return __pmd(val);
+}
+
 /* mprotect needs to preserve PAT bits when updating vm_page_prot */
 #define pgprot_modify pgprot_modify
 static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -72,6 +72,7 @@
 /* Set of bits not changed in pte_modify */
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY)
+#define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 #define _PAGE_CACHE_MASK	(_PAGE_PCD | _PAGE_PWT)
 #define _PAGE_CACHE_WB		(0)

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 49 of 66] mprotect: pass vma down to page table walkers
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

Flushing the TLB for huge pmds requires the vma's anon_vma, so pass
the vma down instead of the mm; the latter can always be recovered from
vma->vm_mm when needed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
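
A sketch of why the vma is the more useful handle at the pmd level
(illustrative only, not a real code path): the mm is always reachable
through it, while the helpers that flush need the vma itself.

static void example_pmd_walker(struct vm_area_struct *vma, pmd_t *pmd,
                               unsigned long addr)
{
        struct mm_struct *mm = vma->vm_mm;      /* mm from vma: always possible */

        split_huge_page_pmd(mm, pmd);           /* needs only the mm */
        flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE); /* needs the vma */
}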

diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -78,7 +78,7 @@ static void change_pte_range(struct mm_s
 	pte_unmap_unlock(pte - 1, ptl);
 }
 
-static inline void change_pmd_range(struct mm_struct *mm, pud_t *pud,
+static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -88,14 +88,15 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(mm, pmd);
+		split_huge_page_pmd(vma->vm_mm, pmd);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		change_pte_range(mm, pmd, addr, next, newprot, dirty_accountable);
+		change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+				 dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static inline void change_pud_range(struct mm_struct *mm, pgd_t *pgd,
+static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
@@ -107,7 +108,8 @@ static inline void change_pud_range(stru
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		change_pmd_range(mm, pud, addr, next, newprot, dirty_accountable);
+		change_pmd_range(vma, pud, addr, next, newprot,
+				 dirty_accountable);
 	} while (pud++, addr = next, addr != end);
 }
 
@@ -127,7 +129,8 @@ static void change_protection(struct vm_
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		change_pud_range(mm, pgd, addr, next, newprot, dirty_accountable);
+		change_pud_range(vma, pgd, addr, next, newprot,
+				 dirty_accountable);
 	} while (pgd++, addr = next, addr != end);
 	flush_tlb_range(vma, start, end);
 }

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 50 of 66] mprotect: transparent huge page support
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Johannes Weiner <hannes@cmpxchg.org>

Natively handle huge pmds when changing page tables on behalf of
mprotect().

I left out update_mmu_cache() because we do not need it on x86 anyway,
and, more importantly, because its interface works on ptes, not pmds.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
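
For illustration, the userspace-visible effect (not part of the patch): an
mprotect() covering one full, aligned huge pmd is now handled by
change_huge_pmd() without splitting the hugepage, while a partial range still
splits first. Whether the mapping below is really THP-backed is not
guaranteed; the snippet only shows the best case.

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 4UL << 20;         /* 4MB, so a 2MB-aligned subrange exists */
        char *p, *huge;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p, 0, len);              /* fault in, possibly as hugepages */

        /* round up to the next 2MB boundary: exactly one huge pmd */
        huge = (char *)(((unsigned long)p + (2UL << 20) - 1) & ~((2UL << 20) - 1));
        return mprotect(huge, 2UL << 20, PROT_READ);
}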

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,8 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned char *vec);
+extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			unsigned long addr, pgprot_t newprot);
 
 enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -944,6 +944,33 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }
 
+int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, pgprot_t newprot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int ret = 0;
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*pmd))) {
+		if (unlikely(pmd_trans_splitting(*pmd))) {
+			spin_unlock(&mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+		} else {
+			pmd_t entry;
+
+			entry = pmdp_get_and_clear(mm, addr, pmd);
+			entry = pmd_modify(entry, newprot);
+			set_pmd_at(mm, addr, pmd, entry);
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			flush_tlb_range(vma, addr, addr + HPAGE_PMD_SIZE);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&vma->vm_mm->page_table_lock);
+
+	return ret;
+}
+
 pmd_t *page_check_address_pmd(struct page *page,
 			      struct mm_struct *mm,
 			      unsigned long address,
diff --git a/mm/mprotect.c b/mm/mprotect.c
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -88,7 +88,13 @@ static inline void change_pmd_range(stru
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		split_huge_page_pmd(vma->vm_mm, pmd);
+		if (pmd_trans_huge(*pmd)) {
+			if (next - addr != HPAGE_PMD_SIZE)
+				split_huge_page_pmd(vma->vm_mm, pmd);
+			else if (change_huge_pmd(vma, pmd, addr, newprot))
+				continue;
+			/* fall through */
+		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		change_pte_range(vma->vm_mm, pmd, addr, next, newprot,

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 51 of 66] set recommended min free kbytes
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

If transparent hugepage support is enabled, initialize min_free_kbytes to an
optimal value by default. This moves the hugeadm algorithm into the kernel.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
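
To make the sizing concrete, a worked example with assumed numbers: 2MB
hugepages (HPAGE_PMD_NR = 512), three populated zones, PAGE_SHIFT = 12.

/* Illustrative arithmetic only; the code below computes this at runtime. */
static unsigned long example_recommended_min_kbytes(void)
{
        unsigned long pages;

        pages  = 512 * 3 * 2;           /*  3072 pages free for MIGRATE_RESERVE */
        pages += 512 * 3 * 3 * 3;       /* 13824 pages of migratetype slack */
        /* the real code first caps 'pages' at nr_free_buffer_pages() / 20,
         * i.e. 5% of lowmem, before converting to kbytes */
        return pages << (12 - 10);      /* 16896 pages -> 67584 kbytes (~66MB) */
}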

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -85,6 +85,47 @@ struct khugepaged_scan {
 	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
 };
 
+
+static int set_recommended_min_free_kbytes(void)
+{
+	struct zone *zone;
+	int nr_zones = 0;
+	unsigned long recommended_min;
+	extern int min_free_kbytes;
+
+	if (!test_bit(TRANSPARENT_HUGEPAGE_FLAG,
+		      &transparent_hugepage_flags) &&
+	    !test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+		      &transparent_hugepage_flags))
+		return 0;
+
+	for_each_populated_zone(zone)
+		nr_zones++;
+
+	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
+	recommended_min = HPAGE_PMD_NR * nr_zones * 2;
+
+	/*
+	 * Make sure that on average at least two pageblocks are almost free
+	 * of another type, one for a migratetype to fall back to and a
+	 * second to avoid subsequent fallbacks of other types There are 3
+	 * MIGRATE_TYPES we care about.
+	 */
+	recommended_min += HPAGE_PMD_NR * nr_zones * 3 * 3;
+
+	/* don't ever allow to reserve more than 5% of the lowmem */
+	recommended_min = min(recommended_min,
+			      (unsigned long) nr_free_buffer_pages() / 20);
+	recommended_min <<= (PAGE_SHIFT-10);
+
+	if (recommended_min > min_free_kbytes) {
+		min_free_kbytes = recommended_min;
+		setup_per_zone_wmarks();
+	}
+	return 0;
+}
+late_initcall(set_recommended_min_free_kbytes);
+
 static int start_khugepaged(void)
 {
 	int err = 0;
@@ -108,6 +149,8 @@ static int start_khugepaged(void)
 		mutex_unlock(&khugepaged_mutex);
 		if (wakeup)
 			wake_up_interruptible(&khugepaged_wait);
+
+		set_recommended_min_free_kbytes();
 	} else
 		/* wakeup to exit */
 		wake_up_interruptible(&khugepaged_wait);
@@ -177,6 +220,13 @@ static ssize_t enabled_store(struct kobj
 			ret = err;
 	}
 
+	if (ret > 0 &&
+	    (test_bit(TRANSPARENT_HUGEPAGE_FLAG,
+		      &transparent_hugepage_flags) ||
+	     test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+		      &transparent_hugepage_flags)))
+		set_recommended_min_free_kbytes();
+
 	return ret;
 }
 static struct kobj_attribute enabled_attr =
@@ -464,6 +514,8 @@ static int __init hugepage_init(void)
 
 	start_khugepaged();
 
+	set_recommended_min_free_kbytes();
+
 out:
 	return err;
 }

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 52 of 66] enable direct defrag
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

With memory compaction in, and lumpy reclaim out of the way, it seems safe
enough to defrag memory during the synchronous transparent hugepage page
faults (TRANSPARENT_HUGEPAGE_DEFRAG_FLAG), and not only during khugepaged's
asynchronous hugepage allocations, which were already enabled even before
memory compaction went in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
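
Mechanically, the flag decides whether the fault-time hugepage allocation may
wait (and hence compact). A helper of this exact shape is introduced by the
NUMA patch later in the series; it is shown here only to make the effect of
the new default concrete.

/* Illustrative only: defrag != 0 keeps __GFP_WAIT, so the allocation may run
 * memory compaction synchronously; defrag == 0 keeps it opportunistic.
 */
static inline gfp_t example_hugepage_gfpmask(int defrag)
{
        return GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT);
}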

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -28,6 +28,7 @@
  */
 unsigned long transparent_hugepage_flags __read_mostly =
 	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
 
 /* default scan 8*512 pte (or vmas) every 30 second */

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 53 of 66] add numa awareness to hugepage allocations
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

It's mostly a matter of introducing alloc_pages_vma and replacing alloc_pages
with it. khugepaged needs special handling: the allocation has to happen
inside collapse_huge_page, where the vma is known, and an error has to be
returned to the outer loop so it can sleep alloc_sleep_millisecs before
retrying after a failure. The more efficient allocation-failure handling in
khugepaged is retained for the CONFIG_NUMA=n case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
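
The resulting allocation entry point for THP, as a sketch (the vma, haddr and
defrag arguments are assumed to come from the fault path; names follow the
diff below):

/* Illustrative only: a mempolicy-aware HPAGE_PMD_ORDER (order 9 on x86)
 * allocation through the new alloc_pages_vma(), while alloc_page_vma()
 * keeps its old signature as an order-0 wrapper.
 */
static struct page *example_alloc_thp(struct vm_area_struct *vma,
                                      unsigned long haddr, int defrag)
{
        gfp_t gfp = GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT);

        return alloc_pages_vma(gfp, HPAGE_PMD_ORDER, vma, haddr);
}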

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -330,14 +330,17 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 {
 	return alloc_pages_current(gfp_mask, order);
 }
-extern struct page *alloc_page_vma(gfp_t gfp_mask,
+extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 			struct vm_area_struct *vma, unsigned long addr);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_pages_vma(gfp_mask, order, vma, addr)	\
+	alloc_pages(gfp_mask, order)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr)	\
+	alloc_pages_vma(gfp_mask, 0, vma, addr)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -620,11 +620,26 @@ static int __do_huge_pmd_anonymous_page(
 	return ret;
 }
 
+static inline gfp_t alloc_hugepage_gfpmask(int defrag)
+{
+	return GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT);
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+					      struct vm_area_struct *vma,
+					      unsigned long haddr)
+{
+	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
+			       HPAGE_PMD_ORDER, vma, haddr);
+}
+
+#ifndef CONFIG_NUMA
 static inline struct page *alloc_hugepage(int defrag)
 {
-	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
+	return alloc_pages(alloc_hugepage_gfpmask(defrag),
 			   HPAGE_PMD_ORDER);
 }
+#endif
 
 int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
@@ -639,7 +654,8 @@ int do_huge_pmd_anonymous_page(struct mm
 			return VM_FAULT_OOM;
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
-		page = alloc_hugepage(transparent_hugepage_defrag(vma));
+		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+					  vma, haddr);
 		if (unlikely(!page))
 			goto out;
 		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
@@ -858,7 +874,8 @@ int do_huge_pmd_wp_page(struct mm_struct
 
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
-		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
+		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+					      vma, haddr);
 	else
 		new_page = NULL;
 
@@ -1655,7 +1672,11 @@ static void collapse_huge_page(struct mm
 	unsigned long hstart, hend;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+#ifndef CONFIG_NUMA
 	VM_BUG_ON(!*hpage);
+#else
+	VM_BUG_ON(*hpage);
+#endif
 
 	/*
 	 * Prevent all access to pagetables with the exception of
@@ -1693,7 +1714,15 @@ static void collapse_huge_page(struct mm
 	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
 		goto out;
 
+#ifndef CONFIG_NUMA
 	new_page = *hpage;
+#else
+	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address);
+	if (unlikely(!new_page)) {
+		*hpage = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+#endif
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
 		goto out;
 
@@ -1724,6 +1753,9 @@ static void collapse_huge_page(struct mm
 		spin_unlock(&mm->page_table_lock);
 		anon_vma_unlock(vma->anon_vma);
 		mem_cgroup_uncharge_page(new_page);
+#ifdef CONFIG_NUMA
+		put_page(new_page);
+#endif
 		goto out;
 	}
 
@@ -1759,7 +1791,9 @@ static void collapse_huge_page(struct mm
 	mm->nr_ptes--;
 	spin_unlock(&mm->page_table_lock);
 
+#ifndef CONFIG_NUMA
 	*hpage = NULL;
+#endif
 	khugepaged_pages_collapsed++;
 out:
 	up_write(&mm->mmap_sem);
@@ -1995,11 +2029,16 @@ static void khugepaged_do_scan(struct pa
 	while (progress < pages) {
 		cond_resched();
 
+#ifndef CONFIG_NUMA
 		if (!*hpage) {
 			*hpage = alloc_hugepage(khugepaged_defrag());
 			if (unlikely(!*hpage))
 				break;
 		}
+#else
+		if (IS_ERR(*hpage))
+			break;
+#endif
 
 		spin_lock(&khugepaged_mm_lock);
 		if (!khugepaged_scan.mm_slot)
@@ -2014,37 +2053,55 @@ static void khugepaged_do_scan(struct pa
 	}
 }
 
+static void khugepaged_alloc_sleep(void)
+{
+	DEFINE_WAIT(wait);
+	add_wait_queue(&khugepaged_wait, &wait);
+	schedule_timeout_interruptible(
+		msecs_to_jiffies(
+			khugepaged_alloc_sleep_millisecs));
+	remove_wait_queue(&khugepaged_wait, &wait);
+}
+
+#ifndef CONFIG_NUMA
 static struct page *khugepaged_alloc_hugepage(void)
 {
 	struct page *hpage;
 
 	do {
 		hpage = alloc_hugepage(khugepaged_defrag());
-		if (!hpage) {
-			DEFINE_WAIT(wait);
-			add_wait_queue(&khugepaged_wait, &wait);
-			schedule_timeout_interruptible(
-				msecs_to_jiffies(
-					khugepaged_alloc_sleep_millisecs));
-			remove_wait_queue(&khugepaged_wait, &wait);
-		}
+		if (!hpage)
+			khugepaged_alloc_sleep();
 	} while (unlikely(!hpage) &&
 		 likely(khugepaged_enabled()));
 	return hpage;
 }
+#endif
 
 static void khugepaged_loop(void)
 {
 	struct page *hpage;
 
+#ifdef CONFIG_NUMA
+	hpage = NULL;
+#endif
 	while (likely(khugepaged_enabled())) {
+#ifndef CONFIG_NUMA
 		hpage = khugepaged_alloc_hugepage();
 		if (unlikely(!hpage))
 			break;
+#else
+		if (IS_ERR(hpage)) {
+			khugepaged_alloc_sleep();
+			hpage = NULL;
+		}
+#endif
 
 		khugepaged_do_scan(&hpage);
+#ifndef CONFIG_NUMA
 		if (hpage)
 			put_page(hpage);
+#endif
 		if (khugepaged_has_work()) {
 			DEFINE_WAIT(wait);
 			if (!khugepaged_scan_sleep_millisecs)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1794,7 +1794,7 @@ static struct page *alloc_page_interleav
 }
 
 /**
- * 	alloc_page_vma	- Allocate a page for a VMA.
+ * 	alloc_pages_vma	- Allocate a page for a VMA.
  *
  * 	@gfp:
  *      %GFP_USER    user allocation.
@@ -1803,6 +1803,7 @@ static struct page *alloc_page_interleav
  *      %GFP_FS      allocation should not call back into a file system.
  *      %GFP_ATOMIC  don't sleep.
  *
+ *	@order:Order of the GFP allocation.
  * 	@vma:  Pointer to VMA or NULL if not available.
  *	@addr: Virtual Address of the allocation. Must be inside the VMA.
  *
@@ -1816,7 +1817,8 @@ static struct page *alloc_page_interleav
  *	Should be called with the mm_sem of the vma hold.
  */
 struct page *
-alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
+alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
+		unsigned long addr)
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 	struct zonelist *zl;
@@ -1828,7 +1830,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		mpol_cond_put(pol);
-		page = alloc_page_interleave(gfp, 0, nid);
+		page = alloc_page_interleave(gfp, order, nid);
 		put_mems_allowed();
 		return page;
 	}
@@ -1837,7 +1839,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		/*
 		 * slow path: ref counted shared policy
 		 */
-		struct page *page =  __alloc_pages_nodemask(gfp, 0,
+		struct page *page =  __alloc_pages_nodemask(gfp, order,
 						zl, policy_nodemask(gfp, pol));
 		__mpol_put(pol);
 		put_mems_allowed();
@@ -1846,7 +1848,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	/*
 	 * fast path:  default or task policy
 	 */
-	page = __alloc_pages_nodemask(gfp, 0, zl, policy_nodemask(gfp, pol));
+	page = __alloc_pages_nodemask(gfp, order, zl,
+				      policy_nodemask(gfp, pol));
 	put_mems_allowed();
 	return page;
 }

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 54 of 66] transparent hugepage config choice
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Allow choosing between the always|madvise defaults for page faults and
khugepaged at config time. madvise guarantees zero risk of a higher memory
footprint for applications (hugepages will only back virtual regions that
the application explicitly marked with madvise(MADV_HUGEPAGE), so no other
region risks using any more memory).

Initially set the default to N and don't depend on EMBEDDED.
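
To make the madvise default concrete, here is a minimal user-space sketch of
opting a region in (illustrative only; it assumes MADV_HUGEPAGE is provided
by the installed kernel headers, the fallback define is just for the sketch):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* illustrative fallback, check your headers */
#endif

#define HPAGE_SIZE	(2UL * 1024 * 1024)

int main(void)
{
	void *p;
	size_t len = 16 * HPAGE_SIZE;

	/* 2M-aligned region so page faults/khugepaged can use huge pmds */
	if (posix_memalign(&p, HPAGE_SIZE, len))
		return 1;

	/* opt this region in; with the madvise default nothing else changes */
	madvise(p, len, MADV_HUGEPAGE);

	memset(p, 0, len);	/* touch it so it gets backed */
	return 0;
}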

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -303,9 +303,8 @@ config NOMMU_INITIAL_TRIM_EXCESS
 	  See Documentation/nommu-mmap.txt for more information.
 
 config TRANSPARENT_HUGEPAGE
-	bool "Transparent Hugepage Support" if EMBEDDED
+	bool "Transparent Hugepage Support"
 	depends on X86 && MMU
-	default y
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
 	  huge tlb transparently to the applications whenever possible.
@@ -316,6 +315,30 @@ config TRANSPARENT_HUGEPAGE
 
 	  If memory constrained on embedded, you may want to say N.
 
+choice
+	prompt "Transparent Hugepage Support sysfs defaults"
+	depends on TRANSPARENT_HUGEPAGE
+	default TRANSPARENT_HUGEPAGE_ALWAYS
+	help
+	  Selects the sysfs defaults for Transparent Hugepage Support.
+
+	config TRANSPARENT_HUGEPAGE_ALWAYS
+		bool "always"
+	help
+	  Enabling Transparent Hugepage always can increase the
+	  memory footprint of applications without a guaranteed
+	  benefit, but it will work automatically for all applications.
+
+	config TRANSPARENT_HUGEPAGE_MADVISE
+		bool "madvise"
+	help
+	  Enabling Transparent Hugepage madvise will only provide a
+	  performance benefit to the applications using
+	  madvise(MADV_HUGEPAGE), but it won't risk increasing the
+	  memory footprint of applications without a guaranteed
+	  benefit.
+endchoice
+
 #
 # UP and nommu archs use km based percpu allocator
 #
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -27,7 +27,12 @@
  * allocations.
  */
 unsigned long transparent_hugepage_flags __read_mostly =
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
 	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
+	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
+#endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

With transparent hugepage support we need compaction for the "defrag" sysfs
controls to be effective.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/Kconfig b/mm/Kconfig
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -305,6 +305,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
 	depends on X86 && MMU
+	select COMPACTION
 	help
 	  Transparent Hugepages allows the kernel to use huge pages and
 	  huge tlb transparently to the applications whenever possible.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 56 of 66] transhuge isolate_migratepages()
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

It's not worth migrating transparent hugepages during compaction. Those
hugepages don't create fragmentation.
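
The skip in the hunk below relies on the scan loop doing its own low_pfn++ on
the next iteration; a standalone sketch of the arithmetic (not kernel code):

/*
 * Given the head page's order, return the pfn of the last subpage, so
 * that the caller's low_pfn++ resumes scanning right after the THP.
 * e.g. order 9 (2M) at pfn 0x1000: returns 0x11ff, next scan is 0x1200.
 */
static unsigned long skip_trans_huge(unsigned long low_pfn, unsigned int order)
{
	return low_pfn + (1UL << order) - 1;
}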

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -272,10 +272,25 @@ static unsigned long isolate_migratepage
 		if (PageBuddy(page))
 			continue;
 
+		if (!PageLRU(page))
+			continue;
+
+		/*
+		 * PageLRU is set, and lru_lock excludes isolation,
+		 * splitting and collapsing (collapsing has already
+		 * happened if PageLRU is set).
+		 */
+		if (PageTransHuge(page)) {
+			low_pfn += (1 << compound_order(page)) - 1;
+			continue;
+		}
+
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
 			continue;
 
+		VM_BUG_ON(PageTransCompound(page));
+
 		/* Successfully isolated */
 		del_page_from_lru_list(zone, page, page_lru(page));
 		list_add(&page->lru, migratelist);

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 57 of 66] avoid breaking huge pmd invariants in case of vma_adjust failures
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

A huge pmd can only be mapped if the corresponding 2M virtual range
is fully contained in the vma. At times the VM calls split_vma twice:
if the first split_vma succeeds and the second fails, the first
split_vma remains in effect and is not rolled back. For split_vma or
vma_adjust to fail, an allocation failure is needed, so it's a very
unlikely event (the out-of-memory killer would normally fire before
any allocation failure becomes visible to kernel and userland, and if
an out-of-memory condition happens it's unlikely to happen exactly
here). Nevertheless it's safer to ensure that no huge pmd can be left
around if the vma is adjusted in a way that can no longer fit
hugepages at the new vm_start/vm_end addresses.
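
For clarity, the alignment test used by __vma_adjust_trans_huge() below boils
down to the following check (standalone sketch assuming the usual 2M huge pmd
size):

#include <stdbool.h>

#define HPAGE_PMD_SIZE	(2UL * 1024 * 1024)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

/*
 * The huge pmd covering 'addr' can only stay mapped if the whole
 * naturally aligned 2M range around it fits in [vm_start, vm_end).
 * If a new boundary is not 2M aligned but this check passes for the
 * old vma, the huge pmd must be split.
 */
static bool hpage_range_fits(unsigned long addr, unsigned long vm_start,
			     unsigned long vm_end)
{
	unsigned long haddr = addr & HPAGE_PMD_MASK;

	return haddr >= vm_start && haddr + HPAGE_PMD_SIZE <= vm_end;
}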

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -106,6 +106,19 @@ extern void __split_huge_page_pmd(struct
 
 extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
 extern int hugepage_madvise(unsigned long *vm_flags);
+extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+				    unsigned long start,
+				    unsigned long end,
+				    long adjust_next);
+static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
+					 unsigned long start,
+					 unsigned long end,
+					 long adjust_next)
+{
+	if (!vma->anon_vma || vma->vm_ops || vma->vm_file)
+		return;
+	__vma_adjust_trans_huge(vma, start, end, adjust_next);
+}
 static inline int PageTransHuge(struct page *page)
 {
 	VM_BUG_ON(PageTail(page));
@@ -138,6 +151,12 @@ static inline int hugepage_madvise(unsig
 	BUG_ON(0);
 	return 0;
 }
+static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
+					 unsigned long start,
+					 unsigned long end,
+					 long adjust_next)
+{
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1071,8 +1071,16 @@ pmd_t *page_check_address_pmd(struct pag
 		goto out;
 	if (pmd_page(*pmd) != page)
 		goto out;
-	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
-		  pmd_trans_splitting(*pmd));
+	/*
+	 * split_vma() may create temporary aliased mappings. There is
+	 * no risk as long as all huge pmd are found and have their
+	 * splitting bit set before __split_huge_page_refcount
+	 * runs. Finding the same huge pmd more than once during the
+	 * same rmap walk is not a problem.
+	 */
+	if (flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
+	    pmd_trans_splitting(*pmd))
+		goto out;
 	if (pmd_trans_huge(*pmd)) {
 		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
 			  !pmd_trans_splitting(*pmd));
@@ -2174,3 +2182,71 @@ void __split_huge_page_pmd(struct mm_str
 	put_page(page);
 	BUG_ON(pmd_trans_huge(*pmd));
 }
+
+static void split_huge_page_address(struct mm_struct *mm,
+				    unsigned long address)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return;
+
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return;
+
+	pmd = pmd_offset(pud, address);
+	if (!pmd_present(*pmd))
+		return;
+	/*
+	 * Caller holds the mmap_sem write mode, so a huge pmd cannot
+	 * materialize from under us.
+	 */
+	split_huge_page_pmd(mm, pmd);
+}
+
+void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+			     unsigned long start,
+			     unsigned long end,
+			     long adjust_next)
+{
+	/*
+	 * If the new start address isn't hpage aligned and it could
+	 * previously contain an hugepage: check if we need to split
+	 * an huge pmd.
+	 */
+	if (start & ~HPAGE_PMD_MASK &&
+	    (start & HPAGE_PMD_MASK) >= vma->vm_start &&
+	    (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
+		split_huge_page_address(vma->vm_mm, start);
+
+	/*
+	 * If the new end address isn't hpage aligned and it could
+	 * previously contain an hugepage: check if we need to split
+	 * an huge pmd.
+	 */
+	if (end & ~HPAGE_PMD_MASK &&
+	    (end & HPAGE_PMD_MASK) >= vma->vm_start &&
+	    (end & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
+		split_huge_page_address(vma->vm_mm, end);
+
+	/*
+	 * If we're also updating the vma->vm_next->vm_start, if the new
+	 * vm_next->vm_start isn't page aligned and it could previously
+	 * contain an hugepage: check if we need to split an huge pmd.
+	 */
+	if (adjust_next > 0) {
+		struct vm_area_struct *next = vma->vm_next;
+		unsigned long nstart = next->vm_start;
+		nstart += adjust_next << PAGE_SHIFT;
+		if (nstart & ~HPAGE_PMD_MASK &&
+		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
+		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
+			split_huge_page_address(next->vm_mm, nstart);
+	}
+}
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -589,6 +589,8 @@ again:			remove_next = 1 + (end > next->
 		}
 	}
 
+	vma_adjust_trans_huge(vma, start, end, adjust_next);
+
 	/*
 	 * When changing only vma->vm_end, we don't really need anon_vma
 	 * lock. This is a fairly rare case by itself, but the anon_vma

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 58 of 66] don't allow transparent hugepage support without PSE
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Archs implementing Transparent Hugepage Support must provide a
function called has_transparent_hugepage() so that the core code can
check whether the virtual or physical CPU supports Transparent
Hugepages.
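
For a hypothetical arch where every supported CPU can map huge pmds, the hook
could be as simple as the sketch below (the real x86 version, which checks
PSE, is in the diff):

/* in the arch's <asm/pgtable.h>, under CONFIG_TRANSPARENT_HUGEPAGE */
static inline int has_transparent_hugepage(void)
{
	/* return 1 only if the virtual or physical CPU can map huge pmds */
	return 1;
}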

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -160,6 +160,11 @@ static inline int pmd_trans_huge(pmd_t p
 {
 	return pmd_val(pmd) & _PAGE_PSE;
 }
+
+static inline int has_transparent_hugepage(void)
+{
+	return cpu_has_pse;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -487,7 +487,15 @@ static int __init hugepage_init(void)
 	int err;
 #ifdef CONFIG_SYSFS
 	static struct kobject *hugepage_kobj;
+#endif
 
+	err = -EINVAL;
+	if (!has_transparent_hugepage()) {
+		transparent_hugepage_flags = 0;
+		goto out;
+	}
+
+#ifdef CONFIG_SYSFS
 	err = -ENOMEM;
 	hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
 	if (unlikely(!hugepage_kobj)) {

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 59 of 66] mmu_notifier_test_young
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

For GRU and EPT, we need gup-fast to set the referenced bit too (this is why
it's correct to return 0 when shadow_accessed_mask is zero: it requires
gup-fast to set the referenced bit). qemu-kvm access already sets the young
bit in the pte if it isn't zero-copy; if it's zero-copy or a shadow paging
EPT minor fault, we rely on gup-fast to signal that the page is in use.

We also need to check the young bits in the secondary pagetables for NPT and
not nested shadow mmu, as the data may never get accessed again through the
primary pte.

Without this extra accuracy, we'd have to remove the heuristic that avoids
collapsing hugepages in hugepage virtual regions that don't have even a
single subpage in use.

->test_young is fully backwards compatible with GRU and other users that
don't have young bits set in the pagetables by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs, just like EPT
does.

Completely removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page probably wouldn't be so bad either, but I
thought it was worth keeping, and this makes it reliable.
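
A sketch of how a secondary-MMU driver would wire up the new hook (all my_*
names are hypothetical; only ->test_young is shown):

struct my_dev {
	struct mmu_notifier mn;
	/* ... driver state ... */
};

static int my_dev_test_young(struct mmu_notifier *mn, struct mm_struct *mm,
			     unsigned long address)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	/*
	 * Read-only probe: report whether our secondary pte for 'address'
	 * has the accessed bit set, without clearing it and without
	 * tearing down the secondary mapping.
	 */
	return my_dev_spte_accessed(dev, address);	/* hypothetical helper */
}

static const struct mmu_notifier_ops my_dev_mmu_ops = {
	.clear_flush_young	= my_dev_clear_flush_young,	/* hypothetical */
	.test_young		= my_dev_test_young,
};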

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -788,6 +788,7 @@ asmlinkage void kvm_handle_fault_on_rebo
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva);
 int kvm_age_hva(struct kvm *kvm, unsigned long hva);
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 int cpuid_maxphyaddr(struct kvm_vcpu *vcpu);
 int kvm_cpu_has_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -959,6 +959,35 @@ static int kvm_age_rmapp(struct kvm *kvm
 	return young;
 }
 
+static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
+			      unsigned long data)
+{
+	u64 *spte;
+	int young = 0;
+
+	/*
+	 * If there's no access bit in the secondary pte set by the
+	 * hardware it's up to gup-fast/gup to set the access bit in
+	 * the primary pte or in the page structure.
+	 */
+	if (!shadow_accessed_mask)
+		goto out;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		u64 _spte = *spte;
+		BUG_ON(!(_spte & PT_PRESENT_MASK));
+		young = _spte & PT_ACCESSED_MASK;
+		if (young) {
+			young = 1;
+			break;
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+out:
+	return young;
+}
+
 #define RMAP_RECYCLE_THRESHOLD 1000
 
 static void rmap_recycle(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
@@ -979,6 +1008,11 @@ int kvm_age_hva(struct kvm *kvm, unsigne
 	return kvm_handle_hva(kvm, hva, 0, kvm_age_rmapp);
 }
 
+int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	return kvm_handle_hva(kvm, hva, 0, kvm_test_age_rmapp);
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -8,6 +8,7 @@
 #include <linux/mm.h>
 #include <linux/vmstat.h>
 #include <linux/highmem.h>
+#include <linux/swap.h>
 
 #include <asm/pgtable.h>
 
@@ -89,6 +90,7 @@ static noinline int gup_pte_range(pmd_t 
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 		get_page(page);
+		SetPageReferenced(page);
 		pages[*nr] = page;
 		(*nr)++;
 
@@ -103,6 +105,7 @@ static inline void get_head_page_multipl
 	VM_BUG_ON(page != compound_head(page));
 	VM_BUG_ON(page_count(page) == 0);
 	atomic_add(nr, &page->_count);
+	SetPageReferenced(page);
 }
 
 static inline void pin_huge_page_tail(struct page *page)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -62,6 +62,16 @@ struct mmu_notifier_ops {
 				 unsigned long address);
 
 	/*
+	 * test_young is called to check the young/accessed bitflag in
+	 * the secondary pte. This is used to know if the page is
+	 * frequently used without actually clearing the flag or tearing
+	 * down the secondary mapping on the page.
+	 */
+	int (*test_young)(struct mmu_notifier *mn,
+			  struct mm_struct *mm,
+			  unsigned long address);
+
+	/*
 	 * change_pte is called in cases that pte mapping to page is changed:
 	 * for example, when ksm remaps pte to point to a new shared page.
 	 */
@@ -163,6 +173,8 @@ extern void __mmu_notifier_mm_destroy(st
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long address);
+extern int __mmu_notifier_test_young(struct mm_struct *mm,
+				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
@@ -186,6 +198,14 @@ static inline int mmu_notifier_clear_flu
 	return 0;
 }
 
+static inline int mmu_notifier_test_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	if (mm_has_notifiers(mm))
+		return __mmu_notifier_test_young(mm, address);
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
@@ -313,6 +333,12 @@ static inline int mmu_notifier_clear_flu
 	return 0;
 }
 
+static inline int mmu_notifier_test_young(struct mm_struct *mm,
+					  unsigned long address)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 					   unsigned long address, pte_t pte)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1626,7 +1626,8 @@ static int __collapse_huge_page_isolate(
 		VM_BUG_ON(PageLRU(page));
 
 		/* If there is no mapped pte young don't collapse the page */
-		if (pte_young(pteval))
+		if (pte_young(pteval) || PageReferenced(page) ||
+		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = 1;
 	}
 	if (unlikely(!referenced))
@@ -1869,7 +1870,8 @@ static int khugepaged_scan_pmd(struct mm
 		/* cannot use mapcount: can't collapse if there's a gup pin */
 		if (page_count(page) != 1)
 			goto out_unmap;
-		if (pte_young(pteval))
+		if (pte_young(pteval) || PageReferenced(page) ||
+		    mmu_notifier_test_young(vma->vm_mm, address))
 			referenced = 1;
 	}
 	if (referenced)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -100,6 +100,26 @@ int __mmu_notifier_clear_flush_young(str
 	return young;
 }
 
+int __mmu_notifier_test_young(struct mm_struct *mm,
+			      unsigned long address)
+{
+	struct mmu_notifier *mn;
+	struct hlist_node *n;
+	int young = 0;
+
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->test_young) {
+			young = mn->ops->test_young(mn, mm, address);
+			if (young)
+				break;
+		}
+	}
+	rcu_read_unlock();
+
+	return young;
+}
+
 void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 			       pte_t pte)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -372,6 +372,22 @@ static int kvm_mmu_notifier_clear_flush_
 	return young;
 }
 
+static int kvm_mmu_notifier_test_young(struct mmu_notifier *mn,
+				       struct mm_struct *mm,
+				       unsigned long address)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	int young, idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	young = kvm_test_age_hva(kvm, address);
+	spin_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
+
+	return young;
+}
+
 static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 				     struct mm_struct *mm)
 {
@@ -388,6 +404,7 @@ static const struct mmu_notifier_ops kvm
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
+	.test_young		= kvm_mmu_notifier_test_young,
 	.change_pte		= kvm_mmu_notifier_change_pte,
 	.release		= kvm_mmu_notifier_release,
 };

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 60 of 66] freeze khugepaged and ksmd
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

It's unclear why schedule-friendly kernel threads can't simply be taken off
the CPU by the scheduler itself. Still, it's safer to stop them, as they can
trigger memory allocations; if kswapd also freezes itself to avoid generating
I/O, these threads have to as well.
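
For context, the canonical shape of a freezable kthread main loop that the
hunks below move khugepaged and ksmd towards (generic sketch, my_* names are
placeholders):

#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wait);

static int my_scan_thread(void *unused)
{
	set_freezable();		/* opt in to the freezer */

	while (!kthread_should_stop()) {
		my_do_one_pass();	/* placeholder work function */

		/* park here across suspend instead of allocating memory */
		try_to_freeze();

		/* sleep until there is work, the freezer, or kthread_stop() */
		wait_event_freezable(my_wait,
				     my_has_work() || kthread_should_stop());
	}
	return 0;
}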

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -15,6 +15,7 @@
 #include <linux/mm_inline.h>
 #include <linux/kthread.h>
 #include <linux/khugepaged.h>
+#include <linux/freezer.h>
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
 #include "internal.h"
@@ -2063,6 +2064,9 @@ static void khugepaged_do_scan(struct pa
 			break;
 #endif
 
+		if (unlikely(kthread_should_stop() || freezing(current)))
+			break;
+
 		spin_lock(&khugepaged_mm_lock);
 		if (!khugepaged_scan.mm_slot)
 			pass_through_head++;
@@ -2125,6 +2129,9 @@ static void khugepaged_loop(void)
 		if (hpage)
 			put_page(hpage);
 #endif
+		try_to_freeze();
+		if (unlikely(kthread_should_stop()))
+			break;
 		if (khugepaged_has_work()) {
 			DEFINE_WAIT(wait);
 			if (!khugepaged_scan_sleep_millisecs)
@@ -2135,8 +2142,8 @@ static void khugepaged_loop(void)
 					khugepaged_scan_sleep_millisecs));
 			remove_wait_queue(&khugepaged_wait, &wait);
 		} else if (khugepaged_enabled())
-			wait_event_interruptible(khugepaged_wait,
-						 khugepaged_wait_event());
+			wait_event_freezable(khugepaged_wait,
+					     khugepaged_wait_event());
 	}
 }
 
@@ -2144,6 +2151,7 @@ static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
 
+	set_freezable();
 	set_user_nice(current, 19);
 
 	/* serialize with start_khugepaged() */
@@ -2158,6 +2166,8 @@ static int khugepaged(void *none)
 		mutex_lock(&khugepaged_mutex);
 		if (!khugepaged_enabled())
 			break;
+		if (unlikely(kthread_should_stop()))
+			break;
 	}
 
 	spin_lock(&khugepaged_mm_lock);
diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -34,6 +34,7 @@
 #include <linux/swap.h>
 #include <linux/ksm.h>
 #include <linux/hash.h>
+#include <linux/freezer.h>
 
 #include <asm/tlbflush.h>
 #include "internal.h"
@@ -1374,7 +1375,7 @@ static void ksm_do_scan(unsigned int sca
 	struct rmap_item *rmap_item;
 	struct page *uninitialized_var(page);
 
-	while (scan_npages--) {
+	while (scan_npages-- && likely(!freezing(current))) {
 		cond_resched();
 		rmap_item = scan_get_next_rmap_item(&page);
 		if (!rmap_item)
@@ -1392,6 +1393,7 @@ static int ksmd_should_run(void)
 
 static int ksm_scan_thread(void *nothing)
 {
+	set_freezable();
 	set_user_nice(current, 5);
 
 	while (!kthread_should_stop()) {
@@ -1400,11 +1402,13 @@ static int ksm_scan_thread(void *nothing
 			ksm_do_scan(ksm_thread_pages_to_scan);
 		mutex_unlock(&ksm_thread_mutex);
 
+		try_to_freeze();
+
 		if (ksmd_should_run()) {
 			schedule_timeout_interruptible(
 				msecs_to_jiffies(ksm_thread_sleep_millisecs));
 		} else {
-			wait_event_interruptible(ksm_thread_wait,
+			wait_event_freezable(ksm_thread_wait,
 				ksmd_should_run() || kthread_should_stop());
 		}
 	}

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

This takes advantage of memory compaction to properly generate pages of
order > 0 when regular page reclaim fails, the priority level becomes more
severe, and we still haven't reached the required watermarks.
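
Roughly, the newly exported compact_zone_order() is meant to be callable like
this from the kswapd side (illustrative sketch only, not an actual vmscan.c
hunk):

/* illustrative: how kswapd-side code could use the new KSWAPD mode */
static void kswapd_compact_zone(struct zone *zone, int order, gfp_t gfp_mask)
{
	if (!order)
		return;

	/*
	 * In KSWAPD mode compact_finished() keeps compacting until the
	 * zone is above the high watermark, leaving a pool of free
	 * high order pages for GFP_ATOMIC allocations.
	 */
	compact_zone_order(zone, order, gfp_mask, COMPACT_MODE_KSWAPD);
}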

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -11,6 +11,9 @@
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	3
 
+#define COMPACT_MODE_DIRECT_RECLAIM	0
+#define COMPACT_MODE_KSWAPD		1
+
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
@@ -20,6 +23,9 @@ extern int sysctl_extfrag_handler(struct
 			void __user *buffer, size_t *length, loff_t *ppos);
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long compact_zone_order(struct zone *zone,
+					int order, gfp_t gfp_mask,
+					int compact_mode);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
 
@@ -59,6 +65,13 @@ static inline unsigned long try_to_compa
 	return COMPACT_CONTINUE;
 }
 
+static inline unsigned long compact_zone_order(struct zone *zone,
+					       int order, gfp_t gfp_mask,
+					       int compact_mode)
+{
+	return COMPACT_CONTINUE;
+}
+
 static inline void defer_compaction(struct zone *zone)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -38,6 +38,8 @@ struct compact_control {
 	unsigned int order;		/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
+
+	int compact_mode;
 };
 
 static unsigned long release_freepages(struct list_head *freelist)
@@ -357,10 +359,10 @@ static void update_nr_listpages(struct c
 }
 
 static int compact_finished(struct zone *zone,
-						struct compact_control *cc)
+			    struct compact_control *cc)
 {
 	unsigned int order;
-	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+	unsigned long watermark;
 
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
@@ -370,12 +372,27 @@ static int compact_finished(struct zone 
 		return COMPACT_COMPLETE;
 
 	/* Compaction run is not finished if the watermark is not met */
+	if (cc->compact_mode != COMPACT_MODE_KSWAPD)
+		watermark = low_wmark_pages(zone);
+	else
+		watermark = high_wmark_pages(zone);
+	watermark += (1 << cc->order);
+
 	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
 		return COMPACT_CONTINUE;
 
 	if (cc->order == -1)
 		return COMPACT_CONTINUE;
 
+	/*
+	 * Generating only one page of the right order is not enough
+	 * for kswapd, we must continue until we're above the high
+	 * watermark as a pool for high order GFP_ATOMIC allocations
+	 * too.
+	 */
+	if (cc->compact_mode == COMPACT_MODE_KSWAPD)
+		return COMPACT_CONTINUE;
+
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		/* Job done if page is free of the right migratetype */
@@ -433,8 +450,9 @@ static int compact_zone(struct zone *zon
 	return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone,
-						int order, gfp_t gfp_mask)
+unsigned long compact_zone_order(struct zone *zone,
+				 int order, gfp_t gfp_mask,
+				 int compact_mode)
 {
 	struct compact_control cc = {
 		.nr_freepages = 0,
@@ -442,6 +460,7 @@ static unsigned long compact_zone_order(
 		.order = order,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
+		.compact_mode = compact_mode,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -476,7 +495,7 @@ unsigned long try_to_compact_pages(struc
 	 * made because an assumption is made that the page allocator can satisfy
 	 * the "cheaper" orders without taking special steps
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER || !may_enter_fs || !may_perform_io)
+	if (!order || !may_enter_fs || !may_perform_io)
 		return rc;
 
 	count_vm_event(COMPACTSTALL);
@@ -517,7 +536,8 @@ unsigned long try_to_compact_pages(struc
 			break;
 		}
 
-		status = compact_zone_order(zone, order, gfp_mask);
+		status = compact_zone_order(zone, order, gfp_mask,
+					    COMPACT_MODE_DIRECT_RECLAIM);
 		rc = max(status, rc);
 
 		if (zone_watermark_ok(zone, order, watermark, 0, 0))
@@ -547,6 +567,7 @@ static int compact_node(int nid)
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
 			.order = -1,
+			.compact_mode = COMPACT_MODE_DIRECT_RECLAIM,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/compaction.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2254,6 +2255,7 @@ loop_again:
 		 * cause too much scanning of the lower zones.
 		 */
 		for (i = 0; i <= end_zone; i++) {
+			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
 
@@ -2283,9 +2285,26 @@ loop_again:
 						lru_pages);
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
+
+			compaction = 0;
+			if (order &&
+			    zone_watermark_ok(zone, 0,
+					       high_wmark_pages(zone),
+					      end_zone, 0) &&
+			    !zone_watermark_ok(zone, order,
+					       high_wmark_pages(zone),
+					       end_zone, 0)) {
+				compact_zone_order(zone,
+						   order,
+						   sc.gfp_mask,
+						   COMPACT_MODE_KSWAPD);
+				compaction = 1;
+			}
+
 			if (zone->all_unreclaimable)
 				continue;
-			if (nr_slab == 0 && !zone_reclaimable(zone))
+			if (!compaction && nr_slab == 0 &&
+			    !zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
 			 * If we've done a decent amount of scanning and
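
To illustrate the kind of caller this helps (a made-up example, not taken
from the patch): an atomic, order > 0 allocation such as a jumbo-frame
receive buffer cannot reclaim or compact on its own, but it now benefits
from kswapd compacting until the high watermark is met at the requested
order:

#include <linux/gfp.h>

/* Hypothetical driver path: order-2 means 16KB with 4K base pages. */
static struct page *example_alloc_rx_buffer(void)
{
	/* may still return NULL; atomic callers must handle failure */
	return alloc_pages(GFP_ATOMIC, 2);
}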

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 62 of 66] disable transparent hugepages by default on small systems
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Rik van Riel <riel@redhat.com>

On small systems, the extra memory used by the anti-fragmentation
memory reserve, together with the internal fragmentation from huge
pages being larger than regular pages, can easily outweigh the benefit
of fewer TLB misses.

In the case of the crashdump kernel, OOMs have been observed because the
anti-fragmentation memory reserve takes up a large fraction of the
crashdump image.

This patch disables transparent hugepages by default on systems with
less than 1GB of RAM, but the hugepage subsystem is still fully
initialized so administrators can enable THP through /sys if desired.

v2: reduce the limit to 512MB

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -527,6 +527,14 @@ static int __init hugepage_init(void)
 		goto out;
 	}
 
+	/*
+	 * By default disable transparent hugepages on smaller systems,
+	 * where the extra memory used could hurt more than TLB overhead
+	 * is likely to save.  The admin can still enable it through /sys.
+	 */
+	if (totalram_pages < (512 << (20 - PAGE_SHIFT)))
+		transparent_hugepage_flags = 0;
+
 	start_khugepaged();
 
 	set_recommended_min_free_kbytes();
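
As a quick sanity check of the threshold above, assuming 4K base pages
(PAGE_SHIFT == 12):

	512 << (20 - PAGE_SHIFT) = 512 << 8 = 131072 pages
	131072 pages * 4KB/page  = 512MB

so THP stays compiled in and fully initialized, but
transparent_hugepage_flags is cleared (default off) when totalram_pages
reports less than 512MB; the admin can still turn it back on through
/sys/kernel/mm/transparent_hugepage/enabled.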

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 63 of 66] fix anon memory statistics with transparent hugepages
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Rik van Riel <riel@redhat.com>

Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
statistics, so the Active(anon) and Inactive(anon) statistics in
/proc/meminfo are correct.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -128,11 +128,19 @@ static inline int PageTransCompound(stru
 {
 	return PageCompound(page);
 }
+static inline int hpage_nr_pages(struct page *page)
+{
+	if (unlikely(PageTransHuge(page)))
+		return HPAGE_PMD_NR;
+	return 1;
+}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUG(); 0; })
 #define HPAGE_PMD_SIZE ({ BUG(); 0; })
 
+#define hpage_nr_pages(x) 1
+
 #define transparent_hugepage_enabled(__vma) 0
 
 #define transparent_hugepage_flags 0UL
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -1,6 +1,8 @@
 #ifndef LINUX_MM_INLINE_H
 #define LINUX_MM_INLINE_H
 
+#include <linux/huge_mm.h>
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -24,7 +26,7 @@ __add_page_to_lru_list(struct zone *zone
 		       struct list_head *head)
 {
 	list_add(&page->lru, head);
-	__inc_zone_state(zone, NR_LRU_BASE + l);
+	__mod_zone_page_state(zone, NR_LRU_BASE + l, hpage_nr_pages(page));
 	mem_cgroup_add_lru_list(page, l);
 }
 
@@ -38,7 +40,7 @@ static inline void
 del_page_from_lru_list(struct zone *zone, struct page *page, enum lru_list l)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_LRU_BASE + l);
+	__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
 	mem_cgroup_del_lru_list(page, l);
 }
 
@@ -73,7 +75,7 @@ del_page_from_lru(struct zone *zone, str
 			l += LRU_ACTIVE;
 		}
 	}
-	__dec_zone_state(zone, NR_LRU_BASE + l);
+	__mod_zone_page_state(zone, NR_LRU_BASE + l, -hpage_nr_pages(page));
 	mem_cgroup_del_lru_list(page, l);
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1139,6 +1139,7 @@ static void __split_huge_page_refcount(s
 	int i;
 	unsigned long head_index = page->index;
 	struct zone *zone = page_zone(page);
+	int zonestat;
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
 	spin_lock_irq(&zone->lru_lock);
@@ -1203,6 +1204,15 @@ static void __split_huge_page_refcount(s
 	__dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
 	__mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
 
+	/*
+	 * A hugepage counts for HPAGE_PMD_NR pages on the LRU statistics,
+	 * so adjust those appropriately if this page is on the LRU.
+	 */
+	if (PageLRU(page)) {
+		zonestat = NR_LRU_BASE + page_lru(page);
+		__mod_zone_page_state(zone, zonestat, -(HPAGE_PMD_NR-1));
+	}
+
 	ClearPageCompound(page);
 	compound_unlock(page);
 	spin_unlock_irq(&zone->lru_lock);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1083,7 +1083,7 @@ unsigned long mem_cgroup_isolate_pages(u
 		case 0:
 			list_move(&page->lru, dst);
 			mem_cgroup_del_lru(page);
-			nr_taken++;
+			nr_taken += hpage_nr_pages(page);
 			break;
 		case -EBUSY:
 			/* we don't affect global LRU but rotate in our LRU */
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1028,7 +1028,7 @@ static unsigned long isolate_lru_pages(u
 		case 0:
 			list_move(&page->lru, dst);
 			mem_cgroup_del_lru(page);
-			nr_taken++;
+			nr_taken += hpage_nr_pages(page);
 			break;
 
 		case -EBUSY:
@@ -1086,7 +1086,7 @@ static unsigned long isolate_lru_pages(u
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
-				nr_taken++;
+				nr_taken += hpage_nr_pages(page);
 				nr_lumpy_taken++;
 				if (PageDirty(cursor_page))
 					nr_lumpy_dirty++;
@@ -1141,14 +1141,15 @@ static unsigned long clear_active_flags(
 	struct page *page;
 
 	list_for_each_entry(page, page_list, lru) {
+		int numpages = hpage_nr_pages(page);
 		lru = page_lru_base_type(page);
 		if (PageActive(page)) {
 			lru += LRU_ACTIVE;
 			ClearPageActive(page);
-			nr_active++;
+			nr_active += numpages;
 		}
 		if (count)
-			count[lru]++;
+			count[lru] += numpages;
 	}
 
 	return nr_active;
@@ -1466,7 +1467,7 @@ static void move_active_pages_to_lru(str
 
 		list_move(&page->lru, &zone->lru[lru].list);
 		mem_cgroup_add_lru_list(page, lru);
-		pgmoved++;
+		pgmoved += hpage_nr_pages(page);
 
 		if (!pagevec_add(&pvec, page) || list_empty(list)) {
 			spin_unlock_irq(&zone->lru_lock);
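
For concreteness, on x86_64 with 2MB transparent hugepages HPAGE_PMD_NR is
512, so a THP on an anon LRU now contributes 512 to the corresponding LRU
counter instead of 1.  After split_huge_page() the 512 subpages are counted
individually again, which is why __split_huge_page_refcount() subtracts
HPAGE_PMD_NR - 1 = 511 from the LRU statistic of the head page's list.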

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 64 of 66] scale nr_rotated to balance memory pressure
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Rik van Riel <riel@redhat.com>

Make sure we scale up nr_rotated when we encounter a referenced
transparent huge page.  This ensures pageout scanning balance
is not distorted when there are huge pages on the LRU.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1259,7 +1259,8 @@ putback_lru_pages(struct zone *zone, str
 		add_page_to_lru_list(zone, page, lru);
 		if (is_active_lru(lru)) {
 			int file = is_file_lru(lru);
-			reclaim_stat->recent_rotated[file]++;
+			int numpages = hpage_nr_pages(page);
+			reclaim_stat->recent_rotated[file] += numpages;
 		}
 		if (!pagevec_add(&pvec, page)) {
 			spin_unlock_irq(&zone->lru_lock);
@@ -1535,7 +1536,7 @@ static void shrink_active_list(unsigned 
 		}
 
 		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
-			nr_rotated++;
+			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
 			 * give them one more trip around the active list. So
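
In numbers (again assuming 2MB THP on 4K base pages): a referenced
transparent hugepage now adds HPAGE_PMD_NR = 512 to nr_rotated and
recent_rotated instead of 1, matching the 512 base pages it occupies on the
LRU, so the rotated/scanned ratios used to balance anon against file
scanning stay meaningful.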

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 65 of 66] transparent hugepage sysfs meminfo
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: David Rientjes <rientjes@google.com>

Add hugepage statistics to per-node sysfs meminfo

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/drivers/base/node.c b/drivers/base/node.c
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -117,12 +117,21 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d WritebackTmp:   %8lu kB\n"
 		       "Node %d Slab:           %8lu kB\n"
 		       "Node %d SReclaimable:   %8lu kB\n"
-		       "Node %d SUnreclaim:     %8lu kB\n",
+		       "Node %d SUnreclaim:     %8lu kB\n"
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		       "Node %d AnonHugePages:  %8lu kB\n"
+#endif
+			,
 		       nid, K(node_page_state(nid, NR_FILE_DIRTY)),
 		       nid, K(node_page_state(nid, NR_WRITEBACK)),
 		       nid, K(node_page_state(nid, NR_FILE_PAGES)),
 		       nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-		       nid, K(node_page_state(nid, NR_ANON_PAGES)),
+		       nid, K(node_page_state(nid, NR_ANON_PAGES)
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			+ node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR
+#endif
+		       ),
 		       nid, K(node_page_state(nid, NR_SHMEM)),
 		       nid, node_page_state(nid, NR_KERNEL_STACK) *
 				THREAD_SIZE / 1024,
@@ -133,7 +142,13 @@ static ssize_t node_read_meminfo(struct 
 		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
 				node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
 		       nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
-		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+		       nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			, nid,
+			K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+			HPAGE_PMD_NR)
+#endif
+		       );
 	n += hugetlb_report_node_meminfo(nid, buf + n);
 	return n;
 }
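
The resulting per-node output would look something like this (values
invented for illustration); note that the AnonPages line now also folds in
the anonymous hugepages:

	# cat /sys/devices/system/node/node0/meminfo
	...
	Node 0 AnonPages:        643072 kB
	Node 0 AnonHugePages:    514048 kB
	...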

^ permalink raw reply	[flat|nested] 331+ messages in thread

* [PATCH 66 of 66] add debug checks for mapcount related invariants
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-03 15:28   ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-03 15:28 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

From: Andrea Arcangeli <aarcange@redhat.com>

Add debug checks for invariants that, if broken, could cause the mapcount vs
page_mapcount debug checks in split_huge_page to trigger later.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,6 +804,7 @@ static inline int copy_pmd_range(struct 
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*src_pmd)) {
 			int err;
+			VM_BUG_ON(next-addr != HPAGE_PMD_SIZE);
 			err = copy_huge_pmd(dst_mm, src_mm,
 					    dst_pmd, src_pmd, addr, vma);
 			if (err == -ENOMEM)
@@ -1015,9 +1016,10 @@ static inline unsigned long zap_pmd_rang
 	do {
 		next = pmd_addr_end(addr, end);
 		if (pmd_trans_huge(*pmd)) {
-			if (next-addr != HPAGE_PMD_SIZE)
+			if (next-addr != HPAGE_PMD_SIZE) {
+				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
 				split_huge_page_pmd(vma->vm_mm, pmd);
-			else if (zap_huge_pmd(tlb, vma, pmd)) {
+			} else if (zap_huge_pmd(tlb, vma, pmd)) {
 				(*zap_work)--;
 				continue;
 			}
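
For reference, these checks are free on production kernels; VM_BUG_ON() is
roughly:

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond) BUG_ON(cond)
#else
#define VM_BUG_ON(cond) do { } while (0)
#endif

so the new invariant checks only cost anything when CONFIG_DEBUG_VM is
enabled.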

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-09  3:01     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  3:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

Hi

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Page migration requires rmap to be able to find all migration ptes
> created by migration. If the second rmap_walk clearing migration PTEs
> misses an entry, it is left dangling causing a BUG_ON to trigger during
> fault. For example;
> 
> [  511.201534] kernel BUG at include/linux/swapops.h:105!
> [  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
> [  511.201534] last sysfs file: /sys/block/sde/size
> [  511.201534] CPU 0
> [  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
> [  511.888526]
> [  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
> [  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

Do you mean the current Linus-tree code is broken? If so, do we need to merge
this ASAP and backport it to the stable tree?





^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 08 of 66] fix bad_page to show the real reason the page is bad
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-09  3:03     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  3:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> page_count shows the count of the head page, but the actual check is done on
> the tail page, so show what is really being checked.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> ---
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5564,7 +5564,7 @@ void dump_page(struct page *page)
>  {
>  	printk(KERN_ALERT
>  	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> -		page, page_count(page), page_mapcount(page),
> +		page, atomic_read(&page->_count), page_mapcount(page),
>  		page->mapping, page->index);
>  	dump_page_flags(page->flags);
>  }

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 29 of 66] don't alloc harder for gfp nomemalloc even if nowait
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-09  3:05     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  3:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Not worth throwing away the precious reserved free memory pool for allocations
> that can fail gracefully (either through mempool or because they're transhuge
> allocations later falling back to 4k allocations).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1941,7 +1941,12 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
>  	if (!wait) {
> -		alloc_flags |= ALLOC_HARDER;
> +		/*
> +		 * Not worth trying to allocate harder for
> +		 * __GFP_NOMEMALLOC even if it can't schedule.
> +		 */
> +		if  (!(gfp_mask & __GFP_NOMEMALLOC))
> +			alloc_flags |= ALLOC_HARDER;
>  		/*
>  		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
>  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.

I like this.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 43 of 66] don't leave orphaned swap cache after ksm merging
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-09  3:08     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  3:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> When swapcache is replaced by a ksm page don't leave orphaned swap cache.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>

This explanation suggests this is a bugfix. If so, please separate it from
the THP series and send it to mainline and -stable soon.




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-09  3:18     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  3:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Compaction is more reliable than lumpy, and lumpy makes the system unusable
> when it runs.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -274,6 +274,7 @@ unsigned long shrink_slab(unsigned long 
>  static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
>  				   bool sync)
>  {
> +#ifndef CONFIG_COMPACTION
>  	enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
>  
>  	/*
> @@ -294,11 +295,14 @@ static void set_lumpy_reclaim_mode(int p
>  		sc->lumpy_reclaim_mode = mode;
>  	else
>  		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
> +#endif
>  }

This is only my personal preference: I usually test both features, so
runtime switching would make me happy :-)
However, I don't know what you and Mel discussed and agreed on here.
If many developers prefer this approach, I won't oppose it anymore.

That said, I bet almost all distros will choose CONFIG_COMPACTION=y, and then
the lumpy code will become nearly dead code. So I'd rather just kill the dead
code. However, that is also only my preference. ;)
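
Purely to illustrate the runtime-switch preference above (this is a
sketch, not code from the series; IS_ENABLED() is a later kernel helper
assumed here, and sysctl_lumpy_reclaim is a hypothetical knob), the same
decision could be kept in C instead of the preprocessor:

static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
				   bool sync)
{
	/*
	 * Keep the lumpy code compiled in and decide at run time instead
	 * of compiling the whole body out with #ifndef CONFIG_COMPACTION.
	 * The original priority/order heuristics are omitted here.
	 */
	if (IS_ENABLED(CONFIG_COMPACTION) || !sysctl_lumpy_reclaim) {
		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
		return;
	}

	sc->lumpy_reclaim_mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
}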




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 64 of 66] scale nr_rotated to balance memory pressure
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-09  6:16     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  6:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Rik van Riel <riel@redhat.com>
> 
> Make sure we scale up nr_rotated when we encounter a referenced
> transparent huge page.  This ensures pageout scanning balance
> is not distorted when there are huge pages on the LRU.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1259,7 +1259,8 @@ putback_lru_pages(struct zone *zone, str
>  		add_page_to_lru_list(zone, page, lru);
>  		if (is_active_lru(lru)) {
>  			int file = is_file_lru(lru);
> -			reclaim_stat->recent_rotated[file]++;
> +			int numpages = hpage_nr_pages(page);
> +			reclaim_stat->recent_rotated[file] += numpages;
>  		}
>  		if (!pagevec_add(&pvec, page)) {
>  			spin_unlock_irq(&zone->lru_lock);

I haven't looked at this patch series carefully yet, so this may be a
dumb question: why don't we need to change ->recent_scanned[] too?
(See the sketch after the quoted hunk below.)



> @@ -1535,7 +1536,7 @@ static void shrink_active_list(unsigned 
>  		}
>  
>  		if (page_referenced(page, 0, sc->mem_cgroup, &vm_flags)) {
> -			nr_rotated++;
> +			nr_rotated += hpage_nr_pages(page);
>  			/*
>  			 * Identify referenced, file-backed active pages and
>  			 * give them one more trip around the active list. So
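
For what it's worth, a sketch of the analogous change being asked about
(hypothetical, not part of the posted series), placed wherever the
scanned pages are accounted:

	/*
	 * Hypothetical: account a transparent huge page as 512 base pages
	 * when it is scanned, mirroring the recent_rotated scaling above.
	 */
	reclaim_stat->recent_scanned[file] += hpage_nr_pages(page);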




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-09  6:20     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09  6:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> With transparent hugepage support we need compaction for the "defrag" sysfs
> controls to be effective.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -305,6 +305,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
>  config TRANSPARENT_HUGEPAGE
>  	bool "Transparent Hugepage Support"
>  	depends on X86 && MMU
> +	select COMPACTION
>  	help
>  	  Transparent Hugepages allows the kernel to use huge pages and
>  	  huge tlb transparently to the applications whenever possible.

I dislike this. THP and compaction are completely orthogonal; I think
you are really just stating your performance recommendation. I dislike
Kconfig 'select' hell and I hope every developer tries to avoid it as
far as possible.

Thanks.




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-09 10:27     ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-09 10:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> This takes advantage of memory compaction to properly generate pages of order >
> 0 if regular page reclaim fails and priority level becomes more severe and we
> don't reach the proper watermarks.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

First, I don't think this patch is really about GFP_ATOMIC, so the
patch title is a bit misleading.

Second, this patch contains two changes: 1) remove the PAGE_ALLOC_COSTLY_ORDER
threshold, and 2) implement background compaction. Please separate them.

Third, this patch scans a lot of pages in PFN order and churns the LRU
aggressively. I'm not sure this aggressive LRU shuffling is safe and
effective. I hope you can provide some demonstration and/or benchmark
results.

Thanks.




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-09  6:20     ` KOSAKI Motohiro
@ 2010-11-09 21:11       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-09 21:11 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 03:20:33PM +0900, KOSAKI Motohiro wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > With transparent hugepage support we need compaction for the "defrag" sysfs
> > controls to be effective.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> > 
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -305,6 +305,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
> >  config TRANSPARENT_HUGEPAGE
> >  	bool "Transparent Hugepage Support"
> >  	depends on X86 && MMU
> > +	select COMPACTION
> >  	help
> >  	  Transparent Hugepages allows the kernel to use huge pages and
> >  	  huge tlb transparently to the applications whenever possible.
> 
> I dislike this. THP and compaction are completely orthogonal; I think
> you are really just stating your performance recommendation. I dislike
> Kconfig 'select' hell and I hope every developer tries to avoid it as
> far as possible.

At the moment THP hangs the system if COMPACTION isn't selected
(please try it yourself if you don't believe me), because without
COMPACTION lumpy reclaim wouldn't be entirely disabled. So at the
moment they are not orthogonal. When lumpy is eventually removed from
the VM (as I've tried multiple times to achieve) I can in theory drop
the select COMPACTION, but then 99% of THP users who disable compaction
would still be making a mistake, even if that mistake would no longer
result in a fatal runtime hang but only in slightly degraded
performance.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-09  3:18     ` KOSAKI Motohiro
@ 2010-11-09 21:30       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-09 21:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 12:18:49PM +0900, KOSAKI Motohiro wrote:
> This is only my personal preference: I usually test both features, so
> runtime switching would make me happy :-)
> However, I don't know what you and Mel discussed and agreed on here.
> If many developers prefer this approach, I won't oppose it anymore.

Mel seems to still prefer that I allow lumpy for hugetlbfs, with a
__GFP_LUMPY flag specified only by hugetlbfs. But he measured that
compaction is more reliable than lumpy at creating hugepages, so he
seems to be ok with this too.
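
Just to sketch the __GFP_LUMPY idea mentioned above (the flag never
existed in mainline; its name, bit value and use here are hypothetical),
the gate in set_lumpy_reclaim_mode() could become caller opt-in:

/* hypothetical gfp bit for callers (i.e. hugetlbfs) that still want
 * lumpy reclaim even when compaction is built in */
#define __GFP_LUMPY	((__force gfp_t)0x2000000u)

	/* in set_lumpy_reclaim_mode(): only opted-in callers get lumpy */
	if (!(sc->gfp_mask & __GFP_LUMPY)) {
		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
		return;
	}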

> That said, I bet almost all distros will choose CONFIG_COMPACTION=y, and then
> the lumpy code will become nearly dead code. So I'd rather just kill the dead
> code. However, that is also only my preference. ;)

Killing dead code is my preference too, indeed. But it's also fine with
me to delete it only later. In short, this is the least intrusive
modification I could make to the VM that does not hang the system when
THP is selected, given that lumpy requires ignoring all pte young bits
for more than 50% of page reclaim invocations.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-09  3:01     ` KOSAKI Motohiro
@ 2010-11-09 21:36       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-09 21:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 12:01:55PM +0900, KOSAKI Motohiro wrote:
> Hi
> 
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Page migration requires rmap to be able to find all migration ptes
> > created by migration. If the second rmap_walk clearing migration PTEs
> > misses an entry, it is left dangling causing a BUG_ON to trigger during
> > fault. For example;
> > 
> > [  511.201534] kernel BUG at include/linux/swapops.h:105!
> > [  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
> > [  511.201534] last sysfs file: /sys/block/sde/size
> > [  511.201534] CPU 0
> > [  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
> > [  511.888526]
> > [  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
> > [  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
> 
> Do you mean the current Linus-tree code is broken? Do we need to merge this
> ASAP and backport it to the stable tree?

No, current upstream is not broken here, but this patch is required with
THP unless I add the block below to the huge_memory.c rmap walks too. I
am afraid of forgetting special cases like the one below in the long
term, so the fewer special/magic cases there are in the kernel to
remember forever, the better for me: it makes it less likely that I
introduce hard-to-reproduce bugs later on when I add more rmap walks.

-
-               /*
-                * During exec, a temporary VMA is setup and later moved.
-                * The VMA is moved under the anon_vma lock but not the
-                * page tables leading to a race where migration cannot
-                * find the migration ptes. Rather than increasing the
-                * locking requirements of exec(), migration skips
-                * temporary VMAs until after exec() completes.
-                */
-               if (PAGE_MIGRATION && (flags & TTU_MIGRATION) &&
-                               is_vma_temporary_stack(vma))
-                       continue;
-

Now, if Linus tells me to drop this patch, I will add the above block
to huge_memory.c, and then I can drop this 2/66 patch immediately (it
assumes the user stack is never born as a hugepage and that the
randomizing mremap during execve never happens to use perfectly
2M-aligned offsets, otherwise it wouldn't work anymore; but for now I
can safely remove this patch and add the above block to huge_memory.c).

I see the ugliness in exec.c, but I think we're less likely to have
bugs in the kernel in the future without the above magic in every rmap
walk that needs to be accurate (like huge_memory.c and migrate.c); the
rmap walks are complicated enough already for my taste. But I'm totally
fine either way, as I think I can remember it.
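
For reference, a rough sketch (hypothetical, not posted code) of the
alternative described above, i.e. dropping the same guard into a THP
rmap walk in huge_memory.c:

	struct anon_vma_chain *avc;

	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
		struct vm_area_struct *vma = avc->vma;

		/*
		 * Skip the temporary stack set up by exec(), exactly as
		 * the removed rmap.c block above does, so the walk never
		 * misses a pte while the stack VMA is being moved.
		 */
		if (is_vma_temporary_stack(vma))
			continue;

		/* ... per-VMA work on the huge pmd would go here ... */
	}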

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-09 21:30       ` Andrea Arcangeli
@ 2010-11-09 21:38         ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-09 21:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 10:30:49PM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 09, 2010 at 12:18:49PM +0900, KOSAKI Motohiro wrote:
> > This is only my personal preference: I usually test both features, so
> > runtime switching would make me happy :-)
> > However, I don't know what you and Mel discussed and agreed on here.
> > If many developers prefer this approach, I won't oppose it anymore.
> 
> Mel seems to still prefer that I allow lumpy for hugetlbfs, with a
> __GFP_LUMPY flag specified only by hugetlbfs. But he measured that
> compaction is more reliable than lumpy at creating hugepages, so he
> seems to be ok with this too.
> 

Specifically, I measured that lumpy in combination with compaction is
more reliable and lower latency but that's not the same as deleting it.

That said, lumpy does hurt the system a lot.  I'm prototyping a series at the
moment that pushes lumpy reclaim to the side and for the majority of cases
replaces it with "lumpy compaction". I'm hoping this will be sufficient for
THP and alleviate the need to delete it entirely - at least until we are 100%
sure that compaction can replace it in all cases.

Unfortunately, in the process of testing it today I also found out that
2.6.37-rc1 had regressed severely in terms of huge page allocations so I'm
side-tracked trying to chase that down. My initial theories for the regression
have shown up nothing so I'm currently preparing to do a bisection. This
will take a long time though because the test is very slow :(

I can still post the series as an RFC if you like to show what direction
I'm thinking of but at the moment, I'm unable to test it until I pin the
regression down.

> > That said, I bet almost all distros will choose CONFIG_COMPACTION=y, and then
> > the lumpy code will become nearly dead code. So I'd rather just kill the dead
> > code. However, that is also only my preference. ;)
> 
> Killing dead code is my preference too, indeed. But it's also fine with
> me to delete it only later. In short, this is the least intrusive
> modification I could make to the VM that does not hang the system when
> THP is selected, given that lumpy requires ignoring all pte young bits
> for more than 50% of page reclaim invocations.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 43 of 66] don't leave orphaned swap cache after ksm merging
  2010-11-09  3:08     ` KOSAKI Motohiro
@ 2010-11-09 21:40       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-09 21:40 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 12:08:25PM +0900, KOSAKI Motohiro wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > When swapcache is replaced by a ksm page don't leave orphaned swap cache.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> 
> > This explanation suggests this is a bug fix. If so, please separate it
> > from THP and send it to mainline and -stable soon.

Right. I'm uncertain whether this is bad enough to require -stable; if
it were more urgent I would have already submitted it separately, but
it's true that it's not THP specific.

It's only fatal for cloud computing, where the management software has
to decide whether to migrate more VMs to one node, but it won't do so
if it sees tons of swap in use: it will think there isn't enough margin
for KSM cows until a VM is migrated back to another node.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0
  2010-11-09 10:27     ` KOSAKI Motohiro
@ 2010-11-09 21:49       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-09 21:49 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 07:27:37PM +0900, KOSAKI Motohiro wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > This takes advantage of memory compaction to properly generate pages of order >
> > 0 if regular page reclaim fails and priority level becomes more severe and we
> > don't reach the proper watermarks.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> First, I don't think this patch is really about GFP_ATOMIC, so the
> patch title is a bit misleading.
> 
> Second, this patch contains two changes: 1) remove the PAGE_ALLOC_COSTLY_ORDER
> threshold, and 2) implement background compaction. Please separate them.

Well, the subject isn't entirely misleading: background compaction in
kswapd is only for GFP_ATOMIC, so GFP_ATOMIC order >0 allocations are
definitely related to this patch.

I then ended up allowing compaction for all allocation orders, as it
doesn't make sense to fail order 2 for the kernel stack while
succeeding at order 9. But it's true I can split that off; I will split
it for #33. Thanks for pushing me to clean this up.

> Third, this patch scans a lot of pages in PFN order and churns the LRU
> aggressively. I'm not sure this aggressive LRU shuffling is safe and
> effective. I hope you can provide some demonstration and/or benchmark
> results.

The patch will increase the amount of compaction for GFP_ATOMIC order
>0, but it won't alter the amount of free pages in the system; it will
just satisfy the order-dependent watermarks that are currently ignored.
If the caller asked for GFP_ATOMIC order > 0, this is what it asked
for, and it's up to the caller not to ask for it if it's not
worthwhile. If the caller doesn't want this and just wants to poll the
LRU, it should use GFP_ATOMIC|__GFP_NO_KSWAPD.
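
A minimal illustration of the distinction drawn here (order 2 matches
the jumbo-frame case below; __GFP_NO_KSWAPD is the flag this series
uses to avoid waking kswapd):

	struct page *page;

	/* order-2 atomic allocation: may wake kswapd, which with this
	 * patch can also run background compaction to restore the
	 * higher-order watermarks */
	page = alloc_pages(GFP_ATOMIC, 2);

	/* same allocation, but the caller only wants whatever is
	 * already free and never wants kswapd (or compaction) woken */
	page = alloc_pages(GFP_ATOMIC | __GFP_NO_KSWAPD, 2);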

I don't have benchmark results at the moment, but this has been tested
with tg3 using jumbo packets, which trigger order-2 allocations, and no
degradation was noticed. To be fair, it didn't significantly improve
the number of order-2 (9046-byte skb) allocations satisfied from irq
context either, but I thought it was a good idea to keep it in case
there are less aggressive/frequent users doing similar things.

Overall, the more important part of the patch is point 2), which I can
make cleaner by splitting it off as you noticed, and I will do that.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-09 21:38         ` Mel Gorman
@ 2010-11-09 22:22           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-09 22:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

Hi Mel,

On Tue, Nov 09, 2010 at 09:38:55PM +0000, Mel Gorman wrote:
> Specifically, I measured that lumpy in combination with compaction is
> more reliable and lower latency but that's not the same as deleting it.

Thanks for the clarification. No doubt using both could only increase
the success rate. So with hugetlbfs you may want to run both, but with
THP we want to stop at compaction. That would then require a
__GFP_LUMPY flag if we want hugetlbfs to fall back to lumpy whenever
compaction isn't successful. With THP we can't just nuke pages and
ignore the pte young bits when compaction fails; retrying later in
khugepaged, once every 10 seconds, is a lot better.

> That said, lumpy does hurt the system a lot.  I'm prototyping a series at the
> moment that pushes lumpy reclaim to the side and for the majority of cases
> replaces it with "lumpy compaction". I'm hoping this will be sufficient for
> THP and alleviate the need to delete it entirely - at least until we are 100%
> sure that compaction can replace it in all cases.
> 
> Unfortunately, in the process of testing it today I also found out that
> 2.6.37-rc1 had regressed severely in terms of huge page allocations so I'm
> side-tracked trying to chase that down. My initial theories for the regression
> have shown up nothing so I'm currently preparing to do a bisection. This
> will take a long time though because the test is very slow :(

On my side (unrelated) I also found that 37-rc1 broke my mic by
changing the soundcard type (luckily csipsimple and skype on my
cellphone now work better than the laptop for voip calls, so it was
easy to work around), and that my backlight goes blank forever after an
"xset dpms force standby" (so I'm stuck in presentation mode to work
around it; suspend to RAM also avoided having to reboot, as the BIOS
restarts the backlight during boot).

> I can still post the series as an RFC if you like to show what direction
> I'm thinking of but at the moment, I'm unable to test it until I pin the
> regression down.

Sure, feel free to post it if it's already worth testing; I can keep
it at the end of the patchset, considering it's new code while what I
posted has had lots of testing.

With THP we have khugepaged in the background, so nothing is mandatory
at allocation time. I don't want something super aggressive at
allocation time, and lumpy, by ignoring all young bits, is too
aggressive and generates swap storms for every single allocation. We
need to fail order-9 allocations quickly even if compaction fails (for
example when more than 90% of the ram is asked for in hugepages, so we
would have to use ram in the unmovable pageblocks selected by
anti-frag) to avoid hanging the system during allocations. Looking at
my stats, things seem to be working ok with compaction in 37-rc1, so
maybe it's just the lumpy changes that introduced your regression?

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 43 of 66] don't leave orphaned swap cache after ksm merging
  2010-11-09 21:40       ` Andrea Arcangeli
@ 2010-11-10  7:49         ` Hugh Dickins
  -1 siblings, 0 replies; 331+ messages in thread
From: Hugh Dickins @ 2010-11-10  7:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, 9 Nov 2010, Andrea Arcangeli wrote:
> On Tue, Nov 09, 2010 at 12:08:25PM +0900, KOSAKI Motohiro wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > When swapcache is replaced by a ksm page don't leave orphaned swap cache.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > Reviewed-by: Rik van Riel <riel@redhat.com>
> > 
> > This explanation suggests this is a bug fix. If so, please separate it
> > from THP and send it to mainline and -stable soon.
> 
> Right. I'm uncertain whether this is bad enough to require -stable; if
> it were more urgent I would have already submitted it separately, but
> it's true that it's not THP specific.

Yes, we discussed this a few months ago: it's a welcome catch, but not
very serious, since it's normal for some pages to evade swap freeing,
then eventually memory pressure sorts it all out in __remove_mapping().

We did ask you back then to send in a fix separate from THP, but both
sides then forgot about it until recently.

We didn't agree on what the fix should look like.  You're keen to change
the page locking there, I didn't make a persuasive case for keeping it
as is, yet I can see no point whatever in changing it for this swap fix.
Could I persuade you to approve this simpler alternative?


[PATCH] ksm: free swap when swapcache page is replaced

When a swapcache page is replaced by a ksm page, it's best to free that
swap immediately.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
---

 mm/ksm.c |    2 ++
 1 file changed, 2 insertions(+)

--- 2.6.37-rc1/mm/ksm.c	2010-10-20 13:30:22.000000000 -0700
+++ linux/mm/ksm.c	2010-11-09 23:01:24.000000000 -0800
@@ -800,6 +800,8 @@ static int replace_page(struct vm_area_s
 	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
 
 	page_remove_rmap(page);
+	if (!page_mapped(page))
+		try_to_free_swap(page);
 	put_page(page);
 
 	pte_unmap_unlock(ptep, ptl);

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-09 22:22           ` Andrea Arcangeli
@ 2010-11-10 14:27             ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-10 14:27 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 11:22:40PM +0100, Andrea Arcangeli wrote:
> Hi Mel,
> 
> On Tue, Nov 09, 2010 at 09:38:55PM +0000, Mel Gorman wrote:
> > Specifically, I measured that lumpy in combination with compaction is
> > more reliable and lower latency but that's not the same as deleting it.
> 
> Thanks for the clarification. Well no doubt that using both could only
> increase the success rate. So the thing with hugetlbfs you may want to
> run both, but with THP we want to stop at compaction.

Agreed. Any performance increase from THP is not likely to offset the
cost of lumpy reclaim.

> So this would
> then require a __GFP_LUMPY if we want hugetlbfs to fallback on lumpy
> whenever compaction isn't successful. We can't just nuke and ignore
> young bits in pte if compaction fails. Trying later in khugepaged once
> every 10 seconds is a lot better.
> 

Again agreed, I have no problem with lumpy reclaim being pushed aside.
I'm just less keen on it being disabled altogether. I have high hopes
that the series I'm working on can be extended slightly to suit the
needs of THP.

> > That said, lumpy does hurt the system a lot.  I'm prototyping a series at the
> > moment that pushes lumpy reclaim to the side and for the majority of cases
> > replaces it with "lumpy compaction". I'd hoping this will be sufficient for
> > THP and alleviate the need to delete it entirely - at least until we are 100%
> > sure that compaction can replace it in all cases.
> > 
> > Unfortunately, in the process of testing it today I also found out that
> > 2.6.37-rc1 had regressed severely in terms of huge page allocations so I'm
> > side-tracked trying to chase that down. My initial theories for the regression
> > have shown up nothing so I'm currently preparing to do a bisection. This
> > will take a long time though because the test is very slow :(
> 
> On my side (unrelated) I also found 37-rc1 broke my mic by changing
> soundcard type (luckily csipsimple and skype on my cellphone are now
> working better than laptop for making voip calls so it was easy to
> workaround) and my backlight goes blank forever after a "xset dpms
> force standby" (so I'm stuck in presentation mode to workaround it,
> suspend to ram was successful to avoid having to reboot too as the
> bios restarts the backlight during boot).
> 

I do not believe they are related. Fortunately, I did not have to do a
full bisect but I know roughly what area the problem must be in. The
problem commit looks like d065bd81. I'm running further tests with it
reverted to see if it's true but it'll take a few hours to complete.

> > I can still post the series as an RFC if you like to show what direction
> > I'm thinking of but at the moment, I'm unable to test it until I pin the
> > regression down.
> 
> Sure feel free to post it, if it's already worth testing it, I can
> keep at the end of the patchset considering it's new code while what I
> posted had lots of testing.
> 

As I hopefully have pinned down the problem commit, I'm going to hold
off for another day to see whether I can get real data.

> With THP we have khugepaged in the background, nothing is mandatory at
> allocation time. I don't want a super aggressive thing at allocation
> time, and lumpy by ignoring all young bits is too aggressive and
> generates swap storms for every single allocation. We need to fail
> order 9 allocation quick even if compaction fails (like if more than
> 90% of the ram is asked in hugepages so having to use ram in the
> unmovable page blocks selected by anti-frag) to avoid hanging the
> system during allocations. Looking at my stats, things seem to be working
> OK with compaction in 37-rc1, so maybe it's just the lumpy changes
> that introduced your regression?
> 

Nah, the first thing I did was eliminate being "my fault" :). It would
have surprised me because the patches in isolation worked fine. I
thought the inode changes might have had something to do with it, so I
was chasing blind alleys for a while. Hopefully d065bd81 will prove to
be the real problem.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-10 14:27             ` Mel Gorman
@ 2010-11-10 16:03               ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-10 16:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 10, 2010 at 02:27:04PM +0000, Mel Gorman wrote:
> Agreed. Any performance increase from THP is not likely to offset the
> cost of lumpy reclaim.

Exactly. Furthermore, the improvement will still happen later by
polling compaction once every 10 sec with khugepaged (this is also
required in case some other guest or application quits and releases
tons of ram, maybe natively order 9 in the buddy, without requiring any
further compaction invocation).

What the default should be I don't know, but I like a default that
fails without causing swap storms. If you want the swap storms and to
drop all ptes regardless of their young bits, you should ask for it
explicitly, I think. Anybody asking for a high order allocation and
expecting it to succeed even when anti-frag plus compaction of the
movable pageblocks aren't enough should be able to handle a graceful
failure like THP does by design (or, worst case, return an error to
userland). As far as I can tell the tg3 atomic order 2 allocation also
provides a graceful fallback for the same reason (however, in new
mainline it floods dmesg with tons of printk, which it didn't with
older kernels, but that's not an actual regression).
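
As a rough sketch of the kind of graceful fallback meant here (the gfp
flags and the order below are purely illustrative, not taken from any
particular caller):

	struct page *page;

	/* opportunistic order-9 attempt: no hard retries, no warnings */
	page = alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NOWARN |
			   __GFP_NORETRY, 9);
	if (!page)
		/* graceful fallback: keep going with regular 4k pages */
		page = alloc_page(GFP_KERNEL);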

> Again agreed, I have no problem with lumpy reclaim being pushed aside.
> I'm just less keen on it being disabled altogether. I have high hopes
> for the series I'm working on that it can be extended slightly to suit
> the needs of THP.

Great. Well this is also why I disabled it with the smallest possible
modification, to avoid stepping on your toes.

> Nah, the first thing I did was eliminate being "my fault" :). It would
> have surprised me because the patches in isolation worked fine. I
> thought the inode changes might have had something to do with it so I
> was chasing blind alleys for a while. Hopefully d065bd81 will prove to
> be the real problem.

Well, I wasn't sure if you had already tested it on that very
workload; the patches weren't from you (even if you were in the
signoffs). I mentioned it just in case; glad it's not related :).

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 43 of 66] don't leave orphaned swap cache after ksm merging
  2010-11-10  7:49         ` Hugh Dickins
@ 2010-11-10 16:08           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-10 16:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 11:49:30PM -0800, Hugh Dickins wrote:
> We did ask you back then to send in a fix separate from THP, but both
> sides then forgot about it until recently.

Correct :).

> We didn't agree on what the fix should look like.  You're keen to change
> the page locking there, I didn't make a persuasive case for keeping it
> as is, yet I can see no point whatever in changing it for this swap fix.
> Could I persuade you to approve this simpler alternative?

Sure, your version will work fine too. I insisted on removing the page
lock around replace_page because I didn't see the point of it and I
like strict code, but keeping it can do no harm.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-09 21:11       ` Andrea Arcangeli
@ 2010-11-14  5:07         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 331+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  5:07 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: kosaki.motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	Balbir Singh, Michael S. Tsirkin, Peter Zijlstra,
	Johannes Weiner, Daisuke Nishimura, Chris Mason, Borislav Petkov

> On Tue, Nov 09, 2010 at 03:20:33PM +0900, KOSAKI Motohiro wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > With transparent hugepage support we need compaction for the "defrag" sysfs
> > > controls to be effective.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > ---
> > > 
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -305,6 +305,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
> > >  config TRANSPARENT_HUGEPAGE
> > >  	bool "Transparent Hugepage Support"
> > >  	depends on X86 && MMU
> > > +	select COMPACTION
> > >  	help
> > >  	  Transparent Hugepages allows the kernel to use huge pages and
> > >  	  huge tlb transparently to the applications whenever possible.
> > 
> > I dislike this. THP and compaction are completely orthogonal. I think 
> > you are talking only your performance recommendation. I mean I dislike
> > Kconfig 'select' hell and I hope every developers try to avoid it as 
> > far as possible.
> 
> At the moment THP hangs the system if COMPACTION isn't selected
> (please try it yourself if you don't believe me), as without COMPACTION
> lumpy reclaim wouldn't be entirely disabled. So at the moment they're not
> orthogonal. When lumpy is removed from the VM (as I've tried
> multiple times to achieve) I can remove the select COMPACTION in
> theory, but then 99% of THP users would still be making the mistake of
> disabling compaction, even if the mistake would no longer result in a
> fatal runtime failure but just slightly degraded performance.

OK, I believe you.
But please add this reason to the description.
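
For example, the help text could spell the reason out along these lines
(only a sketch: the existing hunk plus one extra sentence reflecting the
explanation above):

	config TRANSPARENT_HUGEPAGE
		bool "Transparent Hugepage Support"
		depends on X86 && MMU
		select COMPACTION
		help
		  Transparent Hugepages allows the kernel to use huge pages and
		  huge tlb transparently to the applications whenever possible.
		  COMPACTION is selected because, without it, huge page
		  allocations fall back to lumpy reclaim, which can make the
		  system unusable under memory pressure.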




^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-14  5:07         ` KOSAKI Motohiro
@ 2010-11-15 15:13           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-15 15:13 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Sun, Nov 14, 2010 at 02:07:07PM +0900, KOSAKI Motohiro wrote:
> OK, I believe you.
> But please add this reason to the description.

Good idea, I will. BTW, in general I agree with your remark from a
theoretical standpoint.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 01 of 66] disable lumpy when compaction is enabled
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18  8:30     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18  8:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:36PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Compaction is more reliable than lumpy, and lumpy makes the system unusable
> when it runs.
> 

It took me a while but is "[PATCH 0/8] Use memory compaction instead of
lumpy reclaim during high-order allocations" a suitable replacement for
this patch?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 11:13     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 11:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:37PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Page migration requires rmap to be able to find all migration ptes
> created by migration. If the second rmap_walk clearing migration PTEs
> misses an entry, it is left dangling causing a BUG_ON to trigger during
> fault. For example;
> 
> [  511.201534] kernel BUG at include/linux/swapops.h:105!
> [  511.201534] invalid opcode: 0000 [#1] PREEMPT SMP
> [  511.201534] last sysfs file: /sys/block/sde/size
> [  511.201534] CPU 0
> [  511.201534] Modules linked in: kvm_amd kvm dm_crypt loop i2c_piix4 serio_raw tpm_tis shpchp evdev tpm i2c_core pci_hotplug tpm_bios wmi processor button ext3 jbd mbcache dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 raid0 multipath linear md_mod sg sr_mod cdrom sd_mod ata_generic ahci libahci libata ide_pci_generic ehci_hcd ide_core r8169 mii ohci_hcd scsi_mod floppy thermal fan thermal_sys
> [  511.888526]
> [  511.888526] Pid: 20431, comm: date Not tainted 2.6.34-rc4-mm1-fix-swapops #6 GA-MA790GP-UD4H/GA-MA790GP-UD4H
> [  511.888526] RIP: 0010:[<ffffffff811094ff>]  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
> [  512.173545] RSP: 0018:ffff880037b979d8  EFLAGS: 00010246
> [  512.198503] RAX: ffffea0000000000 RBX: ffffea0001a2ba10 RCX: 0000000000029830
> [  512.329617] RDX: 0000000001a2ba10 RSI: ffffffff818264b8 RDI: 000000000ef45c3e
> [  512.380001] RBP: ffff880037b97a08 R08: ffff880078003f00 R09: ffff880037b979e8
> [  512.380001] R10: ffffffff8114ddaa R11: 0000000000000246 R12: 0000000037304000
> [  512.380001] R13: ffff88007a9ed5c8 R14: f800000000077a2e R15: 000000000ef45c3e
> [  512.380001] FS:  00007f3d346866e0(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
> [  512.380001] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  512.380001] CR2: 00007fff6abec9c1 CR3: 0000000037a15000 CR4: 00000000000006f0
> [  512.380001] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  513.004775] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  513.068415] Process date (pid: 20431, threadinfo ffff880037b96000, task ffff880078003f00)
> [  513.068415] Stack:
> [  513.068415]  ffff880037b97b98 ffff880037b97a18 ffff880037b97be8 0000000000000c00
> [  513.228068] <0> ffff880037304f60 00007fff6abec9c1 ffff880037b97aa8 ffffffff810e951a
> [  513.228068] <0> ffff880037b97a88 0000000000000246 0000000000000000 ffffffff8130c5c2
> [  513.228068] Call Trace:
> [  513.228068]  [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
> [  513.228068]  [<ffffffff8130c5c2>] ? do_page_fault+0x26a/0x46e
> [  513.228068]  [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
> [  513.720755]  [<ffffffff8130875d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [  513.789278]  [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
> [  513.851506]  [<ffffffff813099b5>] page_fault+0x25/0x30
> [  513.851506]  [<ffffffff8114ddaa>] ? load_elf_binary+0x14a1/0x192b
> [  513.851506]  [<ffffffff811c1e27>] ? strnlen_user+0x3f/0x57
> [  513.851506]  [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
> [  513.851506]  [<ffffffff8111329b>] search_binary_handler+0x173/0x313
> [  513.851506]  [<ffffffff8114c909>] ? load_elf_binary+0x0/0x192b
> [  513.851506]  [<ffffffff81114896>] do_execve+0x219/0x30a
> [  513.851506]  [<ffffffff8111887f>] ? getname+0x14d/0x1b3
> [  513.851506]  [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
> [  514.483501]  [<ffffffff8100320a>] stub_execve+0x6a/0xc0
> [  514.548357] Code: 74 05 83 f8 1f 75 68 48 b8 ff ff ff ff ff ff ff 07 48 21 c2 48 b8 00 00 00 00 00 ea ff ff 48 6b d2 38 48 8d 1c 02 f6 03 01 75 04 <0f> 0b eb fe 8b 4b 08 48 8d 73 08 85 c9 74 35 8d 41 01 89 4d e0
> [  514.704292] RIP  [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129
> [  514.808221]  RSP <ffff880037b979d8>
> [  514.906179] ---[ end trace 4f88495edc224d6b ]---
> 
> This particular BUG_ON is caused by a race between shift_arg_pages and
> migration. During exec, a temporary stack is created and later moved to its
> final location. If migration selects a page within the temporary stack,
> the page tables and migration PTE can be copied to the new location
> before rmap_walk is able to find the copy. This leaves a dangling
> migration PTE behind that later triggers the bug.
> 
> This patch fixes the problem by using two VMAs - one which covers the temporary
> stack and the other which covers the new location. This guarantees that rmap
> can always find the migration PTE even if it is copied while rmap_walk is
> taking place.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

This old chestnut. IIRC, this was the more complete solution to a fix that made
it into mainline. The patch still looks reasonable. It does add a kmalloc()
but I can't remember if we decided we were OK with it or not. Can you remind
me? More importantly, it appears to be surviving the original testcase that
this bug was about (20 minutes so far, but I'll leave it running for a few
hours). Assuming the test does not crash;

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 03 of 66] transparent hugepage support documentation
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 11:41     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 11:41 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:38PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Documentation/vm/transhuge.txt
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> new file mode 100644
> --- /dev/null
> +++ b/Documentation/vm/transhuge.txt
> @@ -0,0 +1,283 @@
> += Transparent Hugepage Support =
> +
> +== Objective ==
> +
> +Performance critical computing applications dealing with large memory
> +working sets are already running on top of libhugetlbfs and in turn
> +hugetlbfs. Transparent Hugepage Support is an alternative to
> +libhugetlbfs that offers the same feature of libhugetlbfs but without
> +the shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc etc..).

libhugetlbfs can also automatically back shared memory, text and data
with huge pages, which THP cannot do. libhugetlbfs cannot demote and
promote memory, whereas THP can. They are not exactly like-with-like
comparisons. How about;

"Transparent Hugepage Support is an alternative means of backing anonymous
memory with huge pages, one that supports the automatic promotion and
demotion of page sizes."

?

> +
> +In the future it can expand over the pagecache layer starting with
> +tmpfs to reduce even further the hugetlbfs usages.
> +
> +The reason applications are running faster is because of two
> +factors. The first factor is almost completely irrelevant and it's not
> +of significant interest because it'll also have the downside of
> +requiring larger clear-page copy-page in page faults which is a
> +potentially negative effect. The first factor consists in taking a
> +single page fault for each 2M virtual region touched by userland (so
> +reducing the enter/exit kernel frequency by a 512 times factor). This
> +only matters the first time the memory is accessed for the lifetime of
> +a memory mapping. The second long lasting and much more important
> +factor will affect all subsequent accesses to the memory for the whole
> +runtime of the application. The second factor consist of two
> +components: 1) the TLB miss will run faster (especially with
> +virtualization using nested pagetables but also on bare metal without
> +virtualization)

Careful on that first statement. It's not necessarily true for bare metal
as some processors show that the TLB miss handler for huge pages is slower
than for base pages. Not sure why, but it seemed to be the case on P4 anyway
(at least on the one I have). Maybe it was a measurement error, but on chips
with split TLBs for page sizes there is no guarantee they are the same speed.

It's probably true for virtualisation though considering the vastly reduced
number of cache lines required to translate an address.

I'd weaken the language for bare metal to say "almost always" but it's
not a big deal.

> and 2) a single TLB entry will be mapping a much
> +larger amount of virtual memory in turn reducing the number of TLB
> +misses.

This on the other hand is certainly true.

> +With virtualization and nested pagetables the TLB can be
> +mapped of larger size only if both KVM and the Linux guest are using
> +hugepages but a significant speedup already happens if only one of the
> +two is using hugepages just because of the fact the TLB miss is going
> +to run faster.
> +
> +== Design ==
> +
> +- "graceful fallback": mm components which don't have transparent
> +  hugepage knownledge fall back to breaking a transparent hugepage and

%s/knownledge/knowledge/

> +  working on the regular pages and their respective regular pmd/pte
> +  mappings
> +
> +- if an hugepage allocation fails because of memory fragmentation,

s/an/a/

> +  regular pages should be gracefully allocated instead and mixed in
> +  the same vma without any failure or significant delay and generally
> +  without userland noticing
> +

why "generally"? At worst the application will see varying performance
characteristics but that applies to a lot more than THP.

> +- if some task quits and more hugepages become available (either
> +  immediately in the buddy or through the VM), guest physical memory
> +  backed by regular pages should be relocated on hugepages
> +  automatically (with khugepaged)
> +
> +- it doesn't require boot-time memory reservation and in turn it uses

neither does hugetlbfs.

> +  hugepages whenever possible (the only possible reservation here is
> +  kernelcore= to avoid unmovable pages to fragment all the memory but
> +  such a tweak is not specific to transparent hugepage support and
> +  it's a generic feature that applies to all dynamic high order
> +  allocations in the kernel)
> +
> +- this initial support only offers the feature in the anonymous memory
> +  regions but it'd be ideal to move it to tmpfs and the pagecache
> +  later
> +
> +Transparent Hugepage Support maximizes the usefulness of free memory
> +if compared to the reservation approach of hugetlbfs by allowing all
> +unused memory to be used as cache or other movable (or even unmovable
> +entities).

hugetlbfs with memory overcommit offers something similar, particularly
in combination with libhugetlbfs, which can automatically fall back to base
pages. I've run benchmarks comparing hugetlbfs using a static hugepage
pool against hugetlbfs dynamically allocating hugepages as required, with no
discernible performance difference. So this statement is not strictly accurate.

> +It doesn't require reservation to prevent hugepage
> +allocation failures to be noticeable from userland. It allows paging
> +and all other advanced VM features to be available on the
> +hugepages. It requires no modifications for applications to take
> +advantage of it.
> +
> +Applications however can be further optimized to take advantage of
> +this feature, like for example they've been optimized before to avoid
> +a flood of mmap system calls for every malloc(4k). Optimizing userland
> +is by far not mandatory and khugepaged already can take care of long
> +lived page allocations even for hugepage unaware applications that
> +deals with large amounts of memory.
> +
> +In certain cases when hugepages are enabled system wide, application
> +may end up allocating more memory resources. An application may mmap a
> +large region but only touch 1 byte of it, in that case a 2M page might
> +be allocated instead of a 4k page for no good. This is why it's
> +possible to disable hugepages system-wide and to only have them inside
> +MADV_HUGEPAGE madvise regions.
> +
> +Embedded systems should enable hugepages only inside madvise regions
> +to eliminate any risk of wasting any precious byte of memory and to
> +only run faster.
> +
> +Applications that gets a lot of benefit from hugepages and that don't
> +risk to lose memory by using hugepages, should use
> +madvise(MADV_HUGEPAGE) on their critical mmapped regions.
> +
> +== sysfs ==
> +
> +Transparent Hugepage Support can be entirely disabled (mostly for
> +debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
> +avoid the risk of consuming more memory resources) or enabled system
> +wide. This can be achieved with one of:
> +
> +echo always >/sys/kernel/mm/transparent_hugepage/enabled
> +echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> +echo never >/sys/kernel/mm/transparent_hugepage/enabled
> +
> +It's also possible to limit defrag efforts in the VM to generate
> +hugepages in case they're not immediately free to madvise regions or
> +to never try to defrag memory and simply fallback to regular pages
> +unless hugepages are immediately available.

This is the first mention of defrag but hey, it's not a paper :)

> Clearly if we spend CPU
> +time to defrag memory, we would expect to gain even more by the fact
> +we use hugepages later instead of regular pages. This isn't always
> +guaranteed, but it may be more likely in case the allocation is for a
> +MADV_HUGEPAGE region.
> +
> +echo always >/sys/kernel/mm/transparent_hugepage/defrag
> +echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
> +echo never >/sys/kernel/mm/transparent_hugepage/defrag
> +
> +khugepaged will be automatically started when
> +transparent_hugepage/enabled is set to "always" or "madvise, and it'll
> +be automatically shutdown if it's set to "never".
> +
> +khugepaged runs usually at low frequency so while one may not want to
> +invoke defrag algorithms synchronously during the page faults, it
> +should be worth invoking defrag at least in khugepaged. However it's
> +also possible to disable defrag in khugepaged:
> +
> +echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
> +echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
> +
> +You can also control how many pages khugepaged should scan at each
> +pass:
> +
> +/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> +
> +and how many milliseconds to wait in khugepaged between each pass (you
> +can se this to 0 to run khugepaged at 100% utilization of one core):

s/se/set/

> +
> +/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> +
> +and how many milliseconds to wait in khugepaged if there's an hugepage
> +allocation failure to throttle the next allocation attempt.
> +
> +/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> +
> +The khugepaged progress can be seen in the number of pages collapsed:
> +
> +/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
> +
> +for each pass:
> +
> +/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
> +
> +== Boot parameter ==
> +
> +You can change the sysfs boot time defaults of Transparent Hugepage
> +Support by passing the parameter "transparent_hugepage=always" or
> +"transparent_hugepage=madvise" or "transparent_hugepage=never"
> +(without "") to the kernel command line.
> +
> +== Need of restart ==
> +

Need of application restart?

A casual reader might otherwise interpret it as a system restart for
some godawful reason of their own.


> +The transparent_hugepage/enabled values only affect future
> +behavior. So to make them effective you need to restart any

s/behavior/behaviour/

> +application that could have been using hugepages. This also applies to
> +the regions registered in khugepaged.
> +
> +== get_user_pages and follow_page ==
> +
> +get_user_pages and follow_page if run on a hugepage, will return the
> +head or tail pages as usual (exactly as they would do on
> +hugetlbfs). Most gup users will only care about the actual physical
> +address of the page and its temporary pinning to release after the I/O
> +is complete, so they won't ever notice the fact the page is huge. But
> +if any driver is going to mangle over the page structure of the tail
> +page (like for checking page->mapping or other bits that are relevant
> +for the head page and not the tail page), it should be updated to jump
> +to check head page instead (while serializing properly against
> +split_huge_page() to avoid the head and tail pages to disappear from
> +under it, see the futex code to see an example of that, hugetlbfs also
> +needed special handling in futex code for similar reasons).
> +
> +NOTE: these aren't new constraints to the GUP API, and they match the
> +same constrains that applies to hugetlbfs too, so any driver capable
> +of handling GUP on hugetlbfs will also work fine on transparent
> +hugepage backed mappings.
> +
> +In case you can't handle compound pages if they're returned by
> +follow_page, the FOLL_SPLIT bit can be specified as parameter to
> +follow_page, so that it will split the hugepages before returning
> +them. Migration for example passes FOLL_SPLIT as parameter to
> +follow_page because it's not hugepage aware and in fact it can't work
> +at all on hugetlbfs (but it instead works fine on transparent

hugetlbfs pages can now be migrated, although this is currently only used by hwpoison.

> +hugepages thanks to FOLL_SPLIT). migration simply can't deal with
> +hugepages being returned (as it's not only checking the pfn of the
> +page and pinning it during the copy but it pretends to migrate the
> +memory in regular page sizes and with regular pte/pmd mappings).
> +
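For what it's worth, a sketch of such a FOLL_SPLIT user, modelled on what
the migration code ends up doing (follow_page() arguments as in 2.6.37; most
error handling trimmed, mmap_sem held by the caller):

    struct page *page;

    /*
     * Ask follow_page to split any transparent hugepage before handing
     * the page back, so this code only ever sees regular 4k pages.
     */
    page = follow_page(vma, address, FOLL_GET | FOLL_SPLIT);
    if (IS_ERR(page) || !page)
        return;                 /* handle the error/absence */

    /* operate on a regular, pinned 4k page here */

    put_page(page);
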
> +== Optimizing the applications ==
> +
> +To be guaranteed that the kernel will map a 2M page immediately in any
> +memory region, the mmap region has to be hugepage naturally
> +aligned. posix_memalign() can provide that guarantee.
> +
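As an aside, a minimal userland sketch of that (assuming MADV_HUGEPAGE is
visible in the headers, see the later patch in this series; 2M matches the
x86_64 hugepage size):

#include <stdlib.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
    void *buf;
    size_t len = 64 * HPAGE_SIZE;

    /* naturally aligned region, so the very first fault can map a 2M page */
    if (posix_memalign(&buf, HPAGE_SIZE, len))
        return 1;

    /* opt in even when the system-wide default is "madvise" */
    madvise(buf, len, MADV_HUGEPAGE);

    /* touching the memory now faults in hugepages when they are available */
    return 0;
}
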
> +== Hugetlbfs ==
> +
> +You can use hugetlbfs on a kernel that has transparent hugepage
> +support enabled just fine as always. No difference can be noted in
> +hugetlbfs other than there will be less overall fragmentation. All
> +usual features belonging to hugetlbfs are preserved and
> +unaffected. libhugetlbfs will also work fine as usual.
> +
> +== Graceful fallback ==
> +
> +Code walking pagetables but unaware of huge pmds can simply call
> +split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
> +pmd_offset. It's trivial to make the code transparent hugepage aware
> +by just grepping for "pmd_offset" and adding split_huge_page_pmd where
> +missing after pmd_offset returns the pmd. Thanks to the graceful
> +fallback design, with a one-liner change, you can avoid writing
> +hundreds if not thousands of lines of complex code to make your code
> +hugepage aware.
> +

It'd be nice if you could point to a specific example but by no means
mandatory.
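For what it's worth, the conversion tends to end up looking like this sketch
(names as used in this series; pte locking and error handling trimmed,
mmap_sem held):

    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;

    pgd = pgd_offset(mm, address);
    if (!pgd_present(*pgd))
        return;
    pud = pud_offset(pgd, address);
    if (!pud_present(*pud))
        return;
    pmd = pmd_offset(pud, address);
    /* the one-liner: any huge pmd is split back into regular ptes */
    split_huge_page_pmd(mm, pmd);
    if (!pmd_present(*pmd))
        return;
    pte = pte_offset_map(pmd, address);
    /* ... the existing pte-based code runs unchanged ... */
    pte_unmap(pte);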

> +If you're not walking pagetables but you run into a physical hugepage
> +but you can't handle it natively in your code, you can split it by
> +calling split_huge_page(page). This is what the Linux VM does before
> +it tries to swap out the hugepage, for example.
> +
> +== Locking in hugepage aware code ==
> +
> +We want as much code as possible hugepage aware, as calling
> +split_huge_page() or split_huge_page_pmd() has a cost.
> +
> +To make pagetable walks huge pmd aware, all you need to do is to call
> +pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
> +mmap_sem in read (or write) mode to be sure a huge pmd cannot be
> +created from under you by khugepaged (khugepaged collapse_huge_page
> +takes the mmap_sem in write mode in addition to the anon_vma lock). If
> +pmd_trans_huge returns false, you just fall back to the old code
> +paths. If instead pmd_trans_huge returns true, you have to take the
> +mm->page_table_lock and re-run pmd_trans_huge. Taking the
> +page_table_lock will prevent the huge pmd from being converted into a
> +regular pmd from under you (split_huge_page can run in parallel to the
> +pagetable walk). If the second pmd_trans_huge returns false, you
> +should just drop the page_table_lock and fall back to the old code as
> +before. Otherwise you should run pmd_trans_splitting on the pmd. In
> +case pmd_trans_splitting returns true, it means split_huge_page is
> +already in the middle of splitting the page. So if pmd_trans_splitting
> +returns true it's enough to drop the page_table_lock and call
> +wait_split_huge_page and then fall back to the old code paths. You are
> +guaranteed by the time wait_split_huge_page returns, the pmd isn't
> +huge anymore. If pmd_trans_splitting returns false, you can proceed to
> +process the huge pmd and the hugepage natively. Once finished you can
> +drop the page_table_lock.
> +
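As a sketch of that sequence (function names from this series; the exact
wait_split_huge_page() arguments are my assumption, so treat this as an
illustration rather than reference code):

static int example_walk_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
                            pmd_t *pmd)
{
    /* caller holds mmap_sem in read or write mode */
    if (pmd_trans_huge(*pmd)) {
        spin_lock(&mm->page_table_lock);
        if (pmd_trans_huge(*pmd)) {
            if (unlikely(pmd_trans_splitting(*pmd))) {
                /*
                 * split_huge_page is underway: drop the lock,
                 * wait, then use the regular pte paths below.
                 */
                spin_unlock(&mm->page_table_lock);
                wait_split_huge_page(vma->anon_vma, pmd);
            } else {
                /* safe to work on the huge pmd and hugepage natively */
                spin_unlock(&mm->page_table_lock);
                return 0;
            }
        } else {
            /* split_huge_page completed before we took the lock */
            spin_unlock(&mm->page_table_lock);
        }
    }
    /* regular 4k pte code paths */
    return 0;
}
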
> +== compound_lock, get_user_pages and put_page ==
> +
> +split_huge_page internally has to distribute the refcounts in the head
> +page to the tail pages before clearing all PG_head/tail bits from the
> +page structures. It can do that easily for refcounts taken by huge pmd
> +mappings. But the GUP API as created by hugetlbfs (that returns head
> +and tail pages if running get_user_pages on an address backed by any
> +hugepage), requires the refcount to be accounted on the tail pages and
> +not only in the head pages, if we want to be able to run
> +split_huge_page while there are gup pins established on any tail
> +page. Failure to be able to run split_huge_page if there's any gup pin
> +on any tail page, would mean having to split all hugepages upfront in
> +get_user_pages which is unacceptable as too many gup users are
> +performance critical and they must work natively on hugepages like
> +they work natively on hugetlbfs already (hugetlbfs is simpler because
> +hugetlbfs pages cannot be split, so there would be no requirement of
> +accounting the pins on the tail pages for hugetlbfs). If we didn't
> +account the gup refcounts on the tail pages during gup, we wouldn't know
> +anymore which tail page is pinned by gup and which is not while we run
> +split_huge_page. But we still have to add the gup pin to the head page
> +too, to know when we can free the compound page in case it's never
> +split during its lifetime. That requires changing not just
> +get_page, but put_page as well so that when put_page runs on a tail
> +page (and only on a tail page) it will find its respective head page,
> +and then it will decrease the head page refcount in addition to the
> +tail page refcount. To obtain a head page reliably and to decrease its
> +refcount without race conditions, put_page has to serialize against
> +__split_huge_page_refcount using a special per-page lock called
> +compound_lock.
> 
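Put another way (illustration only, not code from the series), for a driver
holding an extra pin on a tail page returned by get_user_pages:

    struct page *tail = pages[i];

    get_page(tail);   /* bumps tail->_count and tail->first_page->_count */
    /* ... */
    put_page(tail);   /* drops both again, taking the head page's
                       * compound_lock to serialize against
                       * __split_huge_page_refcount */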

Ok, I'll need to read the rest of the series to verify if this is
correct but by and large it looks good. I think some of the language is
stronger than it should be and some of the comparisons with libhugetlbfs
are a bit off but I'd be naturally defensive on that topic. Make the
suggested changes if you like but if you don't, it shouldn't affect the
series.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 04 of 66] define MADV_HUGEPAGE
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 11:44     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 11:44 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On just the subject, I've been hassled before to add information to the
subject on what is being affected. In this case, it would be just mm:
because you are not affecting any particular subsystem, but other patches might be, e.g.

mm: migration: something something

From a practical point of view, it means if you sort mmotm's series file,
you can get an approximate breakdown of how many patches affect each
subsystem. No idea if it's required or not but don't be surprised if
someone complains :)

On Wed, Nov 03, 2010 at 04:27:39PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Define MADV_HUGEPAGE.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Arnd Bergmann <arnd@arndb.de>

Otherwise;

Acked-by: Mel Gorman <mel@csn.ul.ie>

> ---
> 
> diff --git a/arch/alpha/include/asm/mman.h b/arch/alpha/include/asm/mman.h
> --- a/arch/alpha/include/asm/mman.h
> +++ b/arch/alpha/include/asm/mman.h
> @@ -53,6 +53,8 @@
>  #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
>  
> +#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/mips/include/asm/mman.h b/arch/mips/include/asm/mman.h
> --- a/arch/mips/include/asm/mman.h
> +++ b/arch/mips/include/asm/mman.h
> @@ -77,6 +77,8 @@
>  #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
>  #define MADV_HWPOISON    100		/* poison a page for testing */
>  
> +#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/arch/parisc/include/asm/mman.h b/arch/parisc/include/asm/mman.h
> --- a/arch/parisc/include/asm/mman.h
> +++ b/arch/parisc/include/asm/mman.h
> @@ -59,6 +59,8 @@
>  #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
>  
> +#define MADV_HUGEPAGE	67		/* Worth backing with hugepages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  #define MAP_VARIABLE	0
> diff --git a/arch/xtensa/include/asm/mman.h b/arch/xtensa/include/asm/mman.h
> --- a/arch/xtensa/include/asm/mman.h
> +++ b/arch/xtensa/include/asm/mman.h
> @@ -83,6 +83,8 @@
>  #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
>  
> +#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
> --- a/include/asm-generic/mman-common.h
> +++ b/include/asm-generic/mman-common.h
> @@ -45,6 +45,8 @@
>  #define MADV_MERGEABLE   12		/* KSM may merge identical pages */
>  #define MADV_UNMERGEABLE 13		/* KSM may not merge identical pages */
>  
> +#define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>  
> 
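As an aside for userland consumers: until libc headers carry this, the usual
guarded fallback works, but note the value is per-arch (14 in asm-generic
above, 67 on parisc), so the system header should be preferred whenever it
exists:

#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14        /* asm-generic value; parisc uses 67 */
#endif

    /* ... */
    madvise(addr, length, MADV_HUGEPAGE);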

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 05 of 66] compound_lock
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 11:49     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 11:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:40PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add a new compound_lock() needed to serialize put_page against
> __split_huge_page_refcount().
> 

Does it only apply to a compound page? If I pass in a PageTail, what
happens? Could do with a beefier description on why it's needed but
maybe it's obvious later.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -14,6 +14,7 @@
>  #include <linux/mm_types.h>
>  #include <linux/range.h>
>  #include <linux/pfn.h>
> +#include <linux/bit_spinlock.h>
>  
>  struct mempolicy;
>  struct anon_vma;
> @@ -302,6 +303,40 @@ static inline int is_vmalloc_or_module_a
>  }
>  #endif
>  
> +static inline void compound_lock(struct page *page)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	bit_spin_lock(PG_compound_lock, &page->flags);
> +#endif
> +}
> +
> +static inline void compound_unlock(struct page *page)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	bit_spin_unlock(PG_compound_lock, &page->flags);
> +#endif
> +}
> +
> +static inline void compound_lock_irqsave(struct page *page,
> +					 unsigned long *flagsp)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	unsigned long flags;
> +	local_irq_save(flags);
> +	compound_lock(page);
> +	*flagsp = flags;
> +#endif
> +}
> +

The pattern for spinlock irqsave passes in unsigned long, not unsigned
long *. It'd be nice if they matched.
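For comparison, a caller with the interface as posted looks like this
(lifted in spirit from the put_page changes later in the series):

    struct page *page_head = compound_head(page);
    unsigned long flags;

    compound_lock_irqsave(page_head, &flags);   /* flags via pointer */
    /* ... adjust the tail and head refcounts ... */
    compound_unlock_irqrestore(page_head, flags);

With the spin_lock_irqsave() convention the first call would read
compound_lock_irqsave(page_head, flags), which would mean turning the
helper into a macro instead of a static inline.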

> +static inline void compound_unlock_irqrestore(struct page *page,
> +					      unsigned long flags)
> +{
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	compound_unlock(page);
> +	local_irq_restore(flags);
> +#endif
> +}
> +
>  static inline struct page *compound_head(struct page *page)
>  {
>  	if (unlikely(PageTail(page)))
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -108,6 +108,9 @@ enum pageflags {
>  #ifdef CONFIG_MEMORY_FAILURE
>  	PG_hwpoison,		/* hardware poisoned page. Don't touch */
>  #endif
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	PG_compound_lock,
> +#endif
>  	__NR_PAGEFLAGS,
>  
>  	/* Filesystems */
> @@ -397,6 +400,12 @@ static inline void __ClearPageTail(struc
>  #define __PG_MLOCKED		0
>  #endif
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define __PG_COMPOUND_LOCK		(1 << PG_compound_lock)
> +#else
> +#define __PG_COMPOUND_LOCK		0
> +#endif
> +
>  /*
>   * Flags checked when a page is freed.  Pages being freed should not have
>   * these flags set.  It they are, there is a problem.
> @@ -406,7 +415,8 @@ static inline void __ClearPageTail(struc
>  	 1 << PG_private | 1 << PG_private_2 | \
>  	 1 << PG_buddy	 | 1 << PG_writeback | 1 << PG_reserved | \
>  	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
> -	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON)
> +	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> +	 __PG_COMPOUND_LOCK)
>  
>  /*
>   * Flags checked when a page is prepped for return by the page allocator.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 06 of 66] alter compound get_page/put_page
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 12:37     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 12:37 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:41PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Alter compound get_page/put_page to keep references on subpages too, in order
> to allow __split_huge_page_refcount to split an hugepage even while subpages
> have been pinned by one of the get_user_pages() variants.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
> --- a/arch/powerpc/mm/gup.c
> +++ b/arch/powerpc/mm/gup.c
> @@ -16,6 +16,16 @@
>  
>  #ifdef __HAVE_ARCH_PTE_SPECIAL
>  
> +static inline void pin_huge_page_tail(struct page *page)
> +{

Minor nit, but get_huge_page_tail?

Even though "pin" is what it does, pin isn't used elsewhere in naming.

> +	/*
> +	 * __split_huge_page_refcount() cannot run
> +	 * from under us.
> +	 */
> +	VM_BUG_ON(atomic_read(&page->_count) < 0);
> +	atomic_inc(&page->_count);
> +}
> +
>  /*
>   * The performance critical leaf functions are made noinline otherwise gcc
>   * inlines everything into a single function which results in too much
> @@ -47,6 +57,8 @@ static noinline int gup_pte_range(pmd_t 
>  			put_page(page);
>  			return 0;
>  		}
> +		if (PageTail(page))
> +			pin_huge_page_tail(page);
>  		pages[*nr] = page;
>  		(*nr)++;
>  
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -105,6 +105,16 @@ static inline void get_head_page_multipl
>  	atomic_add(nr, &page->_count);
>  }
>  
> +static inline void pin_huge_page_tail(struct page *page)
> +{
> +	/*
> +	 * __split_huge_page_refcount() cannot run
> +	 * from under us.
> +	 */
> +	VM_BUG_ON(atomic_read(&page->_count) < 0);
> +	atomic_inc(&page->_count);
> +}
> +

This is identical to the powerpc implementation. Any possibility they can be
shared?
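Something like this in a common header would cover both, going with the
name suggested above (just a sketch):

/* e.g. in include/linux/mm.h, shared by the powerpc and x86 gup_fast paths */
static inline void get_huge_page_tail(struct page *page)
{
    /*
     * __split_huge_page_refcount() cannot run from under us.
     */
    VM_BUG_ON(atomic_read(&page->_count) < 0);
    atomic_inc(&page->_count);
}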

>  static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
>  		unsigned long end, int write, struct page **pages, int *nr)
>  {
> @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
>  	do {
>  		VM_BUG_ON(compound_head(page) != head);
>  		pages[*nr] = page;
> +		if (PageTail(page))
> +			pin_huge_page_tail(page);
>  		(*nr)++;
>  		page++;
>  		refs++;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -351,9 +351,17 @@ static inline int page_count(struct page
>  
>  static inline void get_page(struct page *page)
>  {
> -	page = compound_head(page);
> -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));

Oof, this might need a comment. It's saying that getting a normal page or the
head of a compound page must already have an elevated reference count. If
we are getting a tail page, the reference count is stored in both the head
and the tail so the BUG check does not apply.
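Maybe something along these lines (only a suggested wording for that
comment):

    /*
     * A normal page or the head of a compound page must already have an
     * elevated refcount to be gotten here.  A tail page's own count can
     * legitimately be zero (its pins are accounted on the head too), so
     * for tails only insist the count is not negative.
     */
    VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));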

>  	atomic_inc(&page->_count);
> +	if (unlikely(PageTail(page))) {
> +		/*
> +		 * This is safe only because
> +		 * __split_huge_page_refcount can't run under
> +		 * get_page().
> +		 */
> +		VM_BUG_ON(atomic_read(&page->first_page->_count) <= 0);
> +		atomic_inc(&page->first_page->_count);
> +	}
>  }
>  
>  static inline struct page *virt_to_head_page(const void *x)
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -56,17 +56,83 @@ static void __page_cache_release(struct 
>  		del_page_from_lru(zone, page);
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
> +}
> +
> +static void __put_single_page(struct page *page)
> +{
> +	__page_cache_release(page);
>  	free_hot_cold_page(page, 0);
>  }
>  
> +static void __put_compound_page(struct page *page)
> +{
> +	compound_page_dtor *dtor;
> +
> +	__page_cache_release(page);
> +	dtor = get_compound_page_dtor(page);
> +	(*dtor)(page);
> +}
> +
>  static void put_compound_page(struct page *page)
>  {
> -	page = compound_head(page);
> -	if (put_page_testzero(page)) {
> -		compound_page_dtor *dtor;
> -
> -		dtor = get_compound_page_dtor(page);
> -		(*dtor)(page);
> +	if (unlikely(PageTail(page))) {
> +		/* __split_huge_page_refcount can run under us */

So what? The fact you check PageTail twice is a hint as to what is
happening and that we are depending on the order in which the head and
tail bits get cleared, but it's hard to be certain of that.

> +		struct page *page_head = page->first_page;
> +		smp_rmb();
> +		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
> +			unsigned long flags;
> +			if (unlikely(!PageHead(page_head))) {
> +				/* PageHead is cleared after PageTail */
> +				smp_rmb();
> +				VM_BUG_ON(PageTail(page));
> +				goto out_put_head;
> +			}
> +			/*
> +			 * Only run compound_lock on a valid PageHead,
> +			 * after having it pinned with
> +			 * get_page_unless_zero() above.
> +			 */
> +			smp_mb();
> +			/* page_head wasn't a dangling pointer */
> +			compound_lock_irqsave(page_head, &flags);
> +			if (unlikely(!PageTail(page))) {
> +				/* __split_huge_page_refcount run before us */
> +				compound_unlock_irqrestore(page_head, flags);
> +				VM_BUG_ON(PageHead(page_head));
> +			out_put_head:
> +				if (put_page_testzero(page_head))
> +					__put_single_page(page_head);
> +			out_put_single:
> +				if (put_page_testzero(page))
> +					__put_single_page(page);
> +				return;
> +			}
> +			VM_BUG_ON(page_head != page->first_page);
> +			/*
> +			 * We can release the refcount taken by
> +			 * get_page_unless_zero now that
> +			 * split_huge_page_refcount is blocked on the
> +			 * compound_lock.
> +			 */
> +			if (put_page_testzero(page_head))
> +				VM_BUG_ON(1);
> +			/* __split_huge_page_refcount will wait now */
> +			VM_BUG_ON(atomic_read(&page->_count) <= 0);
> +			atomic_dec(&page->_count);
> +			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> +			compound_unlock_irqrestore(page_head, flags);
> +			if (put_page_testzero(page_head))
> +				__put_compound_page(page_head);
> +		} else {
> +			/* page_head is a dangling pointer */
> +			VM_BUG_ON(PageTail(page));
> +			goto out_put_single;
> +		}
> +	} else if (put_page_testzero(page)) {
> +		if (PageHead(page))
> +			__put_compound_page(page);
> +		else
> +			__put_single_page(page);
>  	}
>  }
>  
> @@ -75,7 +141,7 @@ void put_page(struct page *page)
>  	if (unlikely(PageCompound(page)))
>  		put_compound_page(page);
>  	else if (put_page_testzero(page))
> -		__page_cache_release(page);
> +		__put_single_page(page);
>  }
>  EXPORT_SYMBOL(put_page);
>  

Functionally, I don't see a problem so

Acked-by: Mel Gorman <mel@csn.ul.ie>

but some expansion on the leader and the comment, even if done as a
follow-on patch, would be nice.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 08 of 66] fix bad_page to show the real reason the page is bad
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 12:40     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 12:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:43PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> page_count shows the count of the head page, but the actual check is done on
> the tail page, so show what is really being checked.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>

This can be sent on its own as it's not really related to THP.

> ---
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5564,7 +5564,7 @@ void dump_page(struct page *page)
>  {
>  	printk(KERN_ALERT
>  	       "page:%p count:%d mapcount:%d mapping:%p index:%#lx\n",
> -		page, page_count(page), page_mapcount(page),
> +		page, atomic_read(&page->_count), page_mapcount(page),
>  		page->mapping, page->index);
>  	dump_page_flags(page->flags);
>  }
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 16 of 66] special pmd_trans_* functions
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 12:51     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 12:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:51PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> These return 0 at compile time when the config option is disabled, to allow
> gcc to eliminate the transparent hugepage function calls at compile time
> without additional #ifdefs (only the export of those functions has to be
> visible to gcc but they won't be required at link time and huge_memory.o need
> not be built at all).
> 
> _PAGE_BIT_UNUSED1 is never used for pmd, only on pte.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -168,6 +168,19 @@ extern void cleanup_highmap(void);
>  #define	kc_offset_to_vaddr(o) ((o) | ~__VIRTUAL_MASK)
>  
>  #define __HAVE_ARCH_PTE_SAME
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline int pmd_trans_splitting(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & _PAGE_SPLITTING;
> +}
> +
> +static inline int pmd_trans_huge(pmd_t pmd)
> +{
> +	return pmd_val(pmd) & _PAGE_PSE;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -22,6 +22,7 @@
>  #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
>  #define _PAGE_BIT_SPECIAL	_PAGE_BIT_UNUSED1
>  #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_UNUSED1
> +#define _PAGE_BIT_SPLITTING	_PAGE_BIT_UNUSED1 /* only valid on a PSE pmd */
>  #define _PAGE_BIT_NX           63       /* No execute: only valid after cpuid check */
>  
>  /* If _PAGE_BIT_PRESENT is clear, we use these: */
> @@ -45,6 +46,7 @@
>  #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
>  #define _PAGE_SPECIAL	(_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
>  #define _PAGE_CPA_TEST	(_AT(pteval_t, 1) << _PAGE_BIT_CPA_TEST)
> +#define _PAGE_SPLITTING	(_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
>  #ifdef CONFIG_KMEMCHECK
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -348,6 +348,11 @@ extern void untrack_pfn_vma(struct vm_ar
>  				unsigned long size);
>  #endif
>  
> +#ifndef CONFIG_TRANSPARENT_HUGEPAGE
> +#define pmd_trans_huge(pmd) 0
> +#define pmd_trans_splitting(pmd) 0
> +#endif
> +

Usually it is insisted upon that this looks like

static inline int pmd_trans_huge(pmd_t pmd)
{
	return 0;
}

I understand the static inline form is preferred to avoid any possibility of
side-effects and to keep the type checking, and I am 99% certain the compiler
still does the right thing with it. Still, with no obvious side-effects here;

Acked-by: Mel Gorman <mel@csn.ul.ie>
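
FWIW, spelled out (an untested sketch of what I mean, nothing more), the
generic fallback would read:

#ifndef CONFIG_TRANSPARENT_HUGEPAGE
static inline int pmd_trans_huge(pmd_t pmd)
{
	return 0;
}

static inline int pmd_trans_splitting(pmd_t pmd)
{
	return 0;
}
#endif

but as I said, no objection to the macros as they stand.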

>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_GENERIC_PGTABLE_H */
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 17 of 66] add pmd mangling generic functions
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 12:52     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 12:52 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:52PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Some are needed to build but not actually used on archs not supporting
> transparent hugepages. Others like pmdp_clear_flush are used by x86 too.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 18 of 66] add pmd mangling functions to x86
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 13:04     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 13:04 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:53PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add needed pmd mangling functions with simmetry with their pte counterparts.

symmetry

> pmdp_freeze_flush is the only exception only present on the pmd side and it's
> needed to serialize the VM against split_huge_page, it simply atomically clears
> the present bit in the same way pmdp_clear_flush_young atomically clears the
> accessed bit (and both need to flush the tlb to make it effective, which is
> mandatory to happen synchronously for pmdp_freeze_flush).

I don't see a pmdp_freeze_flush defined in the patch. Did you mean
pmdp_splitting_flush? Even if it is, it's the splitting bit you are
dealing with, which isn't the same as the present bit. I'm missing
something.

> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -302,15 +302,15 @@ pmd_t *populate_extra_pmd(unsigned long 
>  pte_t *populate_extra_pte(unsigned long vaddr);
>  #endif	/* __ASSEMBLY__ */
>  
> +#ifndef __ASSEMBLY__
> +#include <linux/mm_types.h>
> +
>  #ifdef CONFIG_X86_32
>  # include "pgtable_32.h"
>  #else
>  # include "pgtable_64.h"
>  #endif
>  
> -#ifndef __ASSEMBLY__
> -#include <linux/mm_types.h>
> -

Stupid question: Why is this move necessary?

>  static inline int pte_none(pte_t pte)
>  {
>  	return !pte.pte;
> @@ -353,7 +353,7 @@ static inline unsigned long pmd_page_vad
>   * Currently stuck as a macro due to indirect forward reference to
>   * linux/mmzone.h's __section_mem_map_addr() definition:
>   */
> -#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> +#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
>  

Why is it now necessary to use PTE_PFN_MASK?

>  /*
>   * the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -59,6 +59,16 @@ static inline void native_set_pte_atomic
>  	native_set_pte(ptep, pte);
>  }
>  
> +static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
> +{
> +	*pmdp = pmd;
> +}
> +
> +static inline void native_pmd_clear(pmd_t *pmd)
> +{
> +	native_set_pmd(pmd, native_make_pmd(0));
> +}
> +
>  static inline pte_t native_ptep_get_and_clear(pte_t *xp)
>  {
>  #ifdef CONFIG_SMP
> @@ -72,14 +82,17 @@ static inline pte_t native_ptep_get_and_
>  #endif
>  }
>  
> -static inline void native_set_pmd(pmd_t *pmdp, pmd_t pmd)
> +static inline pmd_t native_pmdp_get_and_clear(pmd_t *xp)
>  {
> -	*pmdp = pmd;
> -}
> -
> -static inline void native_pmd_clear(pmd_t *pmd)
> -{
> -	native_set_pmd(pmd, native_make_pmd(0));
> +#ifdef CONFIG_SMP
> +	return native_make_pmd(xchg(&xp->pmd, 0));
> +#else
> +	/* native_local_pmdp_get_and_clear,
> +	   but duplicated because of cyclic dependency */
> +	pmd_t ret = *xp;
> +	native_pmd_clear(xp);
> +	return ret;
> +#endif
>  }
>  
>  static inline void native_set_pud(pud_t *pudp, pud_t pud)
> @@ -181,6 +194,98 @@ static inline int pmd_trans_huge(pmd_t p
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +#define mk_pmd(page, pgprot)   pfn_pmd(page_to_pfn(page), (pgprot))
> +
> +#define  __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
> +extern int pmdp_set_access_flags(struct vm_area_struct *vma,
> +				 unsigned long address, pmd_t *pmdp,
> +				 pmd_t entry, int dirty);
> +
> +#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
> +extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +				     unsigned long addr, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
> +extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +				  unsigned long address, pmd_t *pmdp);
> +
> +
> +#define __HAVE_ARCH_PMDP_SPLITTING_FLUSH
> +extern void pmdp_splitting_flush(struct vm_area_struct *vma,
> +				 unsigned long addr, pmd_t *pmdp);
> +
> +#define __HAVE_ARCH_PMD_WRITE
> +static inline int pmd_write(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_RW;
> +}
> +
> +#define __HAVE_ARCH_PMDP_GET_AND_CLEAR
> +static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm, unsigned long addr,
> +				       pmd_t *pmdp)
> +{
> +	pmd_t pmd = native_pmdp_get_and_clear(pmdp);
> +	pmd_update(mm, addr, pmdp);
> +	return pmd;
> +}
> +
> +#define __HAVE_ARCH_PMDP_SET_WRPROTECT
> +static inline void pmdp_set_wrprotect(struct mm_struct *mm,
> +				      unsigned long addr, pmd_t *pmdp)
> +{
> +	clear_bit(_PAGE_BIT_RW, (unsigned long *)&pmdp->pmd);
> +	pmd_update(mm, addr, pmdp);
> +}
> +
> +static inline int pmd_young(pmd_t pmd)
> +{
> +	return pmd_flags(pmd) & _PAGE_ACCESSED;
> +}
> +
> +static inline pmd_t pmd_set_flags(pmd_t pmd, pmdval_t set)
> +{
> +	pmdval_t v = native_pmd_val(pmd);
> +
> +	return native_make_pmd(v | set);
> +}
> +
> +static inline pmd_t pmd_clear_flags(pmd_t pmd, pmdval_t clear)
> +{
> +	pmdval_t v = native_pmd_val(pmd);
> +
> +	return native_make_pmd(v & ~clear);
> +}
> +
> +static inline pmd_t pmd_mkold(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_wrprotect(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_RW);
> +}
> +
> +static inline pmd_t pmd_mkdirty(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_DIRTY);
> +}
> +
> +static inline pmd_t pmd_mkhuge(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_PSE);
> +}
> +
> +static inline pmd_t pmd_mkyoung(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_mkwrite(pmd_t pmd)
> +{
> +	return pmd_set_flags(pmd, _PAGE_RW);
> +}
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -320,6 +320,25 @@ int ptep_set_access_flags(struct vm_area
>  	return changed;
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +int pmdp_set_access_flags(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp,
> +			  pmd_t entry, int dirty)
> +{
> +	int changed = !pmd_same(*pmdp, entry);
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> +	if (changed && dirty) {
> +		*pmdp = entry;
> +		pmd_update_defer(vma->vm_mm, address, pmdp);
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	}
> +
> +	return changed;
> +}
> +#endif
> +
>  int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  			      unsigned long addr, pte_t *ptep)
>  {
> @@ -335,6 +354,23 @@ int ptep_test_and_clear_young(struct vm_
>  	return ret;
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> +			      unsigned long addr, pmd_t *pmdp)
> +{
> +	int ret = 0;
> +
> +	if (pmd_young(*pmdp))
> +		ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
> +					 (unsigned long *) &pmdp->pmd);
> +
> +	if (ret)
> +		pmd_update(vma->vm_mm, addr, pmdp);
> +
> +	return ret;
> +}
> +#endif
> +
>  int ptep_clear_flush_young(struct vm_area_struct *vma,
>  			   unsigned long address, pte_t *ptep)
>  {
> @@ -347,6 +383,36 @@ int ptep_clear_flush_young(struct vm_are
>  	return young;
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +int pmdp_clear_flush_young(struct vm_area_struct *vma,
> +			   unsigned long address, pmd_t *pmdp)
> +{
> +	int young;
> +
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> +	young = pmdp_test_and_clear_young(vma, address, pmdp);
> +	if (young)
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +
> +	return young;
> +}
> +
> +void pmdp_splitting_flush(struct vm_area_struct *vma,
> +			  unsigned long address, pmd_t *pmdp)
> +{
> +	int set;
> +	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	set = !test_and_set_bit(_PAGE_BIT_SPLITTING,
> +				(unsigned long *)&pmdp->pmd);
> +	if (set) {
> +		pmd_update(vma->vm_mm, address, pmdp);
> +		/* need tlb flush only to serialize against gup-fast */
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +	}
> +}
> +#endif
> +

The implementations look fine but I'm having trouble reconciling what
the leader says with the patch :(

>  /**
>   * reserve_top_address - reserves a hole in the top of kernel address space
>   * @reserve - size of hole to reserve
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 22 of 66] clear page compound
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 13:11     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 13:11 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:57PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> split_huge_page must transform a compound page to a regular page and needs
> ClearPageCompound.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 23 of 66] add pmd_huge_pte to mm_struct
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-18 13:13     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 13:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:58PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> This increases the size of the mm struct a bit but it is needed to preallocate
> one pte for each hugepage so that split_huge_page will not require a fail path.
> Guarantee of success is a fundamental property of split_huge_page to avoid
> decreasing swapping reliability and to avoid adding -ENOMEM fail paths that
> would otherwise force the hugepage-unaware VM code to learn rolling back in the
> middle of its pte mangling operations (if anything we need it to learn to
> handle pmd_trans_huge natively rather than being capable of rollback). When
> split_huge_page runs, a pte is needed for the split to succeed, to map the newly
> split regular pages with a regular pte.  This way all existing VM code
> remains backwards compatible by just adding a split_huge_page* one-liner. The
> memory waste of those preallocated ptes is negligible and so it is worth it.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 28 of 66] _GFP_NO_KSWAPD
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 13:18     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 13:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:03PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Transparent hugepage allocations must be allowed not to invoke kswapd or any
> other kind of indirect reclaim (especially when the defrag sysfs control is
> disabled). It's unacceptable to swap out anonymous pages (potentially
> anonymous transparent hugepages) in order to create new transparent hugepages.
> This is true for the MADV_HUGEPAGE areas too (swapping out a kvm virtual
> machine and so having it suffer an unbearable slowdown, so another one with
> guest physical memory marked MADV_HUGEPAGE can run 30% faster if it is running
> memory intensive workloads, makes no sense). If a transparent hugepage
> allocation fails the slowdown is minor and there is total fallback, so kswapd
> should never be asked to swapout memory to allow the high order allocation to
> succeed.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -81,13 +81,15 @@ struct vm_area_struct;
>  #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
>  #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
>  
> +#define __GFP_NO_KSWAPD	((__force gfp_t)0x400000u)
> +

This is not an exact merge with what's currently in mm. Look at the top
of gfp.h and see "Plain integer GFP bitmasks. Do not use this
directly.". The 0x400000u definition needs to go there and this becomes

#define __GFP_NO_KSWAPD		((__force gfp_t)___GFP_NO_KSWAPD)

What you have just generates sparse warnings (I believe), so it's
harmless.
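
i.e. something like this (sketch only, following the ___GFP_FOO naming used
for the other plain-integer masks at the top of gfp.h):

/* in the "Plain integer GFP bitmasks. Do not use this directly." block */
#define ___GFP_NO_KSWAPD	0x400000u

/* and then the annotated flag built on top of it */
#define __GFP_NO_KSWAPD		((__force gfp_t)___GFP_NO_KSWAPD)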

>  /*
>   * This may seem redundant, but it's a way of annotating false positives vs.
>   * allocations that simply cannot be supported (e.g. page tables).
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 22	/* Room for 22 __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /* This equals 0, but use constants in case they ever change */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1996,7 +1996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, u
>  		goto nopage;
>  
>  restart:
> -	wake_all_kswapd(order, zonelist, high_zoneidx);
> +	if (!(gfp_mask & __GFP_NO_KSWAPD))
> +		wake_all_kswapd(order, zonelist, high_zoneidx);
>  

Other than needing to define ___GFP_NO_KSWAPD

Acked-by: Mel Gorman <mel@csn.ul.ie>

>  	/*
>  	 * OK, we're below the kswapd watermark and have kicked background
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 29 of 66] don't alloc harder for gfp nomemalloc even if nowait
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 13:19     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 13:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:04PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Not worth throwing away the precious reserved free memory pool for allocations
> that can fail gracefully (either through mempool or because they're transhuge
> allocations later falling back to 4k allocations).
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

> ---
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1941,7 +1941,12 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
>  	if (!wait) {
> -		alloc_flags |= ALLOC_HARDER;
> +		/*
> +		 * Not worth trying to allocate harder for
> +		 * __GFP_NOMEMALLOC even if it can't schedule.
> +		 */
> +		if  (!(gfp_mask & __GFP_NOMEMALLOC))
> +			alloc_flags |= ALLOC_HARDER;
>  		/*
>  		 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
>  		 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 30 of 66] transparent hugepage core
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 15:12     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:05PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
> 
> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap
> 
> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing
> 
> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
>    not null)
> 
> 4) avoidance of reservation and maximization of use of hugepages whenever
>    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
>    1 machine with 1 database with 1 database cache with 1 database cache size
>    known at boot time. It's definitely not feasible with a virtualization
>    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
>    with an unknown size of each virtual machine with an unknown amount of
>    pagecache that could be potentially useful in the host for guest not using
>    O_DIRECT (aka cache=off).
> 
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and
> the linux guest both uses this patch (though the guest will limit the addition
> speedup to anonymous regions only for now...).  Even more important is that the
> tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
> 
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
> 
> The most important design choice is: always fallback to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
> 
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split an
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
> 
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be an "harmless" addition later if this
> initial part is agreed upon. It also should be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> minimal sign of trouble and we can try again later. collapse_huge_page
> will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will
> work similar to madvise(MADV_MERGEABLE).
> 
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
> 
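
As a usage sketch of the "madvise" policy above (e.g. after
"echo madvise > /sys/kernel/mm/transparent_hugepage/enabled"): the
snippet below is hypothetical userspace code, assuming the
MADV_HUGEPAGE flag this series adds and a 2M PMD hugepage size on
x86-64, with error handling omitted.

	#include <stdlib.h>
	#include <sys/mman.h>

	static void *alloc_hugepage_backed(size_t size)
	{
		void *p = NULL;

		/* align to the 2M pmd size so whole hugepages can back it */
		if (posix_memalign(&p, 2UL * 1024 * 1024, size))
			return NULL;
		/* a hint only: the kernel still falls back to 4k pages */
		madvise(p, size, MADV_HUGEPAGE);
		return p;
	}
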
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would always result in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no current existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge on the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now, maybe if we change the gup interface
> substantially we can avoid this locking, I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
> 
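
A very rough sketch of the ordering described above on the put_page()
side of a tail page (this is only an approximation of the idea; the
actual code lives in the refcounting patch of this series, not in this
mail):

	if (unlikely(PageTail(page))) {
		/* read first_page before re-checking PageTail */
		struct page *page_head = page->first_page;
		smp_rmb();
		if (likely(PageTail(page))) {
			unsigned long flags;
			/* only now take the compound_lock on the head */
			flags = compound_lock_irqsave(page_head);
			/* ... adjust head/tail counts under the lock ... */
			compound_unlock_irqrestore(page_head, flags);
		}
	}
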
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage, so we can't avoid solving
> the race of put_page against split_huge_page_refcount if we want to achieve
> a complete hugepage feature for KVM.
> 
> Swap and oom work fine (well just like with regular pages ;). MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, that didn't have a range check but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care about a range and there's just the pmd young bit to
> check in that case.
> 
> NOTE: in some cases if the L2 cache is small, this may slow down and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to non-temporal stores. I also extensively
> researched ways to avoid this cache thrashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it: they add huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup, but they still don't improve
> substantially the cache thrashing during startup if the prefault happens in >4k
> chunks.  One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the thrashing is the worst one can generate
> in those copies (cow of 4k page copies aren't so well colored so they thrash
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't pay off
> substantially with today's hardware it will pay off even less in the future with
> larger l2 caches, and the prefault logic would bloat the VM a lot. If one is
> embedded, transparent_hugepage can be disabled during boot with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions) that will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE region (both at page
> fault time, and if enabled with the collapse_huge_page too through the kernel
> daemon).
> 
> This patch supports only hugepages mapped in the pmd; archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevent mixing different page sizes in the same
> regions, so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be expected to be found unfragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so they require reservation. hugetlbfs is the "reservation way"; the
> point of transparent hugepages is not to have any reservation at all and to
> maximize the use of cache and hugepages at all times automatically.
> 
> Some performance results:
> 
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
> 
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> 
> #define SIZE (3UL*1024*1024*1024)
> 
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	return 0;
> }
> ============
> 

All that seems fine to me. Nits in part that are simply not worth
calling out. In principle, I Agree With This :)

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> * * *
> adapt to mm_counter in -mm
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> The interface changed slightly.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> * * *
> transparent hugepage bootparam
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Allow transparent_hugepage=always|never|madvise at boot.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
>  	return pmd_set_flags(pmd, _PAGE_RW);
>  }
>  
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
> +}
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -108,6 +108,9 @@ struct vm_area_struct;
>  				 __GFP_HARDWALL | __GFP_HIGHMEM | \
>  				 __GFP_MOVABLE)
>  #define GFP_IOFS	(__GFP_IO | __GFP_FS)
> +#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> +			 __GFP_NO_KSWAPD)
>  
>  #ifdef CONFIG_NUMA
>  #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,126 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +				      struct vm_area_struct *vma,
> +				      unsigned long address, pmd_t *pmd,
> +				      unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +			 struct vm_area_struct *vma);
> +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +					  unsigned long addr,
> +					  pmd_t *pmd,
> +					  unsigned int flags);
> +extern int zap_huge_pmd(struct mmu_gather *tlb,
> +			struct vm_area_struct *vma,
> +			pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> +	TRANSPARENT_HUGEPAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> +	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
> +
> +enum page_check_address_pmd_flag {
> +	PAGE_CHECK_ADDRESS_PMD_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> +				     struct mm_struct *mm,
> +				     unsigned long address,
> +				     enum page_check_address_pmd_flag flag);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
> +#define HPAGE_PMD_MASK HPAGE_MASK
> +#define HPAGE_PMD_SIZE HPAGE_SIZE
> +
> +#define transparent_hugepage_enabled(__vma)				\
> +	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma)				\
> +	((transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow()				\
> +	(transparent_hugepage_flags &					\
> +	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			  pmd_t *dst_pmd, pmd_t *src_pmd,
> +			  struct vm_area_struct *vma,
> +			  unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> +			    struct vm_area_struct *vma, unsigned long address,
> +			    pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern int split_huge_page(struct page *page);
> +extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
> +#define split_huge_page_pmd(__mm, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		if (unlikely(pmd_trans_huge(*____pmd)))			\
> +			__split_huge_page_pmd(__mm, ____pmd);		\
> +	}  while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> +		/*							\
> +		 * spin_unlock_wait() is just a loop in C and so the	\
> +		 * CPU can reorder anything around it.			\
> +		 */							\
> +		smp_mb();						\

Just a note as I see nothing wrong with this but that's a good spot. The
unlock isn't a memory barrier. Out of curiosity, does it really need to be
a full barrier or would a write barrier have been enough?

> +		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> +		       pmd_trans_huge(*____pmd));			\
> +	} while (0)
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> +#if HPAGE_PMD_ORDER > MAX_ORDER
> +#error "hugepages can't be allocated by the buddy allocator"
> +#endif
> +
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +static inline int PageTransHuge(struct page *page)
> +{
> +	VM_BUG_ON(PageTail(page));
> +	return PageHead(page);
> +}

gfp.h seems an odd place for these. Should the flags go in page-flags.h
and maybe put vma_address() in internal.h?

Not a biggie.

> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
> +#define HPAGE_PMD_MASK ({ BUG(); 0; })
> +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
> +
> +#define transparent_hugepage_enabled(__vma) 0
> +
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> +	return 0;
> +}
> +#define split_huge_page_pmd(__mm, __pmd)	\
> +	do { } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)	\
> +	do { } while (0)
> +#define PageTransHuge(page) 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void 
>  #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
>  #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
>  #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
> +#if BITS_PER_LONG > 32
> +#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
> +#endif
>  
>  #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -240,6 +243,7 @@ struct inode;
>   * files which need it (119 of them)
>   */
>  #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>  
>  /*
>   * Methods to modify the page usage count.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
>  }
>  
>  static inline void
> +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> +		       struct list_head *head)
> +{
> +	list_add(&page->lru, head);
> +	__inc_zone_state(zone, NR_LRU_BASE + l);
> +	mem_cgroup_add_lru_list(page, l);
> +}
> +
> +static inline void
>  add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
>  {
> -	list_add(&page->lru, &zone->lru[l].list);
> -	__inc_zone_state(zone, NR_LRU_BASE + l);
> -	mem_cgroup_add_lru_list(page, l);
> +	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
>  }
>  

Do these really need to be in a public header or can they move to
mm/swap.c?

>  static inline void
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
>  /* linux/mm/swap.c */
>  extern void __lru_cache_add(struct page *, enum lru_list lru);
>  extern void lru_cache_add_lru(struct page *, enum lru_list lru);
> +extern void lru_add_page_tail(struct zone* zone,
> +			      struct page *page, struct page *page_tail);
>  extern void activate_page(struct page *);
>  extern void mark_page_accessed(struct page *);
>  extern void lru_add_drain(void);
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c
> @@ -0,0 +1,899 @@
> +/*
> + *  Copyright (C) 2009  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> +	(1<<TRANSPARENT_HUGEPAGE_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag enabled,
> +				enum transparent_hugepage_flag req_madv)
> +{
> +	if (test_bit(enabled, &transparent_hugepage_flags)) {
> +		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> +		return sprintf(buf, "[always] madvise never\n");
> +	} else if (test_bit(req_madv, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [madvise] never\n");
> +	else
> +		return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag enabled,
> +				 enum transparent_hugepage_flag req_madv)
> +{
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		set_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		set_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_FLAG,
> +				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_FLAG,
> +				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> +	__ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +static ssize_t single_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag flag)
> +{
> +	if (test_bit(flag, &transparent_hugepage_flags))
> +		return sprintf(buf, "[yes] no\n");
> +	else
> +		return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag flag)
> +{
> +	if (!memcmp("yes", buf,
> +		    min(sizeof("yes")-1, count))) {
> +		set_bit(flag, &transparent_hugepage_flags);
> +	} else if (!memcmp("no", buf,
> +			   min(sizeof("no")-1, count))) {
> +		clear_bit(flag, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +/*
> + * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
> + * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
> + * memory just to allocate one more hugepage.
> + */
> +static ssize_t defrag_show(struct kobject *kobj,
> +			   struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> +	__ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t debug_cow_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return single_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> +			       struct kobj_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	return single_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> +	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> +	&enabled_attr.attr,
> +	&defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> +	&debug_cow_attr.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> +	.attrs = hugepage_attr,
> +	.name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init hugepage_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> +	int err;
> +
> +	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> +	if (err)
> +		printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> +	return 0;
> +}
> +module_init(hugepage_init)
> +
> +static int __init setup_transparent_hugepage(char *str)
> +{
> +	int ret = 0;
> +	if (!str)
> +		goto out;
> +	if (!strcmp(str, "always")) {
> +		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			&transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "madvise")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			&transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "never")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	}
> +out:
> +	if (!ret)
> +		printk(KERN_WARNING
> +		       "transparent_hugepage= cannot parse, ignored\n");
> +	return ret;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> +				 struct mm_struct *mm)
> +{
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +

assert_spin_locked() ?

> +	/* FIFO */
> +	if (!mm->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> +	mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pmd = pmd_mkwrite(pmd);
> +	return pmd;
> +}
> +
> +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long haddr, pmd_t *pmd,
> +					struct page *page)
> +{
> +	int ret = 0;
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(!PageCompound(page));
> +	pgtable = pte_alloc_one(mm, haddr);
> +	if (unlikely(!pgtable)) {
> +		put_page(page);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	clear_huge_page(page, haddr, HPAGE_PMD_NR);
> +	__SetPageUptodate(page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_none(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		put_page(page);
> +		pte_free(mm, pgtable);
> +	} else {
> +		pmd_t entry;
> +		entry = mk_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		/*
> +		 * The spinlocking to take the lru_lock inside
> +		 * page_add_new_anon_rmap() acts as a full memory
> +		 * barrier to be sure clear_huge_page writes become
> +		 * visible after the set_pmd_at() write.
> +		 */
> +		page_add_new_anon_rmap(page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		prepare_pmd_huge_pte(pgtable, mm);
> +		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +		spin_unlock(&mm->page_table_lock);
> +	}
> +
> +	return ret;
> +}
> +
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> +	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
> +			   HPAGE_PMD_ORDER);
> +}
> +
> +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       unsigned int flags)
> +{
> +	struct page *page;
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	pte_t *pte;
> +
> +	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> +		if (unlikely(anon_vma_prepare(vma)))
> +			return VM_FAULT_OOM;
> +		page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +		if (unlikely(!page))
> +			goto out;
> +
> +		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
> +	}
> +out:
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> +		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
> +	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct page *src_page;
> +	pmd_t pmd;
> +	pgtable_t pgtable;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	pgtable = pte_alloc_one(dst_mm, addr);
> +	if (unlikely(!pgtable))
> +		goto out;
> +
> +	spin_lock(&dst_mm->page_table_lock);
> +	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> +	ret = -EAGAIN;
> +	pmd = *src_pmd;
> +	if (unlikely(!pmd_trans_huge(pmd))) {
> +		pte_free(dst_mm, pgtable);
> +		goto out_unlock;
> +	}
> +	if (unlikely(pmd_trans_splitting(pmd))) {
> +		/* split huge page running from under us */
> +		spin_unlock(&src_mm->page_table_lock);
> +		spin_unlock(&dst_mm->page_table_lock);
> +		pte_free(dst_mm, pgtable);
> +
> +		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> +		goto out;
> +	}
> +	src_page = pmd_page(pmd);
> +	VM_BUG_ON(!PageHead(src_page));
> +	get_page(src_page);
> +	page_dup_rmap(src_page);
> +	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +
> +	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +	prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> +	ret = 0;
> +out_unlock:
> +	spin_unlock(&src_mm->page_table_lock);
> +	spin_unlock(&dst_mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	/* FIFO */
> +	pgtable = mm->pmd_huge_pte;
> +	if (list_empty(&pgtable->lru))
> +		mm->pmd_huge_pte = NULL;
> +	else {
> +		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> +					      struct page, lru);
> +		list_del(&pgtable->lru);
> +	}
> +	return pgtable;
> +}
> +
> +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address,
> +					pmd_t *pmd, pmd_t orig_pmd,
> +					struct page *page,
> +					unsigned long haddr)
> +{
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	int ret = 0, i;
> +	struct page **pages;
> +
> +	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> +			GFP_KERNEL);
> +	if (unlikely(!pages)) {
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> +					  vma, address);
> +		if (unlikely(!pages[i])) {
> +			while (--i >= 0)
> +				put_page(pages[i]);
> +			kfree(pages);
> +			ret |= VM_FAULT_OOM;
> +			goto out;
> +		}
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		copy_user_highpage(pages[i], page + i,
> +				   haddr + PAGE_SHIFT*i, vma);
> +		__SetPageUptodate(pages[i]);
> +		cond_resched();
> +	}
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_free_pages;
> +	VM_BUG_ON(!PageHead(page));
> +
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +	/* leave pmd empty until pte is filled */
> +
> +	pgtable = get_pmd_huge_pte(mm);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +		entry = mk_pte(pages[i], vma->vm_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		page_add_new_anon_rmap(pages[i], vma, haddr);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		VM_BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
> +	}
> +	kfree(pages);
> +
> +	mm->nr_ptes++;
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +	page_remove_rmap(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	ret |= VM_FAULT_WRITE;
> +	put_page(page);
> +
> +out:
> +	return ret;
> +
> +out_free_pages:
> +	spin_unlock(&mm->page_table_lock);
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		put_page(pages[i]);
> +	kfree(pages);
> +	goto out;
> +}
> +
> +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> +	int ret = 0;
> +	struct page *page, *new_page;
> +	unsigned long haddr;
> +
> +	VM_BUG_ON(!vma->anon_vma);
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_unlock;
> +
> +	page = pmd_page(orig_pmd);
> +	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +	haddr = address & HPAGE_PMD_MASK;
> +	if (page_mapcount(page) == 1) {
> +		pmd_t entry;
> +		entry = pmd_mkyoung(orig_pmd);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
> +			update_mmu_cache(vma, address, entry);
> +		ret |= VM_FAULT_WRITE;
> +		goto out_unlock;
> +	}
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	if (transparent_hugepage_enabled(vma) &&
> +	    !transparent_hugepage_debug_cow())
> +		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +	else
> +		new_page = NULL;
> +
> +	if (unlikely(!new_page)) {
> +		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> +						   pmd, orig_pmd, page, haddr);
> +		put_page(page);
> +		goto out;
> +	}
> +
> +	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> +	__SetPageUptodate(new_page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	put_page(page);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		put_page(new_page);
> +	else {
> +		pmd_t entry;
> +		VM_BUG_ON(!PageHead(page));
> +		entry = mk_pmd(new_page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		pmdp_clear_flush_notify(vma, haddr, pmd);
> +		page_add_new_anon_rmap(new_page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		update_mmu_cache(vma, address, entry);
> +		page_remove_rmap(page);
> +		put_page(page);
> +		ret |= VM_FAULT_WRITE;
> +	}
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +				   unsigned long addr,
> +				   pmd_t *pmd,
> +				   unsigned int flags)
> +{
> +	struct page *page = NULL;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	if (flags & FOLL_WRITE && !pmd_write(*pmd))
> +		goto out;
> +
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!PageHead(page));
> +	if (flags & FOLL_TOUCH) {
> +		pmd_t _pmd;
> +		/*
> +		 * We should set the dirty bit only for FOLL_WRITE but
> +		 * for now the dirty bit in the pmd is meaningless.
> +		 * And if the dirty bit will become meaningful and
> +		 * we'll only set it with FOLL_WRITE, an atomic
> +		 * set_bit will be required on the pmd to set the
> +		 * young bit, instead of the current set_pmd_at.
> +		 */
> +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> +		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
> +	}
> +	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> +	VM_BUG_ON(!PageCompound(page));
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +
> +out:
> +	return page;
> +}
> +
> +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		 pmd_t *pmd)
> +{
> +	int ret = 0;
> +
> +	spin_lock(&tlb->mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*pmd))) {
> +		if (unlikely(pmd_trans_splitting(*pmd))) {
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma,
> +					     pmd);
> +		} else {
> +			struct page *page;
> +			pgtable_t pgtable;
> +			pgtable = get_pmd_huge_pte(tlb->mm);
> +			page = pmd_page(*pmd);
> +			pmd_clear(pmd);
> +			page_remove_rmap(page);
> +			VM_BUG_ON(page_mapcount(page) < 0);
> +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			VM_BUG_ON(!PageHead(page));
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			tlb_remove_page(tlb, page);
> +			pte_free(tlb->mm, pgtable);
> +			ret = 1;
> +		}
> +	} else
> +		spin_unlock(&tlb->mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> +			      struct mm_struct *mm,
> +			      unsigned long address,
> +			      enum page_check_address_pmd_flag flag)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd, *ret = NULL;
> +
> +	if (address & ~HPAGE_PMD_MASK)
> +		goto out;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	if (pmd_page(*pmd) != page)
> +		goto out;
> +	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> +		  pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd)) {
> +		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> +			  !pmd_trans_splitting(*pmd));
> +		ret = pmd;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> +				       struct vm_area_struct *vma,
> +				       unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd;
> +	int ret = 0;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> +	if (pmd) {
> +		/*
> +		 * We can't temporarily set the pmd to null in order
> +		 * to split it, the pmd must remain marked huge at all
> +		 * times or the VM won't take the pmd_trans_huge paths
> +		 * and it won't wait on the anon_vma->root->lock to
> +		 * serialize against split_huge_page*.
> +		 */
> +		pmdp_splitting_flush_notify(vma, address, pmd);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> +	int i;
> +	unsigned long head_index = page->index;
> +	struct zone *zone = page_zone(page);
> +
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irq(&zone->lru_lock);
> +	compound_lock(page);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +
> +		/* tail_page->_count cannot change */
> +		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> +		BUG_ON(page_count(page) <= 0);
> +		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> +		BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> +		/* after clearing PageTail the gup refcount can be released */
> +		smp_mb();
> +
> +		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		page_tail->flags |= (page->flags &
> +				     ((1L << PG_referenced) |
> +				      (1L << PG_swapbacked) |
> +				      (1L << PG_mlocked) |
> +				      (1L << PG_uptodate)));
> +		page_tail->flags |= (1L << PG_dirty);
> +
> +		/*
> +		 * 1) clear PageTail before overwriting first_page
> +		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> +		 */
> +		smp_wmb();
> +
> +		/*
> +		 * __split_huge_page_splitting() already set the
> +		 * splitting bit in all pmd that could map this
> +		 * hugepage, that will ensure no CPU can alter the
> +		 * mapcount on the head page. The mapcount is only
> +		 * accounted in the head page and it has to be
> +		 * transferred to all tail pages in the below code. So
> +		 * for this code to be safe, the split the mapcount
> +		 * can't change. But that doesn't mean userland can't
> +		 * keep changing and reading the page contents while
> +		 * we transfer the mapcount, so the pmd splitting
> +		 * status is achieved setting a reserved bit in the
> +		 * pmd, not by clearing the present bit.
> +		*/
> +		BUG_ON(page_mapcount(page_tail));
> +		page_tail->_mapcount = page->_mapcount;
> +
> +		BUG_ON(page_tail->mapping);
> +		page_tail->mapping = page->mapping;
> +
> +		page_tail->index = ++head_index;
> +
> +		BUG_ON(!PageAnon(page_tail));
> +		BUG_ON(!PageUptodate(page_tail));
> +		BUG_ON(!PageDirty(page_tail));
> +		BUG_ON(!PageSwapBacked(page_tail));
> +
> +		lru_add_page_tail(zone, page, page_tail);
> +	}
> +
> +	ClearPageCompound(page);
> +	compound_unlock(page);
> +	spin_unlock_irq(&zone->lru_lock);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +		BUG_ON(page_count(page_tail) <= 0);
> +		/*
> +		 * Tail pages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		put_page(page_tail);
> +	}
> +
> +	/*
> +	 * Only the head page (now become a regular page) is required
> +	 * to be pinned by the caller.
> +	 */
> +	BUG_ON(page_count(page) <= 0);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> +				 struct vm_area_struct *vma,
> +				 unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd, _pmd;
> +	int ret = 0, i;
> +	pgtable_t pgtable;
> +	unsigned long haddr;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> +	if (pmd) {
> +		pgtable = get_pmd_huge_pte(mm);
> +		pmd_populate(mm, &_pmd, pgtable);
> +
> +		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> +		     i++, haddr += PAGE_SIZE) {
> +			pte_t *pte, entry;
> +			BUG_ON(PageCompound(page+i));
> +			entry = mk_pte(page + i, vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			if (!pmd_write(*pmd))
> +				entry = pte_wrprotect(entry);
> +			else
> +				BUG_ON(page_mapcount(page) != 1);
> +			if (!pmd_young(*pmd))
> +				entry = pte_mkold(entry);
> +			pte = pte_offset_map(&_pmd, haddr);
> +			BUG_ON(!pte_none(*pte));
> +			set_pte_at(mm, haddr, pte, entry);
> +			pte_unmap(pte);
> +		}
> +
> +		mm->nr_ptes++;
> +		smp_wmb(); /* make pte visible before pmd */
> +		/*
> +		 * Up to this point the pmd is present and huge and
> +		 * userland has the whole access to the hugepage
> +		 * during the split (which happens in place). If we
> +		 * overwrite the pmd with the not-huge version
> +		 * pointing to the pte here (which of course we could
> +		 * if all CPUs were bug free), userland could trigger
> +		 * a small page size TLB miss on the small sized TLB
> +		 * while the hugepage TLB entry is still established
> +		 * in the huge TLB. Some CPU doesn't like that. See
> +		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> +		 * Erratum 383 on page 93. Intel should be safe but is
> +		 * also warns that it's only safe if the permission
> +		 * and cache attributes of the two entries loaded in
> +		 * the two TLB is identical (which should be the case
> +		 * here). But it is generally safer to never allow
> +		 * small and huge TLB entries for the same virtual
> +		 * address to be loaded simultaneously. So instead of
> +		 * doing "pmd_populate(); flush_tlb_range();" we first
> +		 * mark the current pmd notpresent (atomically because
> +		 * here the pmd_trans_huge and pmd_trans_splitting
> +		 * must remain set at all times on the pmd until the
> +		 * split is complete for this pmd), then we flush the
> +		 * SMP TLB and finally we write the non-huge version
> +		 * of the pmd entry with pmd_populate.
> +		 */
> +		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmd_populate(mm, pmd, pgtable);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +/* must be called with anon_vma->root->lock hold */
> +static void __split_huge_page(struct page *page,
> +			      struct anon_vma *anon_vma)
> +{
> +	int mapcount, mapcount2;
> +	struct anon_vma_chain *avc;
> +
> +	BUG_ON(!PageHead(page));
> +	BUG_ON(PageTail(page));
> +
> +	mapcount = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount += __split_huge_page_splitting(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != page_mapcount(page));
> +
> +	__split_huge_page_refcount(page);
> +
> +	mapcount2 = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount2 += __split_huge_page_map(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != mapcount2);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	int ret = 1;
> +
> +	BUG_ON(!PageAnon(page));
> +	anon_vma = page_lock_anon_vma(page);
> +	if (!anon_vma)
> +		goto out;
> +	ret = 0;
> +	if (!PageCompound(page))
> +		goto out_unlock;
> +
> +	BUG_ON(!PageSwapBacked(page));
> +	__split_huge_page(page, anon_vma);
> +
> +	BUG_ON(PageCompound(page));
> +out_unlock:
> +	page_unlock_anon_vma(anon_vma);
> +out:
> +	return ret;
> +}
> +
> +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	struct page *page;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_trans_huge(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		return;
> +	}
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!page_count(page));
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	split_huge_page(page);
> +
> +	put_page(page);
> +	BUG_ON(pmd_trans_huge(*pmd));
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -726,9 +726,9 @@ out_set_pte:
>  	return 0;
>  }
>  
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> -		unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> +		   unsigned long addr, unsigned long end)
>  {
>  	pte_t *orig_src_pte, *orig_dst_pte;
>  	pte_t *src_pte, *dst_pte;
> @@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct 
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*src_pmd)) {
> +			int err;
> +			err = copy_huge_pmd(dst_mm, src_mm,
> +					    dst_pmd, src_pmd, addr, vma);
> +			if (err == -ENOMEM)
> +				return -ENOMEM;
> +			if (!err)
> +				continue;
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(src_pmd))
>  			continue;
>  		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*pmd)) {
> +			if (next-addr != HPAGE_PMD_SIZE)
> +				split_huge_page_pmd(vma->vm_mm, pmd);
> +			else if (zap_huge_pmd(tlb, vma, pmd)) {
> +				(*zap_work)--;
> +				continue;
> +			}
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(pmd)) {
>  			(*zap_work)--;
>  			continue;
> @@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		goto no_page_table;
> -	if (pmd_huge(*pmd)) {
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
>  		BUG_ON(flags & FOLL_GET);
>  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
>  		goto out;
>  	}
> +	if (pmd_trans_huge(*pmd)) {
> +		spin_lock(&mm->page_table_lock);
> +		if (likely(pmd_trans_huge(*pmd))) {
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				page = follow_trans_huge_pmd(mm, address,
> +							     pmd, flags);
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +		/* fall through */
> +	}
>  	if (unlikely(pmd_bad(*pmd)))
>  		goto no_page_table;
>  
> @@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
>   * but allow concurrent faults), and pte mapped but not yet locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long address,
> -		pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> +		     struct vm_area_struct *vma, unsigned long address,
> +		     pte_t *pte, pmd_t *pmd, unsigned int flags)
>  {
>  	pte_t entry;
>  	spinlock_t *ptl;
> @@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
>  	pmd = pmd_alloc(mm, pud, address);
>  	if (!pmd)
>  		return VM_FAULT_OOM;
> -	pte = pte_alloc_map(mm, vma, pmd, address);
> -	if (!pte)
> +	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> +		if (!vma->vm_ops)
> +			return do_huge_pmd_anonymous_page(mm, vma, address,
> +							  pmd, flags);
> +	} else {
> +		pmd_t orig_pmd = *pmd;
> +		barrier();

What is this barrier for?

> +		if (pmd_trans_huge(orig_pmd)) {
> +			if (flags & FAULT_FLAG_WRITE &&
> +			    !pmd_write(orig_pmd) &&
> +			    !pmd_trans_splitting(orig_pmd))
> +				return do_huge_pmd_wp_page(mm, vma, address,
> +							   pmd, orig_pmd);
> +			return 0;
> +		}
> +	}
> +
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
>  		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
>  
>  	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
>   * Returns virtual address or -EFAULT if page's index/offset is not
>   * within the range mapped the @vma.
>   */
> -static inline unsigned long
> +inline unsigned long
>  vma_address(struct page *page, struct vm_area_struct *vma)
>  {
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page 
>  	pmd = pmd_offset(pud, address);
>  	if (!pmd_present(*pmd))
>  		return NULL;
> +	if (pmd_trans_huge(*pmd))
> +		return NULL;
>  
>  	pte = pte_offset_map(pmd, address);
>  	/* Make a quick check before getting the lock */
> @@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
>  			unsigned long *vm_flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *pte;
> -	spinlock_t *ptl;
>  	int referenced = 0;
>  
> -	pte = page_check_address(page, mm, address, &ptl, 0);
> -	if (!pte)
> -		goto out;
> -
>  	/*
>  	 * Don't want to elevate referenced for mlocked page that gets this far,
>  	 * in order that it progresses to try_to_unmap and is moved to the
>  	 * unevictable list.
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
> -		*mapcount = 1;	/* break early from loop */
> +		*mapcount = 0;	/* break early from loop */
>  		*vm_flags |= VM_LOCKED;
> -		goto out_unmap;
> -	}
> -
> -	if (ptep_clear_flush_young_notify(vma, address, pte)) {
> -		/*
> -		 * Don't treat a reference through a sequentially read
> -		 * mapping as such.  If the page has been used in
> -		 * another mapping, we will catch it; if this other
> -		 * mapping is already gone, the unmap path will have
> -		 * set PG_referenced or activated the page.
> -		 */
> -		if (likely(!VM_SequentialReadHint(vma)))
> -			referenced++;
> +		goto out;
>  	}
>  
>  	/* Pretend the page is referenced if the task has the
> @@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
>  			rwsem_is_locked(&mm->mmap_sem))
>  		referenced++;
>  
> -out_unmap:
> +	if (unlikely(PageTransHuge(page))) {
> +		pmd_t *pmd;
> +
> +		spin_lock(&mm->page_table_lock);
> +		pmd = page_check_address_pmd(page, mm, address,
> +					     PAGE_CHECK_ADDRESS_PMD_FLAG);
> +		if (pmd && !pmd_trans_splitting(*pmd) &&
> +		    pmdp_clear_flush_young_notify(vma, address, pmd))
> +			referenced++;
> +		spin_unlock(&mm->page_table_lock);
> +	} else {
> +		pte_t *pte;
> +		spinlock_t *ptl;
> +
> +		pte = page_check_address(page, mm, address, &ptl, 0);
> +		if (!pte)
> +			goto out;
> +
> +		if (ptep_clear_flush_young_notify(vma, address, pte)) {
> +			/*
> +			 * Don't treat a reference through a sequentially read
> +			 * mapping as such.  If the page has been used in
> +			 * another mapping, we will catch it; if this other
> +			 * mapping is already gone, the unmap path will have
> +			 * set PG_referenced or activated the page.
> +			 */
> +			if (likely(!VM_SequentialReadHint(vma)))
> +				referenced++;
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +
>  	(*mapcount)--;
> -	pte_unmap_unlock(pte, ptl);
>  
>  	if (referenced)
>  		*vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
>  
>  EXPORT_SYMBOL(__pagevec_release);
>  
> +/* used by __split_huge_page_refcount() */
> +void lru_add_page_tail(struct zone* zone,
> +		       struct page *page, struct page *page_tail)
> +{
> +	int active;
> +	enum lru_list lru;
> +	const int file = 0;
> +	struct list_head *head;
> +
> +	VM_BUG_ON(!PageHead(page));
> +	VM_BUG_ON(PageCompound(page_tail));
> +	VM_BUG_ON(PageLRU(page_tail));
> +	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
> +
> +	SetPageLRU(page_tail);
> +
> +	if (page_evictable(page_tail, NULL)) {
> +		if (PageActive(page)) {
> +			SetPageActive(page_tail);
> +			active = 1;
> +			lru = LRU_ACTIVE_ANON;
> +		} else {
> +			active = 0;
> +			lru = LRU_INACTIVE_ANON;
> +		}
> +		update_page_reclaim_stat(zone, page_tail, file, active);
> +		if (likely(PageLRU(page)))
> +			head = page->lru.prev;
> +		else
> +			head = &zone->lru[lru].list;
> +		__add_page_to_lru_list(zone, page_tail, lru, head);
> +	} else {
> +		SetPageUnevictable(page_tail);
> +		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> +	}
> +}
> +
>  /*
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
> 

Other than a few minor questions, these seem very similar to what you
had before. There is a lot going on in this patch but I did not find
anything wrong.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 30 of 66] transparent hugepage core
@ 2010-11-18 15:12     ` Mel Gorman
  0 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:05PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Lately I've been working to make KVM use hugepages transparently
> without the usual restrictions of hugetlbfs. Some of the restrictions
> I'd like to see removed:
> 
> 1) hugepages have to be swappable or the guest physical memory remains
>    locked in RAM and can't be paged out to swap
> 
> 2) if a hugepage allocation fails, regular pages should be allocated
>    instead and mixed in the same vma without any failure and without
>    userland noticing
> 
> 3) if some task quits and more hugepages become available in the
>    buddy, guest physical memory backed by regular pages should be
>    relocated on hugepages automatically in regions under
>    madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
>    kernel daemon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
>    not null)
> 
> 4) avoidance of reservation and maximization of use of hugepages whenever
>    possible. Reservation (needed to avoid runtime fatal failures) may be ok for
>    1 machine with 1 database with 1 database cache with 1 database cache size
>    known at boot time. It's definitely not feasible with a virtualization
>    hypervisor usage like RHEV-H that runs an unknown number of virtual machines
>    with an unknown size of each virtual machine with an unknown amount of
>    pagecache that could be potentially useful in the host for guest not using
>    O_DIRECT (aka cache=off).
> 
> hugepages in the virtualization hypervisor (and also in the guest!) are
> much more important than in a regular host not using virtualization, because
> with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in
> case only the hypervisor uses transparent hugepages, and they decrease the
> tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor
> and the linux guest use this patch (though the guest will limit the additional
> speedup to anonymous regions only for now...).  Even more important is that the
> tlb miss handler is much slower on an NPT/EPT guest than for a regular shadow
> paging or no-virtualization scenario. So maximizing the amount of virtual
> memory cached by the TLB pays off significantly more with NPT/EPT than without
> (even if there would be no significant speedup in the tlb-miss runtime).
> 
> The first (and more tedious) part of this work requires allowing the VM to
> handle anonymous hugepages mixed with regular pages transparently on regular
> anonymous vmas. This is what this patch tries to achieve in the least intrusive
> possible way. We want hugepages and hugetlb to be used in a way so that all
> applications can benefit without changes (as usual we leverage the KVM
> virtualization design: by improving the Linux VM at large, KVM gets the
> performance boost too).
> 
> The most important design choice is: always fall back to 4k allocation
> if the hugepage allocation fails! This is the _very_ opposite of some
> large pagecache patches that failed with -EIO back then if a 64k (or
> similar) allocation failed...
> 
> Second important decision (to reduce the impact of the feature on the
> existing pagetable handling code) is that at any time we can split a
> hugepage into 512 regular pages and it has to be done with an
> operation that can't fail. This way the reliability of the swapping
> isn't decreased (no need to allocate memory when we are short on
> memory to swap) and it's trivial to plug a split_huge_page* one-liner
> where needed without polluting the VM. Over time we can teach
> mprotect, mremap and friends to handle pmd_trans_huge natively without
> calling split_huge_page*. The fact it can't fail isn't just for swap:
> if split_huge_page would return -ENOMEM (instead of the current void)
> we'd need to rollback the mprotect from the middle of it (ideally
> including undoing the split_vma) which would be a big change and in
> the very wrong direction (it'd likely be simpler not to call
> split_huge_page at all and to teach mprotect and friends to handle
> hugepages instead of rolling them back from the middle). In short the
> very value of split_huge_page is that it can't fail.
> 
> The collapsing and madvise(MADV_HUGEPAGE) part will remain separated
> and incremental and it'll just be a "harmless" addition later if this
> initial part is agreed upon. It should also be noted that locking-wise
> replacing regular pages with hugepages is going to be very easy if
> compared to what I'm doing below in split_huge_page, as it will only
> happen when page_count(page) matches page_mapcount(page) if we can
> take the PG_lock and mmap_sem in write mode. collapse_huge_page will
> be a "best effort" that (unlike split_huge_page) can fail at the
> slightest sign of trouble, and we can try again later. collapse_huge_page
> will be similar to how KSM works and madvise(MADV_HUGEPAGE) will
> work similarly to madvise(MADV_MERGEABLE).
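For anyone who wants to experiment from userland once the MADV_HUGEPAGE
madvise hook from later in this series is applied, a minimal sketch could
look like the code below. The MADV_HUGEPAGE fallback definition is an
assumption for old userspace headers; take the real value from the patched
kernel headers.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* assumed value, check the patched kernel headers */
#endif

int main(void)
{
	size_t len = 64UL * 1024 * 1024;
	/* plain anonymous memory, no hugetlbfs mount or reservation needed */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* hint that this range should be backed by transparent hugepages */
	if (madvise(p, len, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	memset(p, 0, len);	/* the faults are where huge pmds get installed */
	return 0;
}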
> 
> The default I like is that transparent hugepages are used at page fault time.
> This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The
> control knob can be set to three values "always", "madvise", "never" which
> mean respectively that hugepages are always used, or only inside
> madvise(MADV_HUGEPAGE) regions, or never used.
> /sys/kernel/mm/transparent_hugepage/defrag instead controls whether the hugepage
> allocation should defrag memory aggressively "always", only inside "madvise"
> regions, or "never".
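As a quick usage sketch of those two files (paths taken from the sysfs code
further down in this patch; the bracketed selections shown are only what the
current transparent_hugepage_flags default would suggest, so treat the exact
output as an assumption):

# cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never

The defrag file accepts the same three keywords.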
> 
> The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
> put_page (from get_user_page users that can't use mmu notifier like
> O_DIRECT) that runs against a __split_huge_page_refcount instead was a
> pain to serialize in a way that would always result in a coherent page
> count for both tail and head. I think my locking solution with a
> compound_lock taken only after the page_first is valid and is still a
> PageHead should be safe but it surely needs review from SMP race point
> of view. In short there is no existing way to serialize the
> O_DIRECT final put_page against split_huge_page_refcount so I had to
> invent a new one (O_DIRECT loses knowledge of the mapping status by
> the time gup_fast returns so...). And I didn't want to impact all
> gup/gup_fast users for now; maybe if we change the gup interface
> substantially we can avoid this locking. I admit I didn't think too
> much about it because changing the gup unpinning interface would be
> invasive.
> 
> If we ignored O_DIRECT we could stick to the existing compound
> refcounting code, by simply adding a
> get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu
> notifier user) would call it without FOLL_GET (and if FOLL_GET isn't
> set we'd just BUG_ON if nobody registered itself in the current task
> mmu notifier list yet). But O_DIRECT is fundamental for decent
> performance of virtualized I/O on fast storage so we can't avoid it to
> solve the race of put_page against split_huge_page_refcount to achieve
> a complete hugepage feature for KVM.
> 
> Swap and oom work fine (well, just like with regular pages ;). MMU
> notifier is handled transparently too, with the exception of the young
> bit on the pmd, which didn't have a range check, but I think KVM will be
> fine because the whole point of hugepages is that EPT/NPT will also
> use a huge pmd when they notice gup returns pages with PageCompound set,
> so they won't care about a range and there's just the pmd young bit to
> check in that case.
> 
> NOTE: in some cases, if the L2 cache is small, this may slow down and
> waste memory during COWs because 4M of memory are accessed in a single
> fault instead of 8k (the payoff is that after COW the program can run
> faster). So we might want to switch the copy_huge_page (and
> clear_huge_page too) to non-temporal stores. I also extensively
> researched ways to avoid this cache thrashing with a full prefault
> logic that would cow in 8k/16k/32k/64k up to 1M (I can send those
> patches that fully implemented prefault) but I concluded they're not
> worth it: they add huge additional complexity and they remove all tlb
> benefits until the full hugepage has been faulted in, to save a little bit of
> memory and some cache during app startup, but they still don't improve
> substantially the cache thrashing during startup if the prefault happens in >4k
> chunks.  One reason is that those 4k pte entries copied are still mapped on a
> perfectly cache-colored hugepage, so the thrashing is the worst one can generate
> in those copies (cow of 4k page copies aren't so well colored so they thrash
> less, but again this results in software running faster after the page fault).
> Those prefault patches allowed things like a pte where post-cow pages were
> local 4k regular anon pages and the not-yet-cowed pte entries were pointing in
> the middle of some hugepage mapped read-only. If it doesn't pay off
> substantially with today's hardware it will pay off even less in the future with
> larger l2 caches, and the prefault logic would bloat the VM a lot. For
> embedded use, transparent_hugepage can be disabled with sysfs or with
> the boot commandline parameter transparent_hugepage=0 (or
> transparent_hugepage=2 to restrict hugepages inside madvise regions), which will
> ensure not a single hugepage is allocated at boot time. It is simple enough to
> just disable transparent hugepage globally and let transparent hugepages be
> allocated selectively by applications in the MADV_HUGEPAGE regions (both at page
> fault time and, if enabled, through collapse_huge_page in the kernel
> daemon).
> 
> This patch supports only hugepages mapped in the pmd; archs that have
> smaller hugepages will not fit in this patch alone. Also some archs like power
> have certain tlb limits that prevent mixing different page sizes in the same
> regions, so they will not fit in this framework that requires "graceful
> fallback" to basic PAGE_SIZE in case of physical memory fragmentation.
> hugetlbfs remains a perfect fit for those because its software limits happen to
> match the hardware limits. hugetlbfs also remains a perfect fit for hugepage
> sizes like 1GByte that cannot be expected to be found unfragmented after a
> certain system uptime and that would be very expensive to defragment with
> relocation, so they require reservation. hugetlbfs is the "reservation way"; the
> point of transparent hugepages is not to have any reservation at all and to
> maximize the use of cache and hugepages at all times automatically.
> 
> Some performance result:
> 
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566023
> memset tlb miss 453854
> memset second tlb miss 453321
> random access tlb miss 41635
> random access second tlb miss 41658
> vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
> memset page fault 1566471
> memset tlb miss 453375
> memset second tlb miss 453320
> random access tlb miss 41636
> random access second tlb miss 41637
> vmx andrea # ./largepages3
> memset page fault 1566642
> memset tlb miss 453417
> memset second tlb miss 453313
> random access tlb miss 41630
> random access second tlb miss 41647
> vmx andrea # ./largepages3
> memset page fault 1566872
> memset tlb miss 453418
> memset second tlb miss 453315
> random access tlb miss 41618
> random access second tlb miss 41659
> vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
> vmx andrea # ./largepages3
> memset page fault 2182476
> memset tlb miss 460305
> memset second tlb miss 460179
> random access tlb miss 44483
> random access second tlb miss 44186
> vmx andrea # ./largepages3
> memset page fault 2182791
> memset tlb miss 460742
> memset second tlb miss 459962
> random access tlb miss 43981
> random access second tlb miss 43988
> 
> ============
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
> 
> #define SIZE (3UL*1024*1024*1024)
> 
> int main()
> {
> 	char *p = malloc(SIZE), *p2;
> 	struct timeval before, after;
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset page fault %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	memset(p, 0, SIZE);
> 	gettimeofday(&after, NULL);
> 	printf("memset second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	gettimeofday(&before, NULL);
> 	for (p2 = p; p2 < p+SIZE; p2 += 4096)
> 		*p2 = 0;
> 	gettimeofday(&after, NULL);
> 	printf("random access second tlb miss %Lu\n",
> 	       (after.tv_sec-before.tv_sec)*1000000UL +
> 	       after.tv_usec-before.tv_usec);
> 
> 	return 0;
> }
> ============
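For reference, the numbers in the output further above are elapsed
microseconds from the gettimeofday() pairs in this program, so lower is
better; the "memset page fault" row drops from roughly 2.18s to 1.57s for
the 3G buffer once hugepages are in use. Rebuilding it should only need
something like "gcc -O2 largepages3.c -o largepages3" (the source file name
is an assumption).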
> 

All that seems fine to me. There are nits in parts but they are simply not
worth calling out. In principle, I Agree With This :)

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> * * *
> adapt to mm_counter in -mm
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> The interface changed slightly.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> * * *
> transparent hugepage bootparam
> 
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Allow transparent_hugepage=always|never|madvise at boot.
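As wired up by the setup_transparent_hugepage() __setup() handler in the
patch below, this is simply an extra token on the kernel command line, for
example (the rest of this command line is made up):

	root=/dev/sda1 ro quiet transparent_hugepage=madvise

Unparseable values are ignored with a "cannot parse" warning. The numeric
transparent_hugepage=0/2 forms mentioned earlier in the changelog presumably
refer to an older revision of this interface.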
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -286,6 +286,11 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
>  	return pmd_set_flags(pmd, _PAGE_RW);
>  }
>  
> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_PRESENT);
> +}
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_PGTABLE_64_H */
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -108,6 +108,9 @@ struct vm_area_struct;
>  				 __GFP_HARDWALL | __GFP_HIGHMEM | \
>  				 __GFP_MOVABLE)
>  #define GFP_IOFS	(__GFP_IO | __GFP_FS)
> +#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> +			 __GFP_NO_KSWAPD)
>  
>  #ifdef CONFIG_NUMA
>  #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> new file mode 100644
> --- /dev/null
> +++ b/include/linux/huge_mm.h
> @@ -0,0 +1,126 @@
> +#ifndef _LINUX_HUGE_MM_H
> +#define _LINUX_HUGE_MM_H
> +
> +extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +				      struct vm_area_struct *vma,
> +				      unsigned long address, pmd_t *pmd,
> +				      unsigned int flags);
> +extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +			 struct vm_area_struct *vma);
> +extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       pmd_t orig_pmd);
> +extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
> +extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +					  unsigned long addr,
> +					  pmd_t *pmd,
> +					  unsigned int flags);
> +extern int zap_huge_pmd(struct mmu_gather *tlb,
> +			struct vm_area_struct *vma,
> +			pmd_t *pmd);
> +
> +enum transparent_hugepage_flag {
> +	TRANSPARENT_HUGEPAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +#ifdef CONFIG_DEBUG_VM
> +	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> +#endif
> +};
> +
> +enum page_check_address_pmd_flag {
> +	PAGE_CHECK_ADDRESS_PMD_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG,
> +	PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG,
> +};
> +extern pmd_t *page_check_address_pmd(struct page *page,
> +				     struct mm_struct *mm,
> +				     unsigned long address,
> +				     enum page_check_address_pmd_flag flag);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define HPAGE_PMD_SHIFT HPAGE_SHIFT
> +#define HPAGE_PMD_MASK HPAGE_MASK
> +#define HPAGE_PMD_SIZE HPAGE_SIZE
> +
> +#define transparent_hugepage_enabled(__vma)				\
> +	(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG) ||	\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) &&			\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#define transparent_hugepage_defrag(__vma)				\
> +	((transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) ||			\
> +	 (transparent_hugepage_flags &					\
> +	  (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) &&		\
> +	  (__vma)->vm_flags & VM_HUGEPAGE))
> +#ifdef CONFIG_DEBUG_VM
> +#define transparent_hugepage_debug_cow()				\
> +	(transparent_hugepage_flags &					\
> +	 (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG))
> +#else /* CONFIG_DEBUG_VM */
> +#define transparent_hugepage_debug_cow() 0
> +#endif /* CONFIG_DEBUG_VM */
> +
> +extern unsigned long transparent_hugepage_flags;
> +extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +			  pmd_t *dst_pmd, pmd_t *src_pmd,
> +			  struct vm_area_struct *vma,
> +			  unsigned long addr, unsigned long end);
> +extern int handle_pte_fault(struct mm_struct *mm,
> +			    struct vm_area_struct *vma, unsigned long address,
> +			    pte_t *pte, pmd_t *pmd, unsigned int flags);
> +extern int split_huge_page(struct page *page);
> +extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
> +#define split_huge_page_pmd(__mm, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		if (unlikely(pmd_trans_huge(*____pmd)))			\
> +			__split_huge_page_pmd(__mm, ____pmd);		\
> +	}  while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)				\
> +	do {								\
> +		pmd_t *____pmd = (__pmd);				\
> +		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> +		/*							\
> +		 * spin_unlock_wait() is just a loop in C and so the	\
> +		 * CPU can reorder anything around it.			\
> +		 */							\
> +		smp_mb();						\

Just a note as I see nothing wrong with this but that's a good spot. The
unlock isn't a memory barrier. Out of curiosity, does it really need to be
a full barrier or would a write barrier have been enough?

> +		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> +		       pmd_trans_huge(*____pmd));			\
> +	} while (0)
> +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> +#if HPAGE_PMD_ORDER > MAX_ORDER
> +#error "hugepages can't be allocated by the buddy allocator"
> +#endif
> +
> +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +static inline int PageTransHuge(struct page *page)
> +{
> +	VM_BUG_ON(PageTail(page));
> +	return PageHead(page);
> +}

gfp.h seems an odd place for these. Should the flags go in page-flags.h
and maybe put vma_address() in internal.h?

Not a biggie.

> +#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> +#define HPAGE_PMD_SHIFT ({ BUG(); 0; })
> +#define HPAGE_PMD_MASK ({ BUG(); 0; })
> +#define HPAGE_PMD_SIZE ({ BUG(); 0; })
> +
> +#define transparent_hugepage_enabled(__vma) 0
> +
> +#define transparent_hugepage_flags 0UL
> +static inline int split_huge_page(struct page *page)
> +{
> +	return 0;
> +}
> +#define split_huge_page_pmd(__mm, __pmd)	\
> +	do { } while (0)
> +#define wait_split_huge_page(__anon_vma, __pmd)	\
> +	do { } while (0)
> +#define PageTransHuge(page) 0
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +#endif /* _LINUX_HUGE_MM_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -111,6 +111,9 @@ extern unsigned int kobjsize(const void 
>  #define VM_SAO		0x20000000	/* Strong Access Ordering (powerpc) */
>  #define VM_PFN_AT_MMAP	0x40000000	/* PFNMAP vma that is fully mapped at mmap time */
>  #define VM_MERGEABLE	0x80000000	/* KSM may merge identical pages */
> +#if BITS_PER_LONG > 32
> +#define VM_HUGEPAGE	0x100000000UL	/* MADV_HUGEPAGE marked this vma */
> +#endif
>  
>  #ifndef VM_STACK_DEFAULT_FLAGS		/* arch can override this */
>  #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
> @@ -240,6 +243,7 @@ struct inode;
>   * files which need it (119 of them)
>   */
>  #include <linux/page-flags.h>
> +#include <linux/huge_mm.h>
>  
>  /*
>   * Methods to modify the page usage count.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
>  }
>  
>  static inline void
> +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> +		       struct list_head *head)
> +{
> +	list_add(&page->lru, head);
> +	__inc_zone_state(zone, NR_LRU_BASE + l);
> +	mem_cgroup_add_lru_list(page, l);
> +}
> +
> +static inline void
>  add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
>  {
> -	list_add(&page->lru, &zone->lru[l].list);
> -	__inc_zone_state(zone, NR_LRU_BASE + l);
> -	mem_cgroup_add_lru_list(page, l);
> +	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
>  }
>  

Do these really need to be in a public header or can they move to
mm/swap.c?

>  static inline void
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -208,6 +208,8 @@ extern unsigned int nr_free_pagecache_pa
>  /* linux/mm/swap.c */
>  extern void __lru_cache_add(struct page *, enum lru_list lru);
>  extern void lru_cache_add_lru(struct page *, enum lru_list lru);
> +extern void lru_add_page_tail(struct zone* zone,
> +			      struct page *page, struct page *page_tail);
>  extern void activate_page(struct page *);
>  extern void mark_page_accessed(struct page *);
>  extern void lru_add_drain(void);
> diff --git a/mm/Makefile b/mm/Makefile
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -42,3 +42,4 @@ obj-$(CONFIG_MEMORY_FAILURE) += memory-f
>  obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> new file mode 100644
> --- /dev/null
> +++ b/mm/huge_memory.c
> @@ -0,0 +1,899 @@
> +/*
> + *  Copyright (C) 2009  Red Hat, Inc.
> + *
> + *  This work is licensed under the terms of the GNU GPL, version 2. See
> + *  the COPYING file in the top-level directory.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/highmem.h>
> +#include <linux/hugetlb.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/rmap.h>
> +#include <linux/swap.h>
> +#include <asm/tlb.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +unsigned long transparent_hugepage_flags __read_mostly =
> +	(1<<TRANSPARENT_HUGEPAGE_FLAG);
> +
> +#ifdef CONFIG_SYSFS
> +static ssize_t double_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag enabled,
> +				enum transparent_hugepage_flag req_madv)
> +{
> +	if (test_bit(enabled, &transparent_hugepage_flags)) {
> +		VM_BUG_ON(test_bit(req_madv, &transparent_hugepage_flags));
> +		return sprintf(buf, "[always] madvise never\n");
> +	} else if (test_bit(req_madv, &transparent_hugepage_flags))
> +		return sprintf(buf, "always [madvise] never\n");
> +	else
> +		return sprintf(buf, "always madvise [never]\n");
> +}
> +static ssize_t double_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag enabled,
> +				 enum transparent_hugepage_flag req_madv)
> +{
> +	if (!memcmp("always", buf,
> +		    min(sizeof("always")-1, count))) {
> +		set_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("madvise", buf,
> +			   min(sizeof("madvise")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		set_bit(req_madv, &transparent_hugepage_flags);
> +	} else if (!memcmp("never", buf,
> +			   min(sizeof("never")-1, count))) {
> +		clear_bit(enabled, &transparent_hugepage_flags);
> +		clear_bit(req_madv, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +static ssize_t enabled_show(struct kobject *kobj,
> +			    struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_FLAG,
> +				TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static ssize_t enabled_store(struct kobject *kobj,
> +			     struct kobj_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_FLAG,
> +				 TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute enabled_attr =
> +	__ATTR(enabled, 0644, enabled_show, enabled_store);
> +
> +static ssize_t single_flag_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf,
> +				enum transparent_hugepage_flag flag)
> +{
> +	if (test_bit(flag, &transparent_hugepage_flags))
> +		return sprintf(buf, "[yes] no\n");
> +	else
> +		return sprintf(buf, "yes [no]\n");
> +}
> +static ssize_t single_flag_store(struct kobject *kobj,
> +				 struct kobj_attribute *attr,
> +				 const char *buf, size_t count,
> +				 enum transparent_hugepage_flag flag)
> +{
> +	if (!memcmp("yes", buf,
> +		    min(sizeof("yes")-1, count))) {
> +		set_bit(flag, &transparent_hugepage_flags);
> +	} else if (!memcmp("no", buf,
> +			   min(sizeof("no")-1, count))) {
> +		clear_bit(flag, &transparent_hugepage_flags);
> +	} else
> +		return -EINVAL;
> +
> +	return count;
> +}
> +
> +/*
> + * Currently defrag only disables __GFP_NOWAIT for allocation. A blind
> + * __GFP_REPEAT is too aggressive, it's never worth swapping tons of
> + * memory just to allocate one more hugepage.
> + */
> +static ssize_t defrag_show(struct kobject *kobj,
> +			   struct kobj_attribute *attr, char *buf)
> +{
> +	return double_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static ssize_t defrag_store(struct kobject *kobj,
> +			    struct kobj_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	return double_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> +				 TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG);
> +}
> +static struct kobj_attribute defrag_attr =
> +	__ATTR(defrag, 0644, defrag_show, defrag_store);
> +
> +#ifdef CONFIG_DEBUG_VM
> +static ssize_t debug_cow_show(struct kobject *kobj,
> +				struct kobj_attribute *attr, char *buf)
> +{
> +	return single_flag_show(kobj, attr, buf,
> +				TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static ssize_t debug_cow_store(struct kobject *kobj,
> +			       struct kobj_attribute *attr,
> +			       const char *buf, size_t count)
> +{
> +	return single_flag_store(kobj, attr, buf, count,
> +				 TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG);
> +}
> +static struct kobj_attribute debug_cow_attr =
> +	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
> +#endif /* CONFIG_DEBUG_VM */
> +
> +static struct attribute *hugepage_attr[] = {
> +	&enabled_attr.attr,
> +	&defrag_attr.attr,
> +#ifdef CONFIG_DEBUG_VM
> +	&debug_cow_attr.attr,
> +#endif
> +	NULL,
> +};
> +
> +static struct attribute_group hugepage_attr_group = {
> +	.attrs = hugepage_attr,
> +	.name = "transparent_hugepage",
> +};
> +#endif /* CONFIG_SYSFS */
> +
> +static int __init hugepage_init(void)
> +{
> +#ifdef CONFIG_SYSFS
> +	int err;
> +
> +	err = sysfs_create_group(mm_kobj, &hugepage_attr_group);
> +	if (err)
> +		printk(KERN_ERR "hugepage: register sysfs failed\n");
> +#endif
> +	return 0;
> +}
> +module_init(hugepage_init)
> +
> +static int __init setup_transparent_hugepage(char *str)
> +{
> +	int ret = 0;
> +	if (!str)
> +		goto out;
> +	if (!strcmp(str, "always")) {
> +		set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			&transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "madvise")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		set_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			&transparent_hugepage_flags);
> +		ret = 1;
> +	} else if (!strcmp(str, "never")) {
> +		clear_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +			  &transparent_hugepage_flags);
> +		ret = 1;
> +	}
> +out:
> +	if (!ret)
> +		printk(KERN_WARNING
> +		       "transparent_hugepage= cannot parse, ignored\n");
> +	return ret;
> +}
> +__setup("transparent_hugepage=", setup_transparent_hugepage);
> +
> +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> +				 struct mm_struct *mm)
> +{
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +

assert_spin_locked() ?
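i.e., presumably something like the one-liner below in place of the
VM_BUG_ON(spin_can_lock()) check above (only a sketch of the suggestion, not
part of the posted patch; one difference is that assert_spin_locked() checks
unconditionally, while VM_BUG_ON() only checks with CONFIG_DEBUG_VM):

	assert_spin_locked(&mm->page_table_lock);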

> +	/* FIFO */
> +	if (!mm->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &mm->pmd_huge_pte->lru);
> +	mm->pmd_huge_pte = pgtable;
> +}
> +
> +static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pmd = pmd_mkwrite(pmd);
> +	return pmd;
> +}
> +
> +static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long haddr, pmd_t *pmd,
> +					struct page *page)
> +{
> +	int ret = 0;
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(!PageCompound(page));
> +	pgtable = pte_alloc_one(mm, haddr);
> +	if (unlikely(!pgtable)) {
> +		put_page(page);
> +		return VM_FAULT_OOM;
> +	}
> +
> +	clear_huge_page(page, haddr, HPAGE_PMD_NR);
> +	__SetPageUptodate(page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_none(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		put_page(page);
> +		pte_free(mm, pgtable);
> +	} else {
> +		pmd_t entry;
> +		entry = mk_pmd(page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		/*
> +		 * The spinlocking to take the lru_lock inside
> +		 * page_add_new_anon_rmap() acts as a full memory
> +		 * barrier to be sure clear_huge_page writes become
> +		 * visible after the set_pmd_at() write.
> +		 */
> +		page_add_new_anon_rmap(page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		prepare_pmd_huge_pte(pgtable, mm);
> +		add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +		spin_unlock(&mm->page_table_lock);
> +	}
> +
> +	return ret;
> +}
> +
> +static inline struct page *alloc_hugepage(int defrag)
> +{
> +	return alloc_pages(GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT),
> +			   HPAGE_PMD_ORDER);
> +}
> +
> +int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       unsigned long address, pmd_t *pmd,
> +			       unsigned int flags)
> +{
> +	struct page *page;
> +	unsigned long haddr = address & HPAGE_PMD_MASK;
> +	pte_t *pte;
> +
> +	if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
> +		if (unlikely(anon_vma_prepare(vma)))
> +			return VM_FAULT_OOM;
> +		page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +		if (unlikely(!page))
> +			goto out;
> +
> +		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
> +	}
> +out:
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
> +		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
> +	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +}
> +
> +int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> +		  struct vm_area_struct *vma)
> +{
> +	struct page *src_page;
> +	pmd_t pmd;
> +	pgtable_t pgtable;
> +	int ret;
> +
> +	ret = -ENOMEM;
> +	pgtable = pte_alloc_one(dst_mm, addr);
> +	if (unlikely(!pgtable))
> +		goto out;
> +
> +	spin_lock(&dst_mm->page_table_lock);
> +	spin_lock_nested(&src_mm->page_table_lock, SINGLE_DEPTH_NESTING);
> +
> +	ret = -EAGAIN;
> +	pmd = *src_pmd;
> +	if (unlikely(!pmd_trans_huge(pmd))) {
> +		pte_free(dst_mm, pgtable);
> +		goto out_unlock;
> +	}
> +	if (unlikely(pmd_trans_splitting(pmd))) {
> +		/* split huge page running from under us */
> +		spin_unlock(&src_mm->page_table_lock);
> +		spin_unlock(&dst_mm->page_table_lock);
> +		pte_free(dst_mm, pgtable);
> +
> +		wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
> +		goto out;
> +	}
> +	src_page = pmd_page(pmd);
> +	VM_BUG_ON(!PageHead(src_page));
> +	get_page(src_page);
> +	page_dup_rmap(src_page);
> +	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> +
> +	pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +	pmd = pmd_mkold(pmd_wrprotect(pmd));
> +	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> +	prepare_pmd_huge_pte(pgtable, dst_mm);
> +
> +	ret = 0;
> +out_unlock:
> +	spin_unlock(&src_mm->page_table_lock);
> +	spin_unlock(&dst_mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +/* no "address" argument so destroys page coloring of some arch */
> +pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
> +{
> +	pgtable_t pgtable;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	/* FIFO */
> +	pgtable = mm->pmd_huge_pte;
> +	if (list_empty(&pgtable->lru))
> +		mm->pmd_huge_pte = NULL;
> +	else {
> +		mm->pmd_huge_pte = list_entry(pgtable->lru.next,
> +					      struct page, lru);
> +		list_del(&pgtable->lru);
> +	}
> +	return pgtable;
> +}
> +
> +static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
> +					struct vm_area_struct *vma,
> +					unsigned long address,
> +					pmd_t *pmd, pmd_t orig_pmd,
> +					struct page *page,
> +					unsigned long haddr)
> +{
> +	pgtable_t pgtable;
> +	pmd_t _pmd;
> +	int ret = 0, i;
> +	struct page **pages;
> +
> +	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
> +			GFP_KERNEL);
> +	if (unlikely(!pages)) {
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> +					  vma, address);
> +		if (unlikely(!pages[i])) {
> +			while (--i >= 0)
> +				put_page(pages[i]);
> +			kfree(pages);
> +			ret |= VM_FAULT_OOM;
> +			goto out;
> +		}
> +	}
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		copy_user_highpage(pages[i], page + i,
> +				   haddr + PAGE_SHIFT*i, vma);
> +		__SetPageUptodate(pages[i]);
> +		cond_resched();
> +	}
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_free_pages;
> +	VM_BUG_ON(!PageHead(page));
> +
> +	pmdp_clear_flush_notify(vma, haddr, pmd);
> +	/* leave pmd empty until pte is filled */
> +
> +	pgtable = get_pmd_huge_pte(mm);
> +	pmd_populate(mm, &_pmd, pgtable);
> +
> +	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> +		pte_t *pte, entry;
> +		entry = mk_pte(pages[i], vma->vm_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		page_add_new_anon_rmap(pages[i], vma, haddr);
> +		pte = pte_offset_map(&_pmd, haddr);
> +		VM_BUG_ON(!pte_none(*pte));
> +		set_pte_at(mm, haddr, pte, entry);
> +		pte_unmap(pte);
> +	}
> +	kfree(pages);
> +
> +	mm->nr_ptes++;
> +	smp_wmb(); /* make pte visible before pmd */
> +	pmd_populate(mm, pmd, pgtable);
> +	page_remove_rmap(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	ret |= VM_FAULT_WRITE;
> +	put_page(page);
> +
> +out:
> +	return ret;
> +
> +out_free_pages:
> +	spin_unlock(&mm->page_table_lock);
> +	for (i = 0; i < HPAGE_PMD_NR; i++)
> +		put_page(pages[i]);
> +	kfree(pages);
> +	goto out;
> +}
> +
> +int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +			unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
> +{
> +	int ret = 0;
> +	struct page *page, *new_page;
> +	unsigned long haddr;
> +
> +	VM_BUG_ON(!vma->anon_vma);
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		goto out_unlock;
> +
> +	page = pmd_page(orig_pmd);
> +	VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> +	haddr = address & HPAGE_PMD_MASK;
> +	if (page_mapcount(page) == 1) {
> +		pmd_t entry;
> +		entry = pmd_mkyoung(orig_pmd);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		if (pmdp_set_access_flags(vma, haddr, pmd, entry,  1))
> +			update_mmu_cache(vma, address, entry);
> +		ret |= VM_FAULT_WRITE;
> +		goto out_unlock;
> +	}
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	if (transparent_hugepage_enabled(vma) &&
> +	    !transparent_hugepage_debug_cow())
> +		new_page = alloc_hugepage(transparent_hugepage_defrag(vma));
> +	else
> +		new_page = NULL;
> +
> +	if (unlikely(!new_page)) {
> +		ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
> +						   pmd, orig_pmd, page, haddr);
> +		put_page(page);
> +		goto out;
> +	}
> +
> +	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
> +	__SetPageUptodate(new_page);
> +
> +	spin_lock(&mm->page_table_lock);
> +	put_page(page);
> +	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +		put_page(new_page);
> +	else {
> +		pmd_t entry;
> +		VM_BUG_ON(!PageHead(page));
> +		entry = mk_pmd(new_page, vma->vm_page_prot);
> +		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +		entry = pmd_mkhuge(entry);
> +		pmdp_clear_flush_notify(vma, haddr, pmd);
> +		page_add_new_anon_rmap(new_page, vma, haddr);
> +		set_pmd_at(mm, haddr, pmd, entry);
> +		update_mmu_cache(vma, address, entry);
> +		page_remove_rmap(page);
> +		put_page(page);
> +		ret |= VM_FAULT_WRITE;
> +	}
> +out_unlock:
> +	spin_unlock(&mm->page_table_lock);
> +out:
> +	return ret;
> +}
> +
> +struct page *follow_trans_huge_pmd(struct mm_struct *mm,
> +				   unsigned long addr,
> +				   pmd_t *pmd,
> +				   unsigned int flags)
> +{
> +	struct page *page = NULL;
> +
> +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> +
> +	if (flags & FOLL_WRITE && !pmd_write(*pmd))
> +		goto out;
> +
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!PageHead(page));
> +	if (flags & FOLL_TOUCH) {
> +		pmd_t _pmd;
> +		/*
> +		 * We should set the dirty bit only for FOLL_WRITE but
> +		 * for now the dirty bit in the pmd is meaningless.
> +		 * And if the dirty bit will become meaningful and
> +		 * we'll only set it with FOLL_WRITE, an atomic
> +		 * set_bit will be required on the pmd to set the
> +		 * young bit, instead of the current set_pmd_at.
> +		 */
> +		_pmd = pmd_mkyoung(pmd_mkdirty(*pmd));
> +		set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmd, _pmd);
> +	}
> +	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> +	VM_BUG_ON(!PageCompound(page));
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +
> +out:
> +	return page;
> +}
> +
> +int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		 pmd_t *pmd)
> +{
> +	int ret = 0;
> +
> +	spin_lock(&tlb->mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*pmd))) {
> +		if (unlikely(pmd_trans_splitting(*pmd))) {
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma,
> +					     pmd);
> +		} else {
> +			struct page *page;
> +			pgtable_t pgtable;
> +			pgtable = get_pmd_huge_pte(tlb->mm);
> +			page = pmd_page(*pmd);
> +			pmd_clear(pmd);
> +			page_remove_rmap(page);
> +			VM_BUG_ON(page_mapcount(page) < 0);
> +			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			VM_BUG_ON(!PageHead(page));
> +			spin_unlock(&tlb->mm->page_table_lock);
> +			tlb_remove_page(tlb, page);
> +			pte_free(tlb->mm, pgtable);
> +			ret = 1;
> +		}
> +	} else
> +		spin_unlock(&tlb->mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +pmd_t *page_check_address_pmd(struct page *page,
> +			      struct mm_struct *mm,
> +			      unsigned long address,
> +			      enum page_check_address_pmd_flag flag)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd, *ret = NULL;
> +
> +	if (address & ~HPAGE_PMD_MASK)
> +		goto out;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd))
> +		goto out;
> +	if (pmd_page(*pmd) != page)
> +		goto out;
> +	VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG &&
> +		  pmd_trans_splitting(*pmd));
> +	if (pmd_trans_huge(*pmd)) {
> +		VM_BUG_ON(flag == PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG &&
> +			  !pmd_trans_splitting(*pmd));
> +		ret = pmd;
> +	}
> +out:
> +	return ret;
> +}
> +
> +static int __split_huge_page_splitting(struct page *page,
> +				       struct vm_area_struct *vma,
> +				       unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd;
> +	int ret = 0;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
> +	if (pmd) {
> +		/*
> +		 * We can't temporarily set the pmd to null in order
> +		 * to split it, the pmd must remain marked huge at all
> +		 * times or the VM won't take the pmd_trans_huge paths
> +		 * and it won't wait on the anon_vma->root->lock to
> +		 * serialize against split_huge_page*.
> +		 */
> +		pmdp_splitting_flush_notify(vma, address, pmd);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +static void __split_huge_page_refcount(struct page *page)
> +{
> +	int i;
> +	unsigned long head_index = page->index;
> +	struct zone *zone = page_zone(page);
> +
> +	/* prevent PageLRU to go away from under us, and freeze lru stats */
> +	spin_lock_irq(&zone->lru_lock);
> +	compound_lock(page);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +
> +		/* tail_page->_count cannot change */
> +		atomic_sub(atomic_read(&page_tail->_count), &page->_count);
> +		BUG_ON(page_count(page) <= 0);
> +		atomic_add(page_mapcount(page) + 1, &page_tail->_count);
> +		BUG_ON(atomic_read(&page_tail->_count) <= 0);
> +
> +		/* after clearing PageTail the gup refcount can be released */
> +		smp_mb();
> +
> +		page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> +		page_tail->flags |= (page->flags &
> +				     ((1L << PG_referenced) |
> +				      (1L << PG_swapbacked) |
> +				      (1L << PG_mlocked) |
> +				      (1L << PG_uptodate)));
> +		page_tail->flags |= (1L << PG_dirty);
> +
> +		/*
> +		 * 1) clear PageTail before overwriting first_page
> +		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> +		 */
> +		smp_wmb();
> +
> +		/*
> +		 * __split_huge_page_splitting() already set the
> +		 * splitting bit in all pmd that could map this
> +		 * hugepage, that will ensure no CPU can alter the
> +		 * mapcount on the head page. The mapcount is only
> +		 * accounted in the head page and it has to be
> +		 * transferred to all tail pages in the below code. So
> +		 * for this code to be safe, during the split the mapcount
> +		 * can't change. But that doesn't mean userland can't
> +		 * keep changing and reading the page contents while
> +		 * we transfer the mapcount, so the pmd splitting
> +		 * status is achieved setting a reserved bit in the
> +		 * pmd, not by clearing the present bit.
> +		*/
> +		BUG_ON(page_mapcount(page_tail));
> +		page_tail->_mapcount = page->_mapcount;
> +
> +		BUG_ON(page_tail->mapping);
> +		page_tail->mapping = page->mapping;
> +
> +		page_tail->index = ++head_index;
> +
> +		BUG_ON(!PageAnon(page_tail));
> +		BUG_ON(!PageUptodate(page_tail));
> +		BUG_ON(!PageDirty(page_tail));
> +		BUG_ON(!PageSwapBacked(page_tail));
> +
> +		lru_add_page_tail(zone, page, page_tail);
> +	}
> +
> +	ClearPageCompound(page);
> +	compound_unlock(page);
> +	spin_unlock_irq(&zone->lru_lock);
> +
> +	for (i = 1; i < HPAGE_PMD_NR; i++) {
> +		struct page *page_tail = page + i;
> +		BUG_ON(page_count(page_tail) <= 0);
> +		/*
> +		 * Tail pages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		put_page(page_tail);
> +	}
> +
> +	/*
> +	 * Only the head page (now become a regular page) is required
> +	 * to be pinned by the caller.
> +	 */
> +	BUG_ON(page_count(page) <= 0);
> +}
> +
> +static int __split_huge_page_map(struct page *page,
> +				 struct vm_area_struct *vma,
> +				 unsigned long address)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	pmd_t *pmd, _pmd;
> +	int ret = 0, i;
> +	pgtable_t pgtable;
> +	unsigned long haddr;
> +
> +	spin_lock(&mm->page_table_lock);
> +	pmd = page_check_address_pmd(page, mm, address,
> +				     PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
> +	if (pmd) {
> +		pgtable = get_pmd_huge_pte(mm);
> +		pmd_populate(mm, &_pmd, pgtable);
> +
> +		for (i = 0, haddr = address; i < HPAGE_PMD_NR;
> +		     i++, haddr += PAGE_SIZE) {
> +			pte_t *pte, entry;
> +			BUG_ON(PageCompound(page+i));
> +			entry = mk_pte(page + i, vma->vm_page_prot);
> +			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +			if (!pmd_write(*pmd))
> +				entry = pte_wrprotect(entry);
> +			else
> +				BUG_ON(page_mapcount(page) != 1);
> +			if (!pmd_young(*pmd))
> +				entry = pte_mkold(entry);
> +			pte = pte_offset_map(&_pmd, haddr);
> +			BUG_ON(!pte_none(*pte));
> +			set_pte_at(mm, haddr, pte, entry);
> +			pte_unmap(pte);
> +		}
> +
> +		mm->nr_ptes++;
> +		smp_wmb(); /* make pte visible before pmd */
> +		/*
> +		 * Up to this point the pmd is present and huge and
> +		 * userland has the whole access to the hugepage
> +		 * during the split (which happens in place). If we
> +		 * overwrite the pmd with the not-huge version
> +		 * pointing to the pte here (which of course we could
> +		 * if all CPUs were bug free), userland could trigger
> +		 * a small page size TLB miss on the small sized TLB
> +		 * while the hugepage TLB entry is still established
> +		 * in the huge TLB. Some CPUs don't like that. See
> +		 * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
> +		 * Erratum 383 on page 93. Intel should be safe but it
> +		 * also warns that it's only safe if the permission
> +		 * and cache attributes of the two entries loaded in
> +		 * the two TLBs are identical (which should be the case
> +		 * here). But it is generally safer to never allow
> +		 * small and huge TLB entries for the same virtual
> +		 * address to be loaded simultaneously. So instead of
> +		 * doing "pmd_populate(); flush_tlb_range();" we first
> +		 * mark the current pmd notpresent (atomically because
> +		 * here the pmd_trans_huge and pmd_trans_splitting
> +		 * must remain set at all times on the pmd until the
> +		 * split is complete for this pmd), then we flush the
> +		 * SMP TLB and finally we write the non-huge version
> +		 * of the pmd entry with pmd_populate.
> +		 */
> +		set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
> +		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> +		pmd_populate(mm, pmd, pgtable);
> +		ret = 1;
> +	}
> +	spin_unlock(&mm->page_table_lock);
> +
> +	return ret;
> +}
> +
> +/* must be called with anon_vma->root->lock held */
> +static void __split_huge_page(struct page *page,
> +			      struct anon_vma *anon_vma)
> +{
> +	int mapcount, mapcount2;
> +	struct anon_vma_chain *avc;
> +
> +	BUG_ON(!PageHead(page));
> +	BUG_ON(PageTail(page));
> +
> +	mapcount = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount += __split_huge_page_splitting(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != page_mapcount(page));
> +
> +	__split_huge_page_refcount(page);
> +
> +	mapcount2 = 0;
> +	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
> +		struct vm_area_struct *vma = avc->vma;
> +		unsigned long addr = vma_address(page, vma);
> +		if (addr == -EFAULT)
> +			continue;
> +		mapcount2 += __split_huge_page_map(page, vma, addr);
> +	}
> +	BUG_ON(mapcount != mapcount2);
> +}
> +
> +int split_huge_page(struct page *page)
> +{
> +	struct anon_vma *anon_vma;
> +	int ret = 1;
> +
> +	BUG_ON(!PageAnon(page));
> +	anon_vma = page_lock_anon_vma(page);
> +	if (!anon_vma)
> +		goto out;
> +	ret = 0;
> +	if (!PageCompound(page))
> +		goto out_unlock;
> +
> +	BUG_ON(!PageSwapBacked(page));
> +	__split_huge_page(page, anon_vma);
> +
> +	BUG_ON(PageCompound(page));
> +out_unlock:
> +	page_unlock_anon_vma(anon_vma);
> +out:
> +	return ret;
> +}
> +
> +void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
> +{
> +	struct page *page;
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (unlikely(!pmd_trans_huge(*pmd))) {
> +		spin_unlock(&mm->page_table_lock);
> +		return;
> +	}
> +	page = pmd_page(*pmd);
> +	VM_BUG_ON(!page_count(page));
> +	get_page(page);
> +	spin_unlock(&mm->page_table_lock);
> +
> +	split_huge_page(page);
> +
> +	put_page(page);
> +	BUG_ON(pmd_trans_huge(*pmd));
> +}
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -726,9 +726,9 @@ out_set_pte:
>  	return 0;
>  }
>  
> -static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> -		pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> -		unsigned long addr, unsigned long end)
> +int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> +		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
> +		   unsigned long addr, unsigned long end)
>  {
>  	pte_t *orig_src_pte, *orig_dst_pte;
>  	pte_t *src_pte, *dst_pte;
> @@ -802,6 +802,16 @@ static inline int copy_pmd_range(struct 
>  	src_pmd = pmd_offset(src_pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*src_pmd)) {
> +			int err;
> +			err = copy_huge_pmd(dst_mm, src_mm,
> +					    dst_pmd, src_pmd, addr, vma);
> +			if (err == -ENOMEM)
> +				return -ENOMEM;
> +			if (!err)
> +				continue;
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(src_pmd))
>  			continue;
>  		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
> @@ -1004,6 +1014,15 @@ static inline unsigned long zap_pmd_rang
>  	pmd = pmd_offset(pud, addr);
>  	do {
>  		next = pmd_addr_end(addr, end);
> +		if (pmd_trans_huge(*pmd)) {
> +			if (next-addr != HPAGE_PMD_SIZE)
> +				split_huge_page_pmd(vma->vm_mm, pmd);
> +			else if (zap_huge_pmd(tlb, vma, pmd)) {
> +				(*zap_work)--;
> +				continue;
> +			}
> +			/* fall through */
> +		}
>  		if (pmd_none_or_clear_bad(pmd)) {
>  			(*zap_work)--;
>  			continue;
> @@ -1280,11 +1299,27 @@ struct page *follow_page(struct vm_area_
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		goto no_page_table;
> -	if (pmd_huge(*pmd)) {
> +	if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
>  		BUG_ON(flags & FOLL_GET);
>  		page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
>  		goto out;
>  	}
> +	if (pmd_trans_huge(*pmd)) {
> +		spin_lock(&mm->page_table_lock);
> +		if (likely(pmd_trans_huge(*pmd))) {
> +			if (unlikely(pmd_trans_splitting(*pmd))) {
> +				spin_unlock(&mm->page_table_lock);
> +				wait_split_huge_page(vma->anon_vma, pmd);
> +			} else {
> +				page = follow_trans_huge_pmd(mm, address,
> +							     pmd, flags);
> +				spin_unlock(&mm->page_table_lock);
> +				goto out;
> +			}
> +		} else
> +			spin_unlock(&mm->page_table_lock);
> +		/* fall through */
> +	}
>  	if (unlikely(pmd_bad(*pmd)))
>  		goto no_page_table;
>  
> @@ -3141,9 +3176,9 @@ static int do_nonlinear_fault(struct mm_
>   * but allow concurrent faults), and pte mapped but not yet locked.
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
> -static inline int handle_pte_fault(struct mm_struct *mm,
> -		struct vm_area_struct *vma, unsigned long address,
> -		pte_t *pte, pmd_t *pmd, unsigned int flags)
> +int handle_pte_fault(struct mm_struct *mm,
> +		     struct vm_area_struct *vma, unsigned long address,
> +		     pte_t *pte, pmd_t *pmd, unsigned int flags)
>  {
>  	pte_t entry;
>  	spinlock_t *ptl;
> @@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
>  	pmd = pmd_alloc(mm, pud, address);
>  	if (!pmd)
>  		return VM_FAULT_OOM;
> -	pte = pte_alloc_map(mm, vma, pmd, address);
> -	if (!pte)
> +	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> +		if (!vma->vm_ops)
> +			return do_huge_pmd_anonymous_page(mm, vma, address,
> +							  pmd, flags);
> +	} else {
> +		pmd_t orig_pmd = *pmd;
> +		barrier();

What is this barrier for?

> +		if (pmd_trans_huge(orig_pmd)) {
> +			if (flags & FAULT_FLAG_WRITE &&
> +			    !pmd_write(orig_pmd) &&
> +			    !pmd_trans_splitting(orig_pmd))
> +				return do_huge_pmd_wp_page(mm, vma, address,
> +							   pmd, orig_pmd);
> +			return 0;
> +		}
> +	}
> +
> +	/*
> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
> +	 * run pte_offset_map on the pmd, if an huge pmd could
> +	 * materialize from under us from a different thread.
> +	 */
> +	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
>  		return VM_FAULT_OOM;
> +	/* if an huge pmd materialized from under us just retry later */
> +	if (unlikely(pmd_trans_huge(*pmd)))
> +		return 0;
> +	/*
> +	 * A regular pmd is established and it can't morph into a huge pmd
> +	 * from under us anymore at this point because we hold the mmap_sem
> +	 * read mode and khugepaged takes it in write mode. So now it's
> +	 * safe to run pte_offset_map().
> +	 */
> +	pte = pte_offset_map(pmd, address);
>  
>  	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -360,7 +360,7 @@ void page_unlock_anon_vma(struct anon_vm
>   * Returns virtual address or -EFAULT if page's index/offset is not
>   * within the range mapped the @vma.
>   */
> -static inline unsigned long
> +inline unsigned long
>  vma_address(struct page *page, struct vm_area_struct *vma)
>  {
>  	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
> @@ -435,6 +435,8 @@ pte_t *__page_check_address(struct page 
>  	pmd = pmd_offset(pud, address);
>  	if (!pmd_present(*pmd))
>  		return NULL;
> +	if (pmd_trans_huge(*pmd))
> +		return NULL;
>  
>  	pte = pte_offset_map(pmd, address);
>  	/* Make a quick check before getting the lock */
> @@ -489,35 +491,17 @@ int page_referenced_one(struct page *pag
>  			unsigned long *vm_flags)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> -	pte_t *pte;
> -	spinlock_t *ptl;
>  	int referenced = 0;
>  
> -	pte = page_check_address(page, mm, address, &ptl, 0);
> -	if (!pte)
> -		goto out;
> -
>  	/*
>  	 * Don't want to elevate referenced for mlocked page that gets this far,
>  	 * in order that it progresses to try_to_unmap and is moved to the
>  	 * unevictable list.
>  	 */
>  	if (vma->vm_flags & VM_LOCKED) {
> -		*mapcount = 1;	/* break early from loop */
> +		*mapcount = 0;	/* break early from loop */
>  		*vm_flags |= VM_LOCKED;
> -		goto out_unmap;
> -	}
> -
> -	if (ptep_clear_flush_young_notify(vma, address, pte)) {
> -		/*
> -		 * Don't treat a reference through a sequentially read
> -		 * mapping as such.  If the page has been used in
> -		 * another mapping, we will catch it; if this other
> -		 * mapping is already gone, the unmap path will have
> -		 * set PG_referenced or activated the page.
> -		 */
> -		if (likely(!VM_SequentialReadHint(vma)))
> -			referenced++;
> +		goto out;
>  	}
>  
>  	/* Pretend the page is referenced if the task has the
> @@ -526,9 +510,39 @@ int page_referenced_one(struct page *pag
>  			rwsem_is_locked(&mm->mmap_sem))
>  		referenced++;
>  
> -out_unmap:
> +	if (unlikely(PageTransHuge(page))) {
> +		pmd_t *pmd;
> +
> +		spin_lock(&mm->page_table_lock);
> +		pmd = page_check_address_pmd(page, mm, address,
> +					     PAGE_CHECK_ADDRESS_PMD_FLAG);
> +		if (pmd && !pmd_trans_splitting(*pmd) &&
> +		    pmdp_clear_flush_young_notify(vma, address, pmd))
> +			referenced++;
> +		spin_unlock(&mm->page_table_lock);
> +	} else {
> +		pte_t *pte;
> +		spinlock_t *ptl;
> +
> +		pte = page_check_address(page, mm, address, &ptl, 0);
> +		if (!pte)
> +			goto out;
> +
> +		if (ptep_clear_flush_young_notify(vma, address, pte)) {
> +			/*
> +			 * Don't treat a reference through a sequentially read
> +			 * mapping as such.  If the page has been used in
> +			 * another mapping, we will catch it; if this other
> +			 * mapping is already gone, the unmap path will have
> +			 * set PG_referenced or activated the page.
> +			 */
> +			if (likely(!VM_SequentialReadHint(vma)))
> +				referenced++;
> +		}
> +		pte_unmap_unlock(pte, ptl);
> +	}
> +
>  	(*mapcount)--;
> -	pte_unmap_unlock(pte, ptl);
>  
>  	if (referenced)
>  		*vm_flags |= vma->vm_flags;
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -465,6 +465,43 @@ void __pagevec_release(struct pagevec *p
>  
>  EXPORT_SYMBOL(__pagevec_release);
>  
> +/* used by __split_huge_page_refcount() */
> +void lru_add_page_tail(struct zone* zone,
> +		       struct page *page, struct page *page_tail)
> +{
> +	int active;
> +	enum lru_list lru;
> +	const int file = 0;
> +	struct list_head *head;
> +
> +	VM_BUG_ON(!PageHead(page));
> +	VM_BUG_ON(PageCompound(page_tail));
> +	VM_BUG_ON(PageLRU(page_tail));
> +	VM_BUG_ON(!spin_is_locked(&zone->lru_lock));
> +
> +	SetPageLRU(page_tail);
> +
> +	if (page_evictable(page_tail, NULL)) {
> +		if (PageActive(page)) {
> +			SetPageActive(page_tail);
> +			active = 1;
> +			lru = LRU_ACTIVE_ANON;
> +		} else {
> +			active = 0;
> +			lru = LRU_INACTIVE_ANON;
> +		}
> +		update_page_reclaim_stat(zone, page_tail, file, active);
> +		if (likely(PageLRU(page)))
> +			head = page->lru.prev;
> +		else
> +			head = &zone->lru[lru].list;
> +		__add_page_to_lru_list(zone, page_tail, lru, head);
> +	} else {
> +		SetPageUnevictable(page_tail);
> +		add_page_to_lru_list(zone, page_tail, LRU_UNEVICTABLE);
> +	}
> +}
> +
>  /*
>   * Add the passed pages to the LRU, then drop the caller's refcount
>   * on them.  Reinitialises the caller's pagevec.
> 

Other than a few minor questions, this seems very similar to what you
had before. There is a lot going on in this patch but I did not find
anything wrong.

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 31 of 66] split_huge_page anon_vma ordering dependency
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 15:13     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:06PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> This documents how split_huge_page is safe vs new vma insertions into
> the anon_vma that may have already released the anon_vma->lock but not
> established pmds yet when split_huge_page starts.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 32 of 66] verify pmd_trans_huge isn't leaking
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 15:15     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:07PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> pmd_trans_huge must not leak in certain vmas like the mmio special pfn or
> filebacked mappings.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 33 of 66] madvise(MADV_HUGEPAGE)
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 15:19     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:08PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add madvise MADV_HUGEPAGE to mark regions that are important to be hugepage
> backed. Return -EINVAL if the vma is not of an anonymous type, or the feature
> isn't built into the kernel. Never silently return success.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -99,6 +99,7 @@ extern void __split_huge_page_pmd(struct
>  #endif
>  
>  extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> +extern int hugepage_madvise(unsigned long *vm_flags);
>  static inline int PageTransHuge(struct page *page)
>  {
>  	VM_BUG_ON(PageTail(page));
> @@ -121,6 +122,11 @@ static inline int split_huge_page(struct
>  #define wait_split_huge_page(__anon_vma, __pmd)	\
>  	do { } while (0)
>  #define PageTransHuge(page) 0
> +static inline int hugepage_madvise(unsigned long *vm_flags)
> +{
> +	BUG_ON(0);

What's BUG_ON(0) in aid of?
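
A BUG_ON(0) never triggers, so it reads like a leftover. If the intent is to
make it obvious that this stub must never be reached (MADV_HUGEPAGE is
rejected earlier when THP is compiled out), an unconditional BUG() would say
that more directly. A possible (untested) variant of the stub:

	static inline int hugepage_madvise(unsigned long *vm_flags)
	{
		/* should never be called: MADV_HUGEPAGE is rejected earlier */
		BUG();
		return 0;
	}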

> +	return 0;
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #endif /* _LINUX_HUGE_MM_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -894,6 +894,22 @@ out:
>  	return ret;
>  }
>  
> +int hugepage_madvise(unsigned long *vm_flags)
> +{
> +	/*
> +	 * Be somewhat over-protective like KSM for now!
> +	 */
> +	if (*vm_flags & (VM_HUGEPAGE | VM_SHARED  | VM_MAYSHARE   |
> +			 VM_PFNMAP   | VM_IO      | VM_DONTEXPAND |
> +			 VM_RESERVED | VM_HUGETLB | VM_INSERTPAGE |
> +			 VM_MIXEDMAP | VM_SAO))
> +		return -EINVAL;
> +
> +	*vm_flags |= VM_HUGEPAGE;
> +
> +	return 0;
> +}
> +
>  void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
>  {
>  	struct page *page;
> diff --git a/mm/madvise.c b/mm/madvise.c
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -71,6 +71,11 @@ static long madvise_behavior(struct vm_a
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_HUGEPAGE:

I should have said this at patch 4, but don't forget that Michael Kerrisk
should be made aware of MADV_HUGEPAGE so that it makes it into a manual page
at some point.
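
For that manual page, the userspace side is just the usual madvise() pattern.
A minimal sketch (untested; it assumes MADV_HUGEPAGE is visible via the
installed headers, with the asm-generic value as a fallback):

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_HUGEPAGE
	#define MADV_HUGEPAGE 14	/* asm-generic value; some arches differ */
	#endif

	int main(void)
	{
		size_t len = 16UL << 20;
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* ask for this anonymous range to be hugepage backed */
		if (madvise(buf, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");
		return 0;
	}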

> +		error = hugepage_madvise(&new_flags);
> +		if (error)
> +			goto out;
> +		break;
>  	}
>  
>  	if (new_flags == vma->vm_flags) {
> @@ -283,6 +288,9 @@ madvise_behavior_valid(int behavior)
>  	case MADV_MERGEABLE:
>  	case MADV_UNMERGEABLE:
>  #endif
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	case MADV_HUGEPAGE:
> +#endif
>  		return 1;
>  
>  	default:
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 34 of 66] add PageTransCompound
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 15:20     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:09PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Remove branches when CONFIG_TRANSPARENT_HUGEPAGE=n.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 36 of 66] memcg compound
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 15:26     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 15:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:11PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Teach memcg to charge/uncharge compound pages.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1019,6 +1019,10 @@ mem_cgroup_get_reclaim_stat_from_page(st
>  {
>  	struct page_cgroup *pc;
>  	struct mem_cgroup_per_zone *mz;
> +	int page_size = PAGE_SIZE;
> +
> +	if (PageTransHuge(page))
> +		page_size <<= compound_order(page);
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -1879,12 +1883,14 @@ static int __mem_cgroup_do_charge(struct
>   * oom-killer can be invoked.
>   */
>  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> -		gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
> +				   gfp_t gfp_mask,
> +				   struct mem_cgroup **memcg, bool oom,
> +				   int page_size)

Any concerns about page_size overflowing int? ppc64 has 16G pages, for example,
although they will never be in this path. Hmm, I see that the charge size is
already an int, so maybe this is more of a memcg issue than a THP one, but
hugetlbfs treats page sizes as unsigned long. For example, see
vma_kernel_pagesize().
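
Just to illustrate the convention I mean (a sketch, not a request to respin
the series): deriving the charge size as unsigned long, the way hugetlbfs
does, sidesteps the overflow question entirely:

	/* charge size in bytes, wide enough for any huge page order */
	unsigned long page_size = PAGE_SIZE;

	if (PageTransHuge(page))
		page_size <<= compound_order(page);

with __mem_cgroup_try_charge() and friends then taking an unsigned long
page_size instead of an int.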


>  {
>  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem = NULL;
>  	int ret;
> -	int csize = CHARGE_SIZE;
> +	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
>  
>  	/*
>  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> @@ -1909,7 +1915,7 @@ again:
>  		VM_BUG_ON(css_is_removed(&mem->css));
>  		if (mem_cgroup_is_root(mem))
>  			goto done;
> -		if (consume_stock(mem))
> +		if (page_size == PAGE_SIZE && consume_stock(mem))
>  			goto done;
>  		css_get(&mem->css);
>  	} else {
> @@ -1933,7 +1939,7 @@ again:
>  			rcu_read_unlock();
>  			goto done;
>  		}
> -		if (consume_stock(mem)) {
> +		if (page_size == PAGE_SIZE && consume_stock(mem)) {
>  			/*
>  			 * It seems dagerous to access memcg without css_get().
>  			 * But considering how consume_stok works, it's not
> @@ -1974,7 +1980,7 @@ again:
>  		case CHARGE_OK:
>  			break;
>  		case CHARGE_RETRY: /* not in OOM situation but retry */
> -			csize = PAGE_SIZE;
> +			csize = page_size;
>  			css_put(&mem->css);
>  			mem = NULL;
>  			goto again;
> @@ -1995,8 +2001,8 @@ again:
>  		}
>  	} while (ret != CHARGE_OK);
>  
> -	if (csize > PAGE_SIZE)
> -		refill_stock(mem, csize - PAGE_SIZE);
> +	if (csize > page_size)
> +		refill_stock(mem, csize - page_size);
>  	css_put(&mem->css);
>  done:
>  	*memcg = mem;
> @@ -2024,9 +2030,10 @@ static void __mem_cgroup_cancel_charge(s
>  	}
>  }
>  
> -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem,
> +				     int page_size)
>  {
> -	__mem_cgroup_cancel_charge(mem, 1);
> +	__mem_cgroup_cancel_charge(mem, page_size >> PAGE_SHIFT);
>  }
>  
>  /*
> @@ -2082,8 +2089,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
>   */
>  
>  static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> -				     struct page_cgroup *pc,
> -				     enum charge_type ctype)
> +				       struct page_cgroup *pc,
> +				       enum charge_type ctype,
> +				       int page_size)
>  {
>  	/* try_charge() can return NULL to *memcg, taking care of it. */
>  	if (!mem)
> @@ -2092,7 +2100,7 @@ static void __mem_cgroup_commit_charge(s
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
>  		unlock_page_cgroup(pc);
> -		mem_cgroup_cancel_charge(mem);
> +		mem_cgroup_cancel_charge(mem, page_size);
>  		return;
>  	}
>  
> @@ -2166,7 +2174,7 @@ static void __mem_cgroup_move_account(st
>  	mem_cgroup_charge_statistics(from, pc, false);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
> -		mem_cgroup_cancel_charge(from);
> +		mem_cgroup_cancel_charge(from, PAGE_SIZE);
>  
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
> @@ -2227,13 +2235,14 @@ static int mem_cgroup_move_parent(struct
>  		goto put;
>  
>  	parent = mem_cgroup_from_cont(pcg);
> -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
> +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> +				      PAGE_SIZE);
>  	if (ret || !parent)
>  		goto put_back;
>  
>  	ret = mem_cgroup_move_account(pc, child, parent, true);
>  	if (ret)
> -		mem_cgroup_cancel_charge(parent);
> +		mem_cgroup_cancel_charge(parent, PAGE_SIZE);
>  put_back:
>  	putback_lru_page(page);
>  put:
> @@ -2254,6 +2263,10 @@ static int mem_cgroup_charge_common(stru
>  	struct mem_cgroup *mem = NULL;
>  	struct page_cgroup *pc;
>  	int ret;
> +	int page_size = PAGE_SIZE;
> +
> +	if (PageTransHuge(page))
> +		page_size <<= compound_order(page);
>  
>  	pc = lookup_page_cgroup(page);
>  	/* can happen at boot */
> @@ -2261,11 +2274,11 @@ static int mem_cgroup_charge_common(stru
>  		return 0;
>  	prefetchw(pc);
>  
> -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
>  	if (ret || !mem)
>  		return ret;
>  
> -	__mem_cgroup_commit_charge(mem, pc, ctype);
> +	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
>  	return 0;
>  }
>  
> @@ -2274,8 +2287,6 @@ int mem_cgroup_newpage_charge(struct pag
>  {
>  	if (mem_cgroup_disabled())
>  		return 0;
> -	if (PageCompound(page))
> -		return 0;
>  	/*
>  	 * If already mapped, we don't have to account.
>  	 * If page cache, page->mapping has address_space.
> @@ -2381,13 +2392,13 @@ int mem_cgroup_try_charge_swapin(struct 
>  	if (!mem)
>  		goto charge_cur_mm;
>  	*ptr = mem;
> -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
> +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE);
>  	css_put(&mem->css);
>  	return ret;
>  charge_cur_mm:
>  	if (unlikely(!mm))
>  		mm = &init_mm;
> -	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> +	return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE);
>  }
>  
>  static void
> @@ -2403,7 +2414,7 @@ __mem_cgroup_commit_charge_swapin(struct
>  	cgroup_exclude_rmdir(&ptr->css);
>  	pc = lookup_page_cgroup(page);
>  	mem_cgroup_lru_del_before_commit_swapcache(page);
> -	__mem_cgroup_commit_charge(ptr, pc, ctype);
> +	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
>  	mem_cgroup_lru_add_after_commit_swapcache(page);
>  	/*
>  	 * Now swap is on-memory. This means this page may be
> @@ -2452,11 +2463,12 @@ void mem_cgroup_cancel_charge_swapin(str
>  		return;
>  	if (!mem)
>  		return;
> -	mem_cgroup_cancel_charge(mem);
> +	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
>  }
>  
>  static void
> -__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
> +	      int page_size)
>  {
>  	struct memcg_batch_info *batch = NULL;
>  	bool uncharge_memsw = true;
> @@ -2491,14 +2503,14 @@ __do_uncharge(struct mem_cgroup *mem, co
>  	if (batch->memcg != mem)
>  		goto direct_uncharge;
>  	/* remember freed charge and uncharge it later */
> -	batch->bytes += PAGE_SIZE;
> +	batch->bytes += page_size;
>  	if (uncharge_memsw)
> -		batch->memsw_bytes += PAGE_SIZE;
> +		batch->memsw_bytes += page_size;
>  	return;
>  direct_uncharge:
> -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	res_counter_uncharge(&mem->res, page_size);
>  	if (uncharge_memsw)
> -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> +		res_counter_uncharge(&mem->memsw, page_size);
>  	if (unlikely(batch->memcg != mem))
>  		memcg_oom_recover(mem);
>  	return;
> @@ -2512,6 +2524,7 @@ __mem_cgroup_uncharge_common(struct page
>  {
>  	struct page_cgroup *pc;
>  	struct mem_cgroup *mem = NULL;
> +	int page_size = PAGE_SIZE;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -2519,6 +2532,9 @@ __mem_cgroup_uncharge_common(struct page
>  	if (PageSwapCache(page))
>  		return NULL;
>  
> +	if (PageTransHuge(page))
> +		page_size <<= compound_order(page);
> +
>  	/*
>  	 * Check if our page_cgroup is valid
>  	 */
> @@ -2572,7 +2588,7 @@ __mem_cgroup_uncharge_common(struct page
>  		mem_cgroup_get(mem);
>  	}
>  	if (!mem_cgroup_is_root(mem))
> -		__do_uncharge(mem, ctype);
> +		__do_uncharge(mem, ctype, page_size);
>  
>  	return mem;
>  
> @@ -2767,6 +2783,7 @@ int mem_cgroup_prepare_migration(struct 
>  	enum charge_type ctype;
>  	int ret = 0;
>  
> +	VM_BUG_ON(PageTransHuge(page));
>  	if (mem_cgroup_disabled())
>  		return 0;
>  
> @@ -2816,7 +2833,7 @@ int mem_cgroup_prepare_migration(struct 
>  		return 0;
>  
>  	*ptr = mem;
> -	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
> +	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false, PAGE_SIZE);
>  	css_put(&mem->css);/* drop extra refcnt */
>  	if (ret || *ptr == NULL) {
>  		if (PageAnon(page)) {
> @@ -2843,7 +2860,7 @@ int mem_cgroup_prepare_migration(struct 
>  		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
>  	else
>  		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> -	__mem_cgroup_commit_charge(mem, pc, ctype);
> +	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
>  	return ret;
>  }
>  
> @@ -4452,7 +4469,8 @@ one_by_one:
>  			batch_count = PRECHARGE_COUNT_AT_ONCE;
>  			cond_resched();
>  		}
> -		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
> +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> +					      PAGE_SIZE);
>  		if (ret || !mem)
>  			/* mem_cgroup_clear_mc() will do uncharge later */
>  			return -ENOMEM;
> @@ -4614,6 +4632,7 @@ static int mem_cgroup_count_precharge_pt
>  	pte_t *pte;
>  	spinlock_t *ptl;
>  
> +	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; pte++, addr += PAGE_SIZE)
>  		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
> @@ -4765,6 +4784,7 @@ static int mem_cgroup_move_charge_pte_ra
>  	spinlock_t *ptl;
>  
>  retry:
> +	VM_BUG_ON(pmd_trans_huge(*pmd));
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	for (; addr != end; addr += PAGE_SIZE) {
>  		pte_t ptent = *(pte++);
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 44 of 66] skip transhuge pages in ksm for now
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:06     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:19PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Skip transhuge pages in ksm for now.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

This is an idle concern that I haven't looked into, but is there any conflict
between khugepaged scanning and KSM scanning?

Specifically, I *think* the impact of this patch is that KSM will not
accidentally split a huge page. Is that right? If so, it could do with
being included in the changelog.

On the other hand, can khugepaged be prevented from promoting a hugepage
because of KSM?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 45 of 66] remove PG_buddy
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:08     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:20PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be
> added to page->flags without overflowing (because of the sparse section bits
> increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move
> the memory hotplug code from _mapcount to lru.next to avoid any risk of
> clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use
> lru.next even more easily than the mapcount instead.
> 

Does this make much of a difference? I confess I didn't read the patch closely
because I didn't get the motivation.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 51 of 66] set recommended min free kbytes
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:16     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:26PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> If transparent hugepage is enabled initialize min_free_kbytes to an optimal
> value by default. This moves the hugeadm algorithm in kernel.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -85,6 +85,47 @@ struct khugepaged_scan {
>  	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
>  };
>  
> +
> +static int set_recommended_min_free_kbytes(void)
> +{
> +	struct zone *zone;
> +	int nr_zones = 0;
> +	unsigned long recommended_min;
> +	extern int min_free_kbytes;
> +
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +		      &transparent_hugepage_flags) &&
> +	    !test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +		      &transparent_hugepage_flags))
> +		return 0;
> +
> +	for_each_populated_zone(zone)
> +		nr_zones++;
> +
> +	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
> +	recommended_min = HPAGE_PMD_NR * nr_zones * 2;
> +

The really important value is pageblock_nr_pages here. It'll just happen
to work on x86 and x86-64 but anti-fragmentation is really about
pageblocks, not PMDs.

> +	/*
> +	 * Make sure that on average at least two pageblocks are almost free
> +	 * of another type, one for a migratetype to fall back to and a
> +	 * second to avoid subsequent fallbacks of other types There are 3
> +	 * MIGRATE_TYPES we care about.
> +	 */
> +	recommended_min += HPAGE_PMD_NR * nr_zones * 3 * 3;
> +

Same on the use of pageblock_nr_pages. Also, you can replace 3 with
MIGRATE_PCPTYPES.
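
Putting the two comments together, the calculation would become roughly
(untested):

	/* at least 2 pageblocks free per zone for MIGRATE_RESERVE */
	recommended_min = pageblock_nr_pages * nr_zones * 2;

	/*
	 * On average, two almost-free pageblocks for each of the
	 * MIGRATE_PCPTYPES migratetypes that can fall back.
	 */
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;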

> +	/* don't ever allow to reserve more than 5% of the lowmem */
> +	recommended_min = min(recommended_min,
> +			      (unsigned long) nr_free_buffer_pages() / 20);
> +	recommended_min <<= (PAGE_SHIFT-10);
> +
> +	if (recommended_min > min_free_kbytes) {
> +		min_free_kbytes = recommended_min;
> +		setup_per_zone_wmarks();
> +	}


The timing of when this is called is important. Would you mind doing a quick
debugging check by adding a printk to setup_zone_migrate_reserve() to ensure
MIGRATE_RESERVE is getting set on sensible pageblocks? (See where the comment
"Suitable for reserving if this block is movable" is.) If MIGRATE_RESERVE
blocks are not being created in a sensible fashion, atomic high-order
allocations will suffer in mysterious ways.
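
Something as crude as this, dropped into setup_zone_migrate_reserve() at the
point where the pageblock is converted, would be enough to check (a sketch
that assumes the local zone/page/pfn variables of that function):

	set_pageblock_migratetype(page, MIGRATE_RESERVE);
	move_freepages_block(zone, page, MIGRATE_RESERVE);
	printk(KERN_DEBUG "MIGRATE_RESERVE: node %d zone %s pfn %lu\n",
	       zone_to_nid(zone), zone->name, pfn);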

Setting the higher min_free_kbytes from userspace happens to work because the
system is initialised and MIGRATE_MOVABLE pageblocks exist, but that might not
be the case when it is set automatically by this patch.

> +	return 0;
> +}
> +late_initcall(set_recommended_min_free_kbytes);
> +
>  static int start_khugepaged(void)
>  {
>  	int err = 0;
> @@ -108,6 +149,8 @@ static int start_khugepaged(void)
>  		mutex_unlock(&khugepaged_mutex);
>  		if (wakeup)
>  			wake_up_interruptible(&khugepaged_wait);
> +
> +		set_recommended_min_free_kbytes();
>  	} else
>  		/* wakeup to exit */
>  		wake_up_interruptible(&khugepaged_wait);
> @@ -177,6 +220,13 @@ static ssize_t enabled_store(struct kobj
>  			ret = err;
>  	}
>  
> +	if (ret > 0 &&
> +	    (test_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +		      &transparent_hugepage_flags) ||
> +	     test_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +		      &transparent_hugepage_flags)))
> +		set_recommended_min_free_kbytes();
> +
>  	return ret;
>  }
>  static struct kobj_attribute enabled_attr =
> @@ -464,6 +514,8 @@ static int __init hugepage_init(void)
>  
>  	start_khugepaged();
>  
> +	set_recommended_min_free_kbytes();
> +
>  out:
>  	return err;
>  }
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 52 of 66] enable direct defrag
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:17     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:27PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> With memory compaction in, and lumpy-reclaim removed, it seems safe enough to
> defrag memory during the (synchronous) transparent hugepage page faults
> (TRANSPARENT_HUGEPAGE_DEFRAG_FLAG) and not only during khugepaged (async)
> hugepage allocations that was already enabled even before memory compaction was
> in (TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG).
> 

While I'm hoping that my series on using compaction will be used instead
of outright deletion of lumpy reclaim, this patch would still make
sense.

Acked-by: Mel Gorman <mel@csn.ul.ie>

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -28,6 +28,7 @@
>   */
>  unsigned long transparent_hugepage_flags __read_mostly =
>  	(1<<TRANSPARENT_HUGEPAGE_FLAG)|
> +	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
>  
>  /* default scan 8*512 pte (or vmas) every 30 second */
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-09 21:11       ` Andrea Arcangeli
@ 2010-11-18 16:22         ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 10:11:45PM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 09, 2010 at 03:20:33PM +0900, KOSAKI Motohiro wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > With transparent hugepage support we need compaction for the "defrag" sysfs
> > > controls to be effective.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > ---
> > > 
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -305,6 +305,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
> > >  config TRANSPARENT_HUGEPAGE
> > >  	bool "Transparent Hugepage Support"
> > >  	depends on X86 && MMU
> > > +	select COMPACTION
> > >  	help
> > >  	  Transparent Hugepages allows the kernel to use huge pages and
> > >  	  huge tlb transparently to the applications whenever possible.
> > 
> > I dislike this. THP and compaction are completely orthogonal. I think 
> > you are talking only your performance recommendation. I mean I dislike
> > Kconfig 'select' hell and I hope every developers try to avoid it as 
> > far as possible.
> 
> At the moment THP hangs the system if COMPACTION isn't selected

Just to confirm - by hang, you mean grinds to a slow pace as opposed to
coming to a complete stop and having to restart?

If so then

Acked-by: Mel Gorman <mel@csn.ul.ie>

> (please try yourself if you don't believe), as without COMPACTION
> lumpy reclaim wouldn't be entirely disabled. So at the moment it's not
> orthogonal. When lumpy will be removed from the VM (like I tried
> multiple times to achieve) I can remove the select COMPACTION in
> theory, but then 99% of THP users would be still doing a mistake in
> disabling compaction, even if the mistake won't return in fatal
> runtime but just slightly degraded performance.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 56 of 66] transhuge isolate_migratepages()
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:25     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:31PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> It's not worth migrating transparent hugepages during compaction. Those
> hugepages don't create fragmentation.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>

I think this will collide with my compaction series because of the "fast
scanning" patch but the resolution should be trivial. Your check still should
go in after the PageLRU check and the PageTransCompound check should still
be after __isolate_lru_page.

> ---
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -272,10 +272,25 @@ static unsigned long isolate_migratepage
>  		if (PageBuddy(page))
>  			continue;
>  
> +		if (!PageLRU(page))
> +			continue;
> +
> +		/*
> +		 * PageLRU is set, and lru_lock excludes isolation,
> +		 * splitting and collapsing (collapsing has already
> +		 * happened if PageLRU is set).
> +		 */
> +		if (PageTransHuge(page)) {
> +			low_pfn += (1 << compound_order(page)) - 1;
> +			continue;
> +		}
> +
>  		/* Try isolate the page */
>  		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
>  			continue;
>  
> +		VM_BUG_ON(PageTransCompound(page));
> +
>  		/* Successfully isolated */
>  		del_page_from_lru_list(zone, page, page_lru(page));
>  		list_add(&page->lru, migratelist);
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:31     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:36PM +0100, Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> This takes advantage of memory compaction to properly generate pages of order >
> 0 if regular page reclaim fails and priority level becomes more severe and we
> don't reach the proper watermarks.
> 

I don't think this is related to THP although I see what you're doing.
It should be handled on its own. I'd also wonder if some of the tg3
failures are due to MIGRATE_RESERVE not being set properly when
min_free_kbytes is automatically resized.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -11,6 +11,9 @@
>  /* The full zone was compacted */
>  #define COMPACT_COMPLETE	3
>  
> +#define COMPACT_MODE_DIRECT_RECLAIM	0
> +#define COMPACT_MODE_KSWAPD		1
> +
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> @@ -20,6 +23,9 @@ extern int sysctl_extfrag_handler(struct
>  			void __user *buffer, size_t *length, loff_t *ppos);
>  
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
> +extern unsigned long compact_zone_order(struct zone *zone,
> +					int order, gfp_t gfp_mask,
> +					int compact_mode);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask);
>  
> @@ -59,6 +65,13 @@ static inline unsigned long try_to_compa
>  	return COMPACT_CONTINUE;
>  }
>  
> +static inline unsigned long compact_zone_order(struct zone *zone,
> +					       int order, gfp_t gfp_mask,
> +					       int compact_mode)
> +{
> +	return COMPACT_CONTINUE;
> +}
> +
>  static inline void defer_compaction(struct zone *zone)
>  {
>  }
> diff --git a/mm/compaction.c b/mm/compaction.c
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -38,6 +38,8 @@ struct compact_control {
>  	unsigned int order;		/* order a direct compactor needs */
>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>  	struct zone *zone;
> +
> +	int compact_mode;
>  };
>  
>  static unsigned long release_freepages(struct list_head *freelist)
> @@ -357,10 +359,10 @@ static void update_nr_listpages(struct c
>  }
>  
>  static int compact_finished(struct zone *zone,
> -						struct compact_control *cc)
> +			    struct compact_control *cc)
>  {
>  	unsigned int order;
> -	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> +	unsigned long watermark;
>  
>  	if (fatal_signal_pending(current))
>  		return COMPACT_PARTIAL;
> @@ -370,12 +372,27 @@ static int compact_finished(struct zone 
>  		return COMPACT_COMPLETE;
>  
>  	/* Compaction run is not finished if the watermark is not met */
> +	if (cc->compact_mode != COMPACT_MODE_KSWAPD)
> +		watermark = low_wmark_pages(zone);
> +	else
> +		watermark = high_wmark_pages(zone);
> +	watermark += (1 << cc->order);
> +
>  	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
>  		return COMPACT_CONTINUE;
>  
>  	if (cc->order == -1)
>  		return COMPACT_CONTINUE;
>  
> +	/*
> +	 * Generating only one page of the right order is not enough
> +	 * for kswapd, we must continue until we're above the high
> +	 * watermark as a pool for high order GFP_ATOMIC allocations
> +	 * too.
> +	 */
> +	if (cc->compact_mode == COMPACT_MODE_KSWAPD)
> +		return COMPACT_CONTINUE;
> +
>  	/* Direct compactor: Is a suitable page free? */
>  	for (order = cc->order; order < MAX_ORDER; order++) {
>  		/* Job done if page is free of the right migratetype */
> @@ -433,8 +450,9 @@ static int compact_zone(struct zone *zon
>  	return ret;
>  }
>  
> -static unsigned long compact_zone_order(struct zone *zone,
> -						int order, gfp_t gfp_mask)
> +unsigned long compact_zone_order(struct zone *zone,
> +				 int order, gfp_t gfp_mask,
> +				 int compact_mode)
>  {
>  	struct compact_control cc = {
>  		.nr_freepages = 0,
> @@ -442,6 +460,7 @@ static unsigned long compact_zone_order(
>  		.order = order,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
>  		.zone = zone,
> +		.compact_mode = compact_mode,
>  	};
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);
> @@ -476,7 +495,7 @@ unsigned long try_to_compact_pages(struc
>  	 * made because an assumption is made that the page allocator can satisfy
>  	 * the "cheaper" orders without taking special steps
>  	 */
> -	if (order <= PAGE_ALLOC_COSTLY_ORDER || !may_enter_fs || !may_perform_io)
> +	if (!order || !may_enter_fs || !may_perform_io)
>  		return rc;
>  
>  	count_vm_event(COMPACTSTALL);
> @@ -517,7 +536,8 @@ unsigned long try_to_compact_pages(struc
>  			break;
>  		}
>  
> -		status = compact_zone_order(zone, order, gfp_mask);
> +		status = compact_zone_order(zone, order, gfp_mask,
> +					    COMPACT_MODE_DIRECT_RECLAIM);
>  		rc = max(status, rc);
>  
>  		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> @@ -547,6 +567,7 @@ static int compact_node(int nid)
>  			.nr_freepages = 0,
>  			.nr_migratepages = 0,
>  			.order = -1,
> +			.compact_mode = COMPACT_MODE_DIRECT_RECLAIM,
>  		};
>  
>  		zone = &pgdat->node_zones[zoneid];
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -40,6 +40,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/delayacct.h>
>  #include <linux/sysctl.h>
> +#include <linux/compaction.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/div64.h>
> @@ -2254,6 +2255,7 @@ loop_again:
>  		 * cause too much scanning of the lower zones.
>  		 */
>  		for (i = 0; i <= end_zone; i++) {
> +			int compaction;
>  			struct zone *zone = pgdat->node_zones + i;
>  			int nr_slab;
>  
> @@ -2283,9 +2285,26 @@ loop_again:
>  						lru_pages);
>  			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
>  			total_scanned += sc.nr_scanned;
> +
> +			compaction = 0;
> +			if (order &&
> +			    zone_watermark_ok(zone, 0,
> +					       high_wmark_pages(zone),
> +					      end_zone, 0) &&
> +			    !zone_watermark_ok(zone, order,
> +					       high_wmark_pages(zone),
> +					       end_zone, 0)) {
> +				compact_zone_order(zone,
> +						   order,
> +						   sc.gfp_mask,
> +						   COMPACT_MODE_KSWAPD);
> +				compaction = 1;
> +			}
> +
>  			if (zone->all_unreclaimable)
>  				continue;
> -			if (nr_slab == 0 && !zone_reclaimable(zone))
> +			if (!compaction && nr_slab == 0 &&
> +			    !zone_reclaimable(zone))
>  				zone->all_unreclaimable = 1;
>  			/*
>  			 * If we've done a decent amount of scanning and
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 62 of 66] disable transparent hugepages by default on small systems
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-18 16:34     ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:28:37PM +0100, Andrea Arcangeli wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> On small systems, the extra memory used by the anti-fragmentation
> memory reserve and simply because huge pages are smaller than large
> pages can easily outweigh the benefits of less TLB misses.
> 

A less obvious concern is running on a NUMA machine with asymmetric node sizes
where one node is very small. The reserve could make that node unusable. I've
only seen it happen once in practice (via hugeadm) but it was also a <1G
machine with 4 nodes (don't ask me why).

> In case of the crashdump kernel, OOMs have been observed due to
> the anti-fragmentation memory reserve taking up a large fraction
> of the crashdump image.
> 
> This patch disables transparent hugepages on systems with less
> than 1GB of RAM, but the hugepage subsystem is fully initialized
> so administrators can enable THP through /sys if desired.
> 
> v2: reduce the limit to 512MB
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Acked-by: Avi Kivity <avi@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mel@csn.ul.ie>
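
The check being described boils down to something like the following at THP
init time; a minimal sketch assuming the 512MB threshold from v2 and the
transparent_hugepage_flags variable used elsewhere in the series:

/* Sketch: keep THP built in but default-disabled on < 512MB of RAM */
if (totalram_pages < (512 << (20 - PAGE_SHIFT)))
	transparent_hugepage_flags = 0;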

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 00 of 66] Transparent Hugepage Support #32
  2010-11-03 15:27 ` Andrea Arcangeli
@ 2010-11-18 16:39   ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 16:39 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:35PM +0100, Andrea Arcangeli wrote:
> Some of some relevant user of the project:
> 
> KVM Virtualization
> GCC (kernel build included, requires a few liner patch to enable)
> JVM
> VMware Workstation
> HPC
> 
> It would be great if it could go in -mm.
> 

FWIW, I saw nothing of major concern while reading through the series other
than the delete-lumpy-reclaim portion. I've posted a series that should
address the lumpy reclaim concerns while also improving the performance
of compaction in general.

For the rest of the series, the vast majority of my comments were nits as
my questions were covered (and addressed) in previous revisions. I'll try
and set time aside to convert some of the libhugetlbfs-orientated tests to run
against these patches, but I'm not anticipating any problems and it'd be
nice to see this merged at some point.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-18 11:13     ` Mel Gorman
@ 2010-11-18 17:13       ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-18 17:13 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 11:13:49AM +0000, Mel Gorman wrote:
> > This patch fixes the problem by using two VMAs - one which covers the temporary
> > stack and the other which covers the new location. This guarantees that rmap
> > can always find the migration PTE even if it is copied while rmap_walk is
> > taking place.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> This old chestnut. IIRC, this was the more complete solution to a fix that made
> it into mainline. The patch still looks reasonable. It does add a kmalloc()
> but I can't remember if we decided we were ok with it or not. Can you remind
> me? More importantly, it appears to be surviving the original testcase that
> this bug was about (20 minutes so far but will leave it a few hours). Assuming
> the test does not crash;
> 

Incidentally, after 6.5 hours this still hasn't crashed. Previously a
worst case reproduction scenario for the bug was around 35 minutes.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 05 of 66] compound_lock
  2010-11-18 11:49     ` Mel Gorman
@ 2010-11-18 17:28       ` Linus Torvalds
  -1 siblings, 0 replies; 331+ messages in thread
From: Linus Torvalds @ 2010-11-18 17:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 3:49 AM, Mel Gorman <mel@csn.ul.ie> wrote:
>> +
>> +static inline void compound_lock_irqsave(struct page *page,
>> +                                      unsigned long *flagsp)
>> +{
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +     unsigned long flags;
>> +     local_irq_save(flags);
>> +     compound_lock(page);
>> +     *flagsp = flags;
>> +#endif
>> +}
>> +
>
> The pattern for spinlock irqsave passes in unsigned long, not unsigned
> long *. It'd be nice if they matched.

Indeed. Just make the thing return the flags the way the normal
spin_lock_irqsave() function does.
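
A minimal sketch of that interface shape, assuming the compound_lock() and
compound_unlock() helpers added by this patch:

static inline unsigned long compound_lock_irqsave(struct page *page)
{
	unsigned long flags = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	local_irq_save(flags);
	compound_lock(page);
#endif
	return flags;
}

static inline void compound_unlock_irqrestore(struct page *page,
					      unsigned long flags)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	compound_unlock(page);
	local_irq_restore(flags);
#endif
}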

                  Linus

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 17 of 66] add pmd mangling generic functions
  2010-11-18 12:52     ` Mel Gorman
@ 2010-11-18 17:32       ` Linus Torvalds
  -1 siblings, 0 replies; 331+ messages in thread
From: Linus Torvalds @ 2010-11-18 17:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 4:52 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Wed, Nov 03, 2010 at 04:27:52PM +0100, Andrea Arcangeli wrote:
>> From: Andrea Arcangeli <aarcange@redhat.com>
>>
>> Some are needed to build but not actually used on archs not supporting
>> transparent hugepages. Others like pmdp_clear_flush are used by x86 too.
>>
>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>> Acked-by: Rik van Riel <riel@redhat.com>
>
> Acked-by: Mel Gorman <mel@csn.ul.ie>

I dunno. Those macros are _way_ too big and heavy to be macros or
inline functions. Why aren't pmdp_splitting_flush() etc just
functions?

There is no performance advantage to inlining them - the TLB flush is
going to be expensive enough that there's no point in avoiding a
function call. And that header file really does end up being _really_
ugly.
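
For illustration, one of these moved out of line could look roughly like the
following in a mm/ C file, assuming the pmdp_get_and_clear() and
flush_tlb_range() primitives plus HPAGE_PMD_MASK/HPAGE_PMD_SIZE that the
series already uses:

/* Sketch: an ordinary function instead of a large header macro */
pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
		       pmd_t *pmdp)
{
	pmd_t pmd;

	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
	pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return pmd;
}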

                      Linus

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 64 of 66] scale nr_rotated to balance memory pressure
  2010-11-09  6:16     ` KOSAKI Motohiro
@ 2010-11-18 19:15       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 19:15 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, Nov 09, 2010 at 03:16:21PM +0900, KOSAKI Motohiro wrote:
> I haven't seen this patch series carefully yet. So, probably
> my question is dumb. Why don't we need to change ->recent_scanned[] too?

That one is taken care of by the previous patch, where it is increased by 512
through nr_taken (after that patch the isolation methods return counts in
4k units). What was left to update was the nr_rotated++, which is why this
patch isn't touching recent_scanned.
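
Schematically, the nr_rotated++ just becomes a scaled increment, along these
lines (a sketch; hpage_nr_pages() stands for whatever helper the series uses
to return 512 for a THP and 1 for a normal page):

/* Sketch: a rotated THP counts as 512 base pages, not one */
nr_rotated += hpage_nr_pages(page);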

Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 36 of 66] memcg compound
  2010-11-18 15:26     ` Mel Gorman
@ 2010-11-19  1:10       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 331+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-19  1:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, 18 Nov 2010 15:26:28 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Wed, Nov 03, 2010 at 04:28:11PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Teach memcg to charge/uncharge compound pages.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1019,6 +1019,10 @@ mem_cgroup_get_reclaim_stat_from_page(st
> >  {
> >  	struct page_cgroup *pc;
> >  	struct mem_cgroup_per_zone *mz;
> > +	int page_size = PAGE_SIZE;
> > +
> > +	if (PageTransHuge(page))
> > +		page_size <<= compound_order(page);
> >  
> >  	if (mem_cgroup_disabled())
> >  		return NULL;
> > @@ -1879,12 +1883,14 @@ static int __mem_cgroup_do_charge(struct
> >   * oom-killer can be invoked.
> >   */
> >  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > -		gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
> > +				   gfp_t gfp_mask,
> > +				   struct mem_cgroup **memcg, bool oom,
> > +				   int page_size)
> 
> Any concerns about page_size overflowing int? ppc64 has 16G pages for example
> although it will never be in this path. hmm, I see that charge size is already
> int so maybe this is more of a memcg issue than it is THP but hugetlbfs
> treats page sizes as unsigned long. For example see vma_kernel_pagesize()
> 

If pages bigger than 4GB ever need to be supported, unsigned long should be used.
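
For instance, the local shown above would simply become (a sketch of the type
change only, matching how hugetlbfs handles page sizes):

unsigned long page_size = PAGE_SIZE;

if (PageTransHuge(page))
	page_size <<= compound_order(page);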


> 
> >  {
> >  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem = NULL;
> >  	int ret;
> > -	int csize = CHARGE_SIZE;
> > +	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> >  

unsigned long here.


> >  	/*
> >  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> > @@ -1909,7 +1915,7 @@ again:
> >  		VM_BUG_ON(css_is_removed(&mem->css));
> >  		if (mem_cgroup_is_root(mem))
> >  			goto done;
> > -		if (consume_stock(mem))
> > +		if (page_size == PAGE_SIZE && consume_stock(mem))
> >  			goto done;
> >  		css_get(&mem->css);
> >  	} else {
> > @@ -1933,7 +1939,7 @@ again:
> >  			rcu_read_unlock();
> >  			goto done;
> >  		}
> > -		if (consume_stock(mem)) {
> > +		if (page_size == PAGE_SIZE && consume_stock(mem)) {
> >  			/*
> >  			 * It seems dagerous to access memcg without css_get().
> >  			 * But considering how consume_stok works, it's not
> > @@ -1974,7 +1980,7 @@ again:
> >  		case CHARGE_OK:
> >  			break;
> >  		case CHARGE_RETRY: /* not in OOM situation but retry */
> > -			csize = PAGE_SIZE;
> > +			csize = page_size;
> >  			css_put(&mem->css);
> >  			mem = NULL;
> >  			goto again;
> > @@ -1995,8 +2001,8 @@ again:
> >  		}
> >  	} while (ret != CHARGE_OK);
> >  
> > -	if (csize > PAGE_SIZE)
> > -		refill_stock(mem, csize - PAGE_SIZE);
> > +	if (csize > page_size)
> > +		refill_stock(mem, csize - page_size);
> >  	css_put(&mem->css);
> >  done:
> >  	*memcg = mem;
> > @@ -2024,9 +2030,10 @@ static void __mem_cgroup_cancel_charge(s
> >  	}
> >  }
> >  
> > -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> > +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem,
> > +				     int page_size)
> >  {
> > -	__mem_cgroup_cancel_charge(mem, 1);
> > +	__mem_cgroup_cancel_charge(mem, page_size >> PAGE_SHIFT);
> >  }
> >  
> >  /*
> > @@ -2082,8 +2089,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> >   */
> >  
> >  static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> > -				     struct page_cgroup *pc,
> > -				     enum charge_type ctype)
> > +				       struct page_cgroup *pc,
> > +				       enum charge_type ctype,
> > +				       int page_size)
> >  {
> >  	/* try_charge() can return NULL to *memcg, taking care of it. */
> >  	if (!mem)
> > @@ -2092,7 +2100,7 @@ static void __mem_cgroup_commit_charge(s
> >  	lock_page_cgroup(pc);
> >  	if (unlikely(PageCgroupUsed(pc))) {
> >  		unlock_page_cgroup(pc);
> > -		mem_cgroup_cancel_charge(mem);
> > +		mem_cgroup_cancel_charge(mem, page_size);
> >  		return;
> >  	}
> >  
> > @@ -2166,7 +2174,7 @@ static void __mem_cgroup_move_account(st
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  	if (uncharge)
> >  		/* This is not "cancel", but cancel_charge does all we need. */
> > -		mem_cgroup_cancel_charge(from);
> > +		mem_cgroup_cancel_charge(from, PAGE_SIZE);
> >  
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> > @@ -2227,13 +2235,14 @@ static int mem_cgroup_move_parent(struct
> >  		goto put;
> >  
> >  	parent = mem_cgroup_from_cont(pcg);
> > -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
> > +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> > +				      PAGE_SIZE);
> >  	if (ret || !parent)
> >  		goto put_back;
> >  
> >  	ret = mem_cgroup_move_account(pc, child, parent, true);
> >  	if (ret)
> > -		mem_cgroup_cancel_charge(parent);
> > +		mem_cgroup_cancel_charge(parent, PAGE_SIZE);
> >  put_back:
> >  	putback_lru_page(page);
> >  put:
> > @@ -2254,6 +2263,10 @@ static int mem_cgroup_charge_common(stru
> >  	struct mem_cgroup *mem = NULL;
> >  	struct page_cgroup *pc;
> >  	int ret;
> > +	int page_size = PAGE_SIZE;
> > +
> > +	if (PageTransHuge(page))
> > +		page_size <<= compound_order(page);
> >  
> >  	pc = lookup_page_cgroup(page);
> >  	/* can happen at boot */
> > @@ -2261,11 +2274,11 @@ static int mem_cgroup_charge_common(stru
> >  		return 0;
> >  	prefetchw(pc);
> >  
> > -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> > +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
> >  	if (ret || !mem)
> >  		return ret;
> >  
> > -	__mem_cgroup_commit_charge(mem, pc, ctype);
> > +	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
> >  	return 0;
> >  }
> >  
> > @@ -2274,8 +2287,6 @@ int mem_cgroup_newpage_charge(struct pag
> >  {
> >  	if (mem_cgroup_disabled())
> >  		return 0;
> > -	if (PageCompound(page))
> > -		return 0;
> >  	/*
> >  	 * If already mapped, we don't have to account.
> >  	 * If page cache, page->mapping has address_space.
> > @@ -2381,13 +2392,13 @@ int mem_cgroup_try_charge_swapin(struct 
> >  	if (!mem)
> >  		goto charge_cur_mm;
> >  	*ptr = mem;
> > -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
> > +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE);
> >  	css_put(&mem->css);
> >  	return ret;
> >  charge_cur_mm:
> >  	if (unlikely(!mm))
> >  		mm = &init_mm;
> > -	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > +	return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE);
> >  }
> >  
> >  static void
> > @@ -2403,7 +2414,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >  	cgroup_exclude_rmdir(&ptr->css);
> >  	pc = lookup_page_cgroup(page);
> >  	mem_cgroup_lru_del_before_commit_swapcache(page);
> > -	__mem_cgroup_commit_charge(ptr, pc, ctype);
> > +	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
> >  	mem_cgroup_lru_add_after_commit_swapcache(page);
> >  	/*
> >  	 * Now swap is on-memory. This means this page may be
> > @@ -2452,11 +2463,12 @@ void mem_cgroup_cancel_charge_swapin(str
> >  		return;
> >  	if (!mem)
> >  		return;
> > -	mem_cgroup_cancel_charge(mem);
> > +	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
> >  }
> >  
> >  static void
> > -__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
> > +	      int page_size)
> >  {
> >  	struct memcg_batch_info *batch = NULL;
> >  	bool uncharge_memsw = true;
> > @@ -2491,14 +2503,14 @@ __do_uncharge(struct mem_cgroup *mem, co
> >  	if (batch->memcg != mem)
> >  		goto direct_uncharge;
> >  	/* remember freed charge and uncharge it later */
> > -	batch->bytes += PAGE_SIZE;
> > +	batch->bytes += page_size;

Hmm, isn't it simpler to avoid batched-uncharge when page_size > PAGE_SIZE ?
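
A sketch of that alternative (guessing at the shape, reusing the names
from the hunk above; only the tail of __do_uncharge is shown), where a
huge page simply bypasses the per-cpu batch:

	/* only batch order-0 pages; huge pages go straight to the res_counter */
	if (page_size > PAGE_SIZE)
		goto direct_uncharge;
	if (batch->memcg != mem)
		goto direct_uncharge;
	/* remember freed charge and uncharge it later */
	batch->bytes += PAGE_SIZE;
	if (uncharge_memsw)
		batch->memsw_bytes += PAGE_SIZE;
	return;
direct_uncharge:
	res_counter_uncharge(&mem->res, page_size);
	if (uncharge_memsw)
		res_counter_uncharge(&mem->memsw, page_size);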



> >  	if (uncharge_memsw)
> > -		batch->memsw_bytes += PAGE_SIZE;
> > +		batch->memsw_bytes += page_size;
> >  	return;
> >  direct_uncharge:
> > -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	res_counter_uncharge(&mem->res, page_size);
> >  	if (uncharge_memsw)
> > -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&mem->memsw, page_size);
> >  	if (unlikely(batch->memcg != mem))
> >  		memcg_oom_recover(mem);
> >  	return;
> > @@ -2512,6 +2524,7 @@ __mem_cgroup_uncharge_common(struct page
> >  {
> >  	struct page_cgroup *pc;
> >  	struct mem_cgroup *mem = NULL;
> > +	int page_size = PAGE_SIZE;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return NULL;
> > @@ -2519,6 +2532,9 @@ __mem_cgroup_uncharge_common(struct page
> >  	if (PageSwapCache(page))
> >  		return NULL;
> >  
> > +	if (PageTransHuge(page))
> > +		page_size <<= compound_order(page);
> > +
> >  	/*
> >  	 * Check if our page_cgroup is valid
> >  	 */
> > @@ -2572,7 +2588,7 @@ __mem_cgroup_uncharge_common(struct page
> >  		mem_cgroup_get(mem);
> >  	}
> >  	if (!mem_cgroup_is_root(mem))
> > -		__do_uncharge(mem, ctype);
> > +		__do_uncharge(mem, ctype, page_size);
> >  
> >  	return mem;
> >  
> > @@ -2767,6 +2783,7 @@ int mem_cgroup_prepare_migration(struct 
> >  	enum charge_type ctype;
> >  	int ret = 0;
> >  
> > +	VM_BUG_ON(PageTransHuge(page));
> >  	if (mem_cgroup_disabled())
> >  		return 0;
> >  
> > @@ -2816,7 +2833,7 @@ int mem_cgroup_prepare_migration(struct 
> >  		return 0;
> >  
> >  	*ptr = mem;
> > -	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
> > +	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false, PAGE_SIZE);
> >  	css_put(&mem->css);/* drop extra refcnt */
> >  	if (ret || *ptr == NULL) {
> >  		if (PageAnon(page)) {
> > @@ -2843,7 +2860,7 @@ int mem_cgroup_prepare_migration(struct 
> >  		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> >  	else
> >  		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> > -	__mem_cgroup_commit_charge(mem, pc, ctype);
> > +	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
> >  	return ret;
> >  }
> >  
> > @@ -4452,7 +4469,8 @@ one_by_one:
> >  			batch_count = PRECHARGE_COUNT_AT_ONCE;
> >  			cond_resched();
> >  		}
> > -		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
> > +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> > +					      PAGE_SIZE);
> >  		if (ret || !mem)
> >  			/* mem_cgroup_clear_mc() will do uncharge later */
> >  			return -ENOMEM;
> > @@ -4614,6 +4632,7 @@ static int mem_cgroup_count_precharge_pt
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> >  
> > +	VM_BUG_ON(pmd_trans_huge(*pmd));
> >  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >  	for (; addr != end; pte++, addr += PAGE_SIZE)
> >  		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
> > @@ -4765,6 +4784,7 @@ static int mem_cgroup_move_charge_pte_ra
> >  	spinlock_t *ptl;
> >  
> >  retry:
> > +	VM_BUG_ON(pmd_trans_huge(*pmd));
> >  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >  	for (; addr != end; addr += PAGE_SIZE) {
> >  		pte_t ptent = *(pte++);
> > 
> 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 36 of 66] memcg compound
@ 2010-11-19  1:10       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 331+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-19  1:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, 18 Nov 2010 15:26:28 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> On Wed, Nov 03, 2010 at 04:28:11PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Teach memcg to charge/uncharge compound pages.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1019,6 +1019,10 @@ mem_cgroup_get_reclaim_stat_from_page(st
> >  {
> >  	struct page_cgroup *pc;
> >  	struct mem_cgroup_per_zone *mz;
> > +	int page_size = PAGE_SIZE;
> > +
> > +	if (PageTransHuge(page))
> > +		page_size <<= compound_order(page);
> >  
> >  	if (mem_cgroup_disabled())
> >  		return NULL;
> > @@ -1879,12 +1883,14 @@ static int __mem_cgroup_do_charge(struct
> >   * oom-killer can be invoked.
> >   */
> >  static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > -		gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
> > +				   gfp_t gfp_mask,
> > +				   struct mem_cgroup **memcg, bool oom,
> > +				   int page_size)
> 
> Any concerns about page_size overflowing int? ppc64 has 16G pages, for example,
> although it will never be in this path. Hmm, I see that the charge size is already
> an int, so maybe this is more of a memcg issue than a THP one, but hugetlbfs
> treats page sizes as unsigned long. For example, see vma_kernel_pagesize().
> 

If there is a requirement for pages bigger than 4GB, unsigned long should be used.


> 
> >  {
> >  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem = NULL;
> >  	int ret;
> > -	int csize = CHARGE_SIZE;
> > +	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> >  

unsigned long here.
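
I.e. presumably something like this (a sketch of the type fix being
pointed at, keeping the names from the hunk):

	/* don't truncate the unsigned long result of max() into an int */
	unsigned long csize = max(CHARGE_SIZE, (unsigned long) page_size);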


> >  	/*
> >  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> > @@ -1909,7 +1915,7 @@ again:
> >  		VM_BUG_ON(css_is_removed(&mem->css));
> >  		if (mem_cgroup_is_root(mem))
> >  			goto done;
> > -		if (consume_stock(mem))
> > +		if (page_size == PAGE_SIZE && consume_stock(mem))
> >  			goto done;
> >  		css_get(&mem->css);
> >  	} else {
> > @@ -1933,7 +1939,7 @@ again:
> >  			rcu_read_unlock();
> >  			goto done;
> >  		}
> > -		if (consume_stock(mem)) {
> > +		if (page_size == PAGE_SIZE && consume_stock(mem)) {
> >  			/*
> >  			 * It seems dagerous to access memcg without css_get().
> >  			 * But considering how consume_stok works, it's not
> > @@ -1974,7 +1980,7 @@ again:
> >  		case CHARGE_OK:
> >  			break;
> >  		case CHARGE_RETRY: /* not in OOM situation but retry */
> > -			csize = PAGE_SIZE;
> > +			csize = page_size;
> >  			css_put(&mem->css);
> >  			mem = NULL;
> >  			goto again;
> > @@ -1995,8 +2001,8 @@ again:
> >  		}
> >  	} while (ret != CHARGE_OK);
> >  
> > -	if (csize > PAGE_SIZE)
> > -		refill_stock(mem, csize - PAGE_SIZE);
> > +	if (csize > page_size)
> > +		refill_stock(mem, csize - page_size);
> >  	css_put(&mem->css);
> >  done:
> >  	*memcg = mem;
> > @@ -2024,9 +2030,10 @@ static void __mem_cgroup_cancel_charge(s
> >  	}
> >  }
> >  
> > -static void mem_cgroup_cancel_charge(struct mem_cgroup *mem)
> > +static void mem_cgroup_cancel_charge(struct mem_cgroup *mem,
> > +				     int page_size)
> >  {
> > -	__mem_cgroup_cancel_charge(mem, 1);
> > +	__mem_cgroup_cancel_charge(mem, page_size >> PAGE_SHIFT);
> >  }
> >  
> >  /*
> > @@ -2082,8 +2089,9 @@ struct mem_cgroup *try_get_mem_cgroup_fr
> >   */
> >  
> >  static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> > -				     struct page_cgroup *pc,
> > -				     enum charge_type ctype)
> > +				       struct page_cgroup *pc,
> > +				       enum charge_type ctype,
> > +				       int page_size)
> >  {
> >  	/* try_charge() can return NULL to *memcg, taking care of it. */
> >  	if (!mem)
> > @@ -2092,7 +2100,7 @@ static void __mem_cgroup_commit_charge(s
> >  	lock_page_cgroup(pc);
> >  	if (unlikely(PageCgroupUsed(pc))) {
> >  		unlock_page_cgroup(pc);
> > -		mem_cgroup_cancel_charge(mem);
> > +		mem_cgroup_cancel_charge(mem, page_size);
> >  		return;
> >  	}
> >  
> > @@ -2166,7 +2174,7 @@ static void __mem_cgroup_move_account(st
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  	if (uncharge)
> >  		/* This is not "cancel", but cancel_charge does all we need. */
> > -		mem_cgroup_cancel_charge(from);
> > +		mem_cgroup_cancel_charge(from, PAGE_SIZE);
> >  
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> > @@ -2227,13 +2235,14 @@ static int mem_cgroup_move_parent(struct
> >  		goto put;
> >  
> >  	parent = mem_cgroup_from_cont(pcg);
> > -	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
> > +	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false,
> > +				      PAGE_SIZE);
> >  	if (ret || !parent)
> >  		goto put_back;
> >  
> >  	ret = mem_cgroup_move_account(pc, child, parent, true);
> >  	if (ret)
> > -		mem_cgroup_cancel_charge(parent);
> > +		mem_cgroup_cancel_charge(parent, PAGE_SIZE);
> >  put_back:
> >  	putback_lru_page(page);
> >  put:
> > @@ -2254,6 +2263,10 @@ static int mem_cgroup_charge_common(stru
> >  	struct mem_cgroup *mem = NULL;
> >  	struct page_cgroup *pc;
> >  	int ret;
> > +	int page_size = PAGE_SIZE;
> > +
> > +	if (PageTransHuge(page))
> > +		page_size <<= compound_order(page);
> >  
> >  	pc = lookup_page_cgroup(page);
> >  	/* can happen at boot */
> > @@ -2261,11 +2274,11 @@ static int mem_cgroup_charge_common(stru
> >  		return 0;
> >  	prefetchw(pc);
> >  
> > -	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
> > +	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
> >  	if (ret || !mem)
> >  		return ret;
> >  
> > -	__mem_cgroup_commit_charge(mem, pc, ctype);
> > +	__mem_cgroup_commit_charge(mem, pc, ctype, page_size);
> >  	return 0;
> >  }
> >  
> > @@ -2274,8 +2287,6 @@ int mem_cgroup_newpage_charge(struct pag
> >  {
> >  	if (mem_cgroup_disabled())
> >  		return 0;
> > -	if (PageCompound(page))
> > -		return 0;
> >  	/*
> >  	 * If already mapped, we don't have to account.
> >  	 * If page cache, page->mapping has address_space.
> > @@ -2381,13 +2392,13 @@ int mem_cgroup_try_charge_swapin(struct 
> >  	if (!mem)
> >  		goto charge_cur_mm;
> >  	*ptr = mem;
> > -	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true);
> > +	ret = __mem_cgroup_try_charge(NULL, mask, ptr, true, PAGE_SIZE);
> >  	css_put(&mem->css);
> >  	return ret;
> >  charge_cur_mm:
> >  	if (unlikely(!mm))
> >  		mm = &init_mm;
> > -	return __mem_cgroup_try_charge(mm, mask, ptr, true);
> > +	return __mem_cgroup_try_charge(mm, mask, ptr, true, PAGE_SIZE);
> >  }
> >  
> >  static void
> > @@ -2403,7 +2414,7 @@ __mem_cgroup_commit_charge_swapin(struct
> >  	cgroup_exclude_rmdir(&ptr->css);
> >  	pc = lookup_page_cgroup(page);
> >  	mem_cgroup_lru_del_before_commit_swapcache(page);
> > -	__mem_cgroup_commit_charge(ptr, pc, ctype);
> > +	__mem_cgroup_commit_charge(ptr, pc, ctype, PAGE_SIZE);
> >  	mem_cgroup_lru_add_after_commit_swapcache(page);
> >  	/*
> >  	 * Now swap is on-memory. This means this page may be
> > @@ -2452,11 +2463,12 @@ void mem_cgroup_cancel_charge_swapin(str
> >  		return;
> >  	if (!mem)
> >  		return;
> > -	mem_cgroup_cancel_charge(mem);
> > +	mem_cgroup_cancel_charge(mem, PAGE_SIZE);
> >  }
> >  
> >  static void
> > -__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype)
> > +__do_uncharge(struct mem_cgroup *mem, const enum charge_type ctype,
> > +	      int page_size)
> >  {
> >  	struct memcg_batch_info *batch = NULL;
> >  	bool uncharge_memsw = true;
> > @@ -2491,14 +2503,14 @@ __do_uncharge(struct mem_cgroup *mem, co
> >  	if (batch->memcg != mem)
> >  		goto direct_uncharge;
> >  	/* remember freed charge and uncharge it later */
> > -	batch->bytes += PAGE_SIZE;
> > +	batch->bytes += page_size;

Hmm, isn't it simpler to avoid batched-uncharge when page_size > PAGE_SIZE ?



> >  	if (uncharge_memsw)
> > -		batch->memsw_bytes += PAGE_SIZE;
> > +		batch->memsw_bytes += page_size;
> >  	return;
> >  direct_uncharge:
> > -	res_counter_uncharge(&mem->res, PAGE_SIZE);
> > +	res_counter_uncharge(&mem->res, page_size);
> >  	if (uncharge_memsw)
> > -		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
> > +		res_counter_uncharge(&mem->memsw, page_size);
> >  	if (unlikely(batch->memcg != mem))
> >  		memcg_oom_recover(mem);
> >  	return;
> > @@ -2512,6 +2524,7 @@ __mem_cgroup_uncharge_common(struct page
> >  {
> >  	struct page_cgroup *pc;
> >  	struct mem_cgroup *mem = NULL;
> > +	int page_size = PAGE_SIZE;
> >  
> >  	if (mem_cgroup_disabled())
> >  		return NULL;
> > @@ -2519,6 +2532,9 @@ __mem_cgroup_uncharge_common(struct page
> >  	if (PageSwapCache(page))
> >  		return NULL;
> >  
> > +	if (PageTransHuge(page))
> > +		page_size <<= compound_order(page);
> > +
> >  	/*
> >  	 * Check if our page_cgroup is valid
> >  	 */
> > @@ -2572,7 +2588,7 @@ __mem_cgroup_uncharge_common(struct page
> >  		mem_cgroup_get(mem);
> >  	}
> >  	if (!mem_cgroup_is_root(mem))
> > -		__do_uncharge(mem, ctype);
> > +		__do_uncharge(mem, ctype, page_size);
> >  
> >  	return mem;
> >  
> > @@ -2767,6 +2783,7 @@ int mem_cgroup_prepare_migration(struct 
> >  	enum charge_type ctype;
> >  	int ret = 0;
> >  
> > +	VM_BUG_ON(PageTransHuge(page));
> >  	if (mem_cgroup_disabled())
> >  		return 0;
> >  
> > @@ -2816,7 +2833,7 @@ int mem_cgroup_prepare_migration(struct 
> >  		return 0;
> >  
> >  	*ptr = mem;
> > -	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
> > +	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false, PAGE_SIZE);
> >  	css_put(&mem->css);/* drop extra refcnt */
> >  	if (ret || *ptr == NULL) {
> >  		if (PageAnon(page)) {
> > @@ -2843,7 +2860,7 @@ int mem_cgroup_prepare_migration(struct 
> >  		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
> >  	else
> >  		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
> > -	__mem_cgroup_commit_charge(mem, pc, ctype);
> > +	__mem_cgroup_commit_charge(mem, pc, ctype, PAGE_SIZE);
> >  	return ret;
> >  }
> >  
> > @@ -4452,7 +4469,8 @@ one_by_one:
> >  			batch_count = PRECHARGE_COUNT_AT_ONCE;
> >  			cond_resched();
> >  		}
> > -		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false);
> > +		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem, false,
> > +					      PAGE_SIZE);
> >  		if (ret || !mem)
> >  			/* mem_cgroup_clear_mc() will do uncharge later */
> >  			return -ENOMEM;
> > @@ -4614,6 +4632,7 @@ static int mem_cgroup_count_precharge_pt
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> >  
> > +	VM_BUG_ON(pmd_trans_huge(*pmd));
> >  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >  	for (; addr != end; pte++, addr += PAGE_SIZE)
> >  		if (is_target_pte_for_mc(vma, addr, *pte, NULL))
> > @@ -4765,6 +4784,7 @@ static int mem_cgroup_move_charge_pte_ra
> >  	spinlock_t *ptl;
> >  
> >  retry:
> > +	VM_BUG_ON(pmd_trans_huge(*pmd));
> >  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >  	for (; addr != end; addr += PAGE_SIZE) {
> >  		pte_t ptent = *(pte++);
> > 
> 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 38 of 66] memcontrol: try charging huge pages from stock
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-19  1:14     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 331+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-19  1:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, 03 Nov 2010 16:28:13 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Johannes Weiner <hannes@cmpxchg.org>
> 
> The stock unit is just bytes, there is no reason to only take normal
> pages from it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

nonsense. The stock size is CHARGE_SIZE=32*PAGE_SIZE at maximum.
If we make it larger than the hugepage size, it becomes at least 2M per cpu.
This means memcg's resource "usage" accounting will have 128MB of inaccuracy.

Nack.

Thanks,
-Kame
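
(To spell out the arithmetic, assuming a 64-cpu machine, which is what
makes the numbers match: one 2M hugepage precached per cpu means
64 * 2M = 128MB of charge sitting in the per-cpu stocks, counted in
"usage" while not yet backing any page.)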


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 39 of 66] memcg huge memory
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-19  1:19     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 331+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-19  1:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Wed, 03 Nov 2010 16:28:14 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> Add memcg charge/uncharge to hugepage faults in huge_memory.c.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -233,6 +233,7 @@ static int __do_huge_pmd_anonymous_page(
>  	VM_BUG_ON(!PageCompound(page));
>  	pgtable = pte_alloc_one(mm, haddr);
>  	if (unlikely(!pgtable)) {
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		return VM_FAULT_OOM;
>  	}
> @@ -243,6 +244,7 @@ static int __do_huge_pmd_anonymous_page(
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_none(*pmd))) {
>  		spin_unlock(&mm->page_table_lock);
> +		mem_cgroup_uncharge_page(page);
>  		put_page(page);
>  		pte_free(mm, pgtable);
>  	} else {
> @@ -286,6 +288,10 @@ int do_huge_pmd_anonymous_page(struct mm
>  		page = alloc_hugepage(transparent_hugepage_defrag(vma));
>  		if (unlikely(!page))
>  			goto out;
> +		if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
> +			put_page(page);
> +			goto out;
> +		}
>  
>  		return __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page);
>  	}
> @@ -402,9 +408,15 @@ static int do_huge_pmd_wp_page_fallback(
>  	for (i = 0; i < HPAGE_PMD_NR; i++) {
>  		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
>  					  vma, address);
> -		if (unlikely(!pages[i])) {
> -			while (--i >= 0)
> +		if (unlikely(!pages[i] ||
> +			     mem_cgroup_newpage_charge(pages[i], mm,
> +						       GFP_KERNEL))) {
> +			if (pages[i])
>  				put_page(pages[i]);
> +			while (--i >= 0) {
> +				mem_cgroup_uncharge_page(pages[i]);
> +				put_page(pages[i]);
> +			}

Maybe you can use batched-uncharge here.
==
mem_cgroup_uncharge_start()
{
	do loop;
}
mem_cgroup_uncharge_end();
==
Then, many atomic ops can be reduced.
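
Concretely, for the error path in the hunk above that would look
something like this (a sketch; mem_cgroup_uncharge_start()/end() just
bracket the loop so the uncharges get coalesced):

	mem_cgroup_uncharge_start();
	while (--i >= 0) {
		mem_cgroup_uncharge_page(pages[i]);
		put_page(pages[i]);
	}
	mem_cgroup_uncharge_end();

The same bracketing would apply to the out_free_pages loop further down.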


>  			kfree(pages);
>  			ret |= VM_FAULT_OOM;
>  			goto out;
> @@ -455,8 +467,10 @@ out:
>  
>  out_free_pages:
>  	spin_unlock(&mm->page_table_lock);
> -	for (i = 0; i < HPAGE_PMD_NR; i++)
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		mem_cgroup_uncharge_page(pages[i]);
>  		put_page(pages[i]);
> +	}

here, too.

>  	kfree(pages);
>  	goto out;
>  }
> @@ -501,14 +515,22 @@ int do_huge_pmd_wp_page(struct mm_struct
>  		goto out;
>  	}
>  
> +	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> +		put_page(new_page);
> +		put_page(page);
> +		ret |= VM_FAULT_OOM;
> +		goto out;
> +	}
> +
>  	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
>  	__SetPageUptodate(new_page);
>  
>  	spin_lock(&mm->page_table_lock);
>  	put_page(page);
> -	if (unlikely(!pmd_same(*pmd, orig_pmd)))
> +	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
> +		mem_cgroup_uncharge_page(new_page);
>  		put_page(new_page);
> -	else {
> +	} else {
>  		pmd_t entry;
>  		VM_BUG_ON(!PageHead(page));
>  		entry = mk_pmd(new_page, vma->vm_page_prot);


Hmm... it seems there is no code for move_account() of hugepages in this
series. I think it needs some complicated work to walk the page tables.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-18 11:13     ` Mel Gorman
@ 2010-11-19 17:38       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-19 17:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 11:13:49AM +0000, Mel Gorman wrote:
> This old chestnut. IIRC, this was the more complete solution to a fix that made
> it into mainline. The patch still looks reasonable. It does add a kmalloc()
> but I can't remember if we decided we were ok with it or not. Can you remind

We decided the kmalloc was ok, but Linus didn't like this approach. I
kept it in my tree because I didn't want to remember when/if to add the
special check in the accurate rmap walks. I find it simpler if all
rmap walks are accurate by default.

> me? More importantly, it appears to be surviving the original testcase that
> this bug was about (20 minutes so far but will leave it a few hours). Assuming
> the test does not crash;

Sure the patch is safe.

If Linus still doesn't like this, I will immediately remove this patch
and add the special checks to the rmap walks in huge_memory.c. You
know my preference, but this is a detail and my preference is
irrelevant.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-19 17:38       ` Andrea Arcangeli
@ 2010-11-19 17:54         ` Linus Torvalds
  -1 siblings, 0 replies; 331+ messages in thread
From: Linus Torvalds @ 2010-11-19 17:54 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Fri, Nov 19, 2010 at 9:38 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> We decided the kmalloc was ok, but Linus didn't like this approach. I
> kept it in my tree because I didn't want to remember when/if to add the
> special check in the accurate rmap walks. I find it simpler if all
> rmap walks are accurate by default.

Why isn't the existing cheap solution sufficient?

My opinion is still that we shouldn't add the expense to the common
case, and it's the uncommon case (migration) that should just handle
it.

                Linus

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack
  2010-11-19 17:54         ` Linus Torvalds
@ 2010-11-19 19:26           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-19 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Fri, Nov 19, 2010 at 09:54:27AM -0800, Linus Torvalds wrote:
> On Fri, Nov 19, 2010 at 9:38 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > We decided the kmalloc was ok, but Linus didn't like this approach. I
> > kept it in my tree because I didn't want to remember when/if to add the
> > special check in the accurate rmap walks. I find it simpler if all
> > rmap walks are accurate by default.
> 
> Why isn't the existing cheap solution sufficient?

It is sufficient.

> My opinion is still that we shouldn't add the expense to the common
> case, and it's the uncommon case (migration) that should just handle
> it.

Ok, I'll remove this patch from the next submit.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 03 of 66] transparent hugepage support documentation
  2010-11-18 11:41     ` Mel Gorman
@ 2010-11-25 14:35       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 14:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 11:41:05AM +0000, Mel Gorman wrote:
> libhugetlbfs can also automatically back shared memory, text and data
> with huge pages which THP cannot do. libhugetlbfs cannot demote and
> promote memory where THP can. They are not exactly like-with-like
> comparisons. How about;
> 
> "Transparent Hugepage Support is an alternative means of using huge pages for
> the backing of anonymous memory with huge pages that supports the automatic
> promotion and demotion of page sizes."
> 
> ?

Ok I changed it like this:

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -4,12 +4,13 @@
 
 Performance critical computing applications dealing with large memory
 working sets are already running on top of libhugetlbfs and in turn
-hugetlbfs. Transparent Hugepage Support is an alternative to
-libhugetlbfs that offers the same feature of libhugetlbfs but without
-the shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc etc..).
+hugetlbfs. Transparent Hugepage Support is an alternative means of
+using huge pages for the backing of virtual memory with huge pages
+that supports the automatic promotion and demotion of page sizes and
+without the shortcomings of hugetlbfs.
 
-In the future it can expand over the pagecache layer starting with
-tmpfs to reduce even further the hugetlbfs usages.
+Currently it only works for anonymous memory mappings but in the
+future it can expand over the pagecache layer starting with tmpfs.
 
 The reason applications are running faster is because of two
 factors. The first factor is almost completely irrelevant and it's not


> > +In the future it can expand over the pagecache layer starting with
> > +tmpfs to reduce even further the hugetlbfs usages.
> > +
> > +The reason applications are running faster is because of two
> > +factors. The first factor is almost completely irrelevant and it's not
> > +of significant interest because it'll also have the downside of
> > +requiring larger clear-page copy-page in page faults which is a
> > +potentially negative effect. The first factor consists in taking a
> > +single page fault for each 2M virtual region touched by userland (so
> > +reducing the enter/exit kernel frequency by a 512 times factor). This
> > +only matters the first time the memory is accessed for the lifetime of
> > +a memory mapping. The second long lasting and much more important
> > +factor will affect all subsequent accesses to the memory for the whole
> > +runtime of the application. The second factor consist of two
> > +components: 1) the TLB miss will run faster (especially with
> > +virtualization using nested pagetables but also on bare metal without
> > +virtualization)
> 
> Careful on that first statement. It's not necessarily true for bare metal
> as some processors show that the TLB miss handler for huge pages is slower
> than base pages. Not sure why but it seemed to be the case on P4 anyway
> (at least the one I have). Maybe it was a measurement error but on chips
> with split TLBs for page sizes, there is no guarantee they are the same speed.
> 
> It's probably true for virtualisation though considering the vastly reduced
> number of cache lines required to translate an address.
>
> I'd weaken the language for bare metal to say "almost always" but it's
> not a big deal.

Agreed.

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -23,14 +24,14 @@ a memory mapping. The second long lastin
 factor will affect all subsequent accesses to the memory for the whole
 runtime of the application. The second factor consist of two
 components: 1) the TLB miss will run faster (especially with
-virtualization using nested pagetables but also on bare metal without
-virtualization) and 2) a single TLB entry will be mapping a much
-larger amount of virtual memory in turn reducing the number of TLB
-misses. With virtualization and nested pagetables the TLB can be
-mapped of larger size only if both KVM and the Linux guest are using
-hugepages but a significant speedup already happens if only one of the
-two is using hugepages just because of the fact the TLB miss is going
-to run faster.
+virtualization using nested pagetables but almost always also on bare
+metal without virtualization) and 2) a single TLB entry will be
+mapping a much larger amount of virtual memory in turn reducing the
+number of TLB misses. With virtualization and nested pagetables the
+TLB can be mapped of larger size only if both KVM and the Linux guest
+are using hugepages but a significant speedup already happens if only
+one of the two is using hugepages just because of the fact the TLB
+miss is going to run faster.
 
 == Design ==
 

> > +With virtualization and nested pagetables the TLB can be
> > +mapped of larger size only if both KVM and the Linux guest are using
> > +hugepages but a significant speedup already happens if only one of the
> > +two is using hugepages just because of the fact the TLB miss is going
> > +to run faster.
> > +
> > +== Design ==
> > +
> > +- "graceful fallback": mm components which don't have transparent
> > +  hugepage knownledge fall back to breaking a transparent hugepage and
> 
> %s/knownledge/knowledge/

corrected.

> > +  working on the regular pages and their respective regular pmd/pte
> > +  mappings
> > +
> > +- if an hugepage allocation fails because of memory fragmentation,
> 
> s/an/a/

I'll trust you on this one ;).

> > +  regular pages should be gracefully allocated instead and mixed in
> > +  the same vma without any failure or significant delay and generally
> > +  without userland noticing
> > +
> 
> why "generally"? At worst the application will see varying performance
> characteristics but that applies to a lot more than THP.

Removed "generally".

> > +- if some task quits and more hugepages become available (either
> > +  immediately in the buddy or through the VM), guest physical memory
> > +  backed by regular pages should be relocated on hugepages
> > +  automatically (with khugepaged)
> > +
> > +- it doesn't require boot-time memory reservation and in turn it uses
> 
> neither does hugetlbfs.
> 
> > +  hugepages whenever possible (the only possible reservation here is
> > +  kernelcore= to avoid unmovable pages to fragment all the memory but
> > +  such a tweak is not specific to transparent hugepage support and
> > +  it's a generic feature that applies to all dynamic high order
> > +  allocations in the kernel)
> > +
> > +- this initial support only offers the feature in the anonymous memory
> > +  regions but it'd be ideal to move it to tmpfs and the pagecache
> > +  later
> > +
> > +Transparent Hugepage Support maximizes the usefulness of free memory
> > +if compared to the reservation approach of hugetlbfs by allowing all
> > +unused memory to be used as cache or other movable (or even unmovable
> > +entities).
> 
> hugetlbfs with memory overcommit offers something similar, particularly
> in combination with libhugetlbfs with can automatically fall back to base
> pages. I've run benchmarks comparing hugetlbfs using a static hugepage
> pool with hugetlbfs dynamically allocating hugepages as required with no
> discernable performance difference. So this statement is not strictly accurate.

Ok, but without the libhugetlbfs fallback to regular pages, which splits
the vma and creates a non-hugetlbfs one in the middle, it really requires
memory reservation to be _sure_ (libhugetlbfs runs outside of hugetlbfs
and I doubt it allows later collapsing like khugepaged provides;
furthermore I'm comparing kernel vs kernel, not kernel vs
kernel+userland wrapping)... So I think mentioning the fact that THP
doesn't require memory reservation is more or less fair enough. I removed
"boot-time" though.

> > +It doesn't require reservation to prevent hugepage
> > +allocation failures to be noticeable from userland. It allows paging
> > +and all other advanced VM features to be available on the
> > +hugepages. It requires no modifications for applications to take
> > +advantage of it.
> > +
> > +Applications however can be further optimized to take advantage of
> > +this feature, like for example they've been optimized before to avoid
> > +a flood of mmap system calls for every malloc(4k). Optimizing userland
> > +is by far not mandatory and khugepaged already can take care of long
> > +lived page allocations even for hugepage unaware applications that
> > +deals with large amounts of memory.
> > +
> > +In certain cases when hugepages are enabled system wide, application
> > +may end up allocating more memory resources. An application may mmap a
> > +large region but only touch 1 byte of it, in that case a 2M page might
> > +be allocated instead of a 4k page for no good. This is why it's
> > +possible to disable hugepages system-wide and to only have them inside
> > +MADV_HUGEPAGE madvise regions.
> > +
> > +Embedded systems should enable hugepages only inside madvise regions
> > +to eliminate any risk of wasting any precious byte of memory and to
> > +only run faster.
> > +
> > +Applications that gets a lot of benefit from hugepages and that don't
> > +risk to lose memory by using hugepages, should use
> > +madvise(MADV_HUGEPAGE) on their critical mmapped regions.
> > +
> > +== sysfs ==
> > +
> > +Transparent Hugepage Support can be entirely disabled (mostly for
> > +debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
> > +avoid the risk of consuming more memory resources) or enabled system
> > +wide. This can be achieved with one of:
> > +
> > +echo always >/sys/kernel/mm/transparent_hugepage/enabled
> > +echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > +echo never >/sys/kernel/mm/transparent_hugepage/enabled
> > +
> > +It's also possible to limit defrag efforts in the VM to generate
> > +hugepages in case they're not immediately free to madvise regions or
> > +to never try to defrag memory and simply fallback to regular pages
> > +unless hugepages are immediately available.
> 
> This is the first mention of defrag but hey, it's not a paper :)

Not sure I get this one: is this too early or too late to mention
defrag? But yes, this is not a paper so I guess I don't need to care ;)

> > +and how many milliseconds to wait in khugepaged between each pass (you
> > +can se this to 0 to run khugepaged at 100% utilization of one core):
> 
> s/se/set/

good eye ;)

> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> > +
> > +and how many milliseconds to wait in khugepaged if there's an hugepage
> > +allocation failure to throttle the next allocation attempt.
> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> > +
> > +The khugepaged progress can be seen in the number of pages collapsed:
> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
> > +
> > +for each pass:
> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
> > +
> > +== Boot parameter ==
> > +
> > +You can change the sysfs boot time defaults of Transparent Hugepage
> > +Support by passing the parameter "transparent_hugepage=always" or
> > +"transparent_hugepage=madvise" or "transparent_hugepage=never"
> > +(without "") to the kernel command line.
> > +
> > +== Need of restart ==
> > +
> 
> Need of application restart?
> 
> A casual reader might otherwise interpret it as a system restart for
> some godawful reason of their own.

Agreed.

> > +The transparent_hugepage/enabled values only affect future
> > +behavior. So to make them effective you need to restart any
> 
> s/behavior/behaviour/

Funny, after this change my spell checker asks me to rename behaviour
back to behavior. I'll stick to the spell checker; they are synonymous
anyway.

> > +In case you can't handle compound pages if they're returned by
> > +follow_page, the FOLL_SPLIT bit can be specified as parameter to
> > +follow_page, so that it will split the hugepages before returning
> > +them. Migration for example passes FOLL_SPLIT as parameter to
> > +follow_page because it's not hugepage aware and in fact it can't work
> > +at all on hugetlbfs (but it instead works fine on transparent
> 
> hugetlbfs pages can now migrate although it's only used by hwpoison.

Yep. I'll need to teach migrate to migrate a hugepage without splitting
it too... (especially for numa, not so much for hwpoison). And the
migration support for hugetlbfs now makes it more complicated to
migrate transhuge pages with the same function, because that code
isn't inside a VM_HUGETLB check... Worst case we can check the hugepage
destructor to differentiate the two; I've yet to check that. Surely
it's feasible and mostly an implementation issue.

> > +Code walking pagetables but unware about huge pmds can simply call
> > +split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
> > +pmd_offset. It's trivial to make the code transparent hugepage aware
> > +by just grepping for "pmd_offset" and adding split_huge_page_pmd where
> > +missing after pmd_offset returns the pmd. Thanks to the graceful
> > +fallback design, with a one liner change, you can avoid to write
> > +hundred if not thousand of lines of complex code to make your code
> > +hugepage aware.
> > +
> 
> It'd be nice if you could point to a specific example but by no means
> mandatory.

Ok:

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -226,6 +226,20 @@ but you can't handle it natively in your
 calling split_huge_page(page). This is what the Linux VM does before
 it tries to swapout the hugepage for example.
 
+Example to make mremap.c transparent hugepage aware with a one liner
+change:
+
+diff --git a/mm/mremap.c b/mm/mremap.c
+--- a/mm/mremap.c
++++ b/mm/mremap.c
+@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
+ 		return NULL;
+ 
+ 	pmd = pmd_offset(pud, addr);
++	split_huge_page_pmd(mm, pmd);
+ 	if (pmd_none_or_clear_bad(pmd))
+ 		return NULL;
+
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling


> Ok, I'll need to read the rest of the series to verify if this is
> correct but by and large it looks good. I think some of the language is
> stronger than it should be and some of the comparisons with libhugetlbfs
> are a bit off but I'd be naturally defensive on that topic. Make the
> suggested changes if you like but if you don't, it shouldn't affect the
> series.

I hope it looks better now, thanks!

---------
= Transparent Hugepage Support =

== Objective ==

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent Hugepage Support is an alternative means of
using huge pages for the backing of virtual memory, one that supports
the automatic promotion and demotion of page sizes and does not have
the shortcomings of hugetlbfs.

Currently it only works for anonymous memory mappings but in the
future it can expand over the pagecache layer starting with tmpfs.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest, because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists of taking a single page fault for each 2M
virtual region touched by userland (so reducing the enter/exit kernel
frequency by a factor of 512). This only matters the first time the
memory is accessed for the lifetime of a memory mapping. The second,
long lasting and much more important factor affects all subsequent
accesses to the memory for the whole runtime of the application. The
second factor consists of two components: 1) the TLB miss will run
faster (especially with virtualization using nested pagetables, but
almost always also on bare metal without virtualization) and 2) a
single TLB entry will map a much larger amount of virtual memory, in
turn reducing the number of TLB misses. With virtualization and nested
pagetables the TLB can only map the larger size if both KVM and the
Linux guest are using hugepages, but a significant speedup already
happens if only one of the two is using hugepages, just because the
TLB miss is going to run faster.
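
For example, with 4k pages a 1G working set needs up to 1G / 4k = 262144
distinct TLB entries to be fully mapped, while with 2M pages it needs
only 1G / 2M = 512, so both the TLB reach and the miss rate improve
dramatically.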

== Design ==

- "graceful fallback": mm components which don't have transparent
  hugepage knowledge fall back to breaking a transparent hugepage and
  working on the regular pages and their respective regular pmd/pte
  mappings

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed in
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated on hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to avoid unmovable pages fragmenting all the memory, but such a tweak
  is not specific to transparent hugepage support and it's a generic
  feature that applies to all dynamic high order allocations in the
  kernel)

- this initial support only offers the feature in the anonymous memory
  regions but it'd be ideal to move it to tmpfs and the pagecache
  later

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or as other movable (or even
unmovable) entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications can, however, be further optimized to take advantage of
this feature, just as they've been optimized before to avoid a flood
of mmap system calls for every malloc(4k). Optimizing userland is by
no means mandatory and khugepaged can already take care of long lived
page allocations even for hugepage unaware applications that deal
with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that gets a lot of benefit from hugepages and that don't
risk to lose memory by using hugepages, should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

== sysfs ==

Transparent Hugepage Support can be entirely disabled (mostly for
debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
avoid the risk of consuming more memory resources) or enabled system
wide. This can be achieved with one of:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the VM's defrag efforts to generate
hugepages (in case they're not immediately free) to madvise regions
only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
later by using hugepages instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However, it's
also possible to disable defrag in khugepaged:

echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core):

/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt:

/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full scans performed (one per pass):

/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

== Boot parameter ==

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter "transparent_hugepage=always" or
"transparent_hugepage=madvise" or "transparent_hugepage=never"
(without "") to the kernel command line.

== Need of application restart ==

The transparent_hugepage/enabled values only affect future
behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to
the regions registered in khugepaged.

== get_user_pages and follow_page ==

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most gup users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice the fact the page is huge. But
if any driver is going to mangle the page structure of the tail page
(like checking page->mapping or other bits that are relevant for the
head page and not the tail page), it should be updated to check the
head page instead (while serializing properly against
split_huge_page() to prevent the head and tail pages from disappearing
from under it; see the futex code for an example of that, hugetlbfs
also needed special handling in the futex code for similar reasons).

NOTE: these aren't new constraints to the GUP API, and they match the
same constraints that apply to hugetlbfs too, so any driver capable
of handling GUP on hugetlbfs will also work fine on transparent
hugepage backed mappings.

In case you can't handle compound pages if they're returned by
follow_page, the FOLL_SPLIT bit can be specified as a parameter to
follow_page, so that it will split the hugepages before returning
them. Migration for example passes FOLL_SPLIT as a parameter to
follow_page because it's not hugepage aware and in fact it can't work
at all on hugetlbfs (but it instead works fine on transparent
hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
hugepages being returned (it's not only checking the pfn of the
page and pinning it during the copy, it also expects to migrate the
memory in regular page sizes and with regular pte/pmd mappings).

== Optimizing the applications ==

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

== Hugetlbfs ==

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.

== Graceful fallback ==

Code walking pagetables but unaware about huge pmds can simply call
split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one-liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage, for example.

Example to make mremap.c transparent hugepage aware with a one-liner
change:

diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_pmd(mm, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;

== Locking in hugepage aware code ==

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_page_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_sem in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
mm->page_table_lock and re-run pmd_trans_huge. Taking the
page_table_lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_page can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page_table_lock and fall back to the old code as
before. Otherwise you should run pmd_trans_splitting on the pmd. In
case pmd_trans_splitting returns true, it means split_huge_page is
already in the middle of splitting the page. So if pmd_trans_splitting
returns true it's enough to drop the page_table_lock and call
wait_split_huge_page and then fall back to the old code paths. You are
guaranteed that by the time wait_split_huge_page returns, the pmd
isn't huge anymore. If pmd_trans_splitting returns false, you can
proceed to process the huge pmd and the hugepage natively. Once
finished you can drop the page_table_lock.

== compound_lock, get_user_pages and put_page ==

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the
page structures. It can do that easily for refcounts taken by huge pmd
mappings. But the GUP API as created by hugetlbfs (which returns head
and tail pages if running get_user_pages on an address backed by any
hugepage) requires the refcount to be accounted on the tail pages and
not only on the head pages, if we want to be able to run
split_huge_page while there are gup pins established on any tail
page. Not being able to run split_huge_page if there's any gup pin on
any tail page would mean having to split all hugepages upfront in
get_user_pages, which is unacceptable as too many gup users are
performance critical and they must work natively on hugepages like
they work natively on hugetlbfs already (hugetlbfs is simpler because
hugetlbfs pages cannot be split, so there is no requirement to account
the pins on the tail pages for hugetlbfs). If we didn't account the
gup refcounts on the tail pages during gup, we wouldn't know anymore
which tail page is pinned by gup and which is not while we run
split_huge_page. But we still have to add the gup pin to the head page
too, to know when we can free the compound page in case it's never
split during its lifetime. That requires changing not just
get_page, but put_page as well so that when put_page runs on a tail
page (and only on a tail page) it will find its respective head page,
and then it will decrease the head page refcount in addition to the
tail page refcount. To obtain a head page reliably and to decrease its
refcount without race conditions, put_page has to serialize against
__split_huge_page_refcount using a special per-page lock called
compound_lock.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 03 of 66] transparent hugepage support documentation
@ 2010-11-25 14:35       ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 14:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 11:41:05AM +0000, Mel Gorman wrote:
> libhugetlbfs can also automatically back shared memory, text and data
> with huge pages which THP cannot do. libhugetlbfs cannot demote and
> promote memory where THP can. They are not exactly like-with-like
> comparisons. How about;
> 
> "Transparent Hugepage Support is an alternative means of using huge pages for
> the backing of anonymous memory with huge pages that supports the automatic
> promotion and demotion of page sizes."
> 
> ?

Ok I changed it like this:

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -4,12 +4,13 @@
 
 Performance critical computing applications dealing with large memory
 working sets are already running on top of libhugetlbfs and in turn
-hugetlbfs. Transparent Hugepage Support is an alternative to
-libhugetlbfs that offers the same feature of libhugetlbfs but without
-the shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc etc..).
+hugetlbfs. Transparent Hugepage Support is an alternative means of
+using huge pages for the backing of virtual memory with huge pages
+that supports the automatic promotion and demotion of page sizes and
+without the shortcomings of hugetlbfs.
 
-In the future it can expand over the pagecache layer starting with
-tmpfs to reduce even further the hugetlbfs usages.
+Currently it only works for anonymous memory mappings but in the
+future it can expand over the pagecache layer starting with tmpfs.
 
 The reason applications are running faster is because of two
 factors. The first factor is almost completely irrelevant and it's not


> > +In the future it can expand over the pagecache layer starting with
> > +tmpfs to reduce even further the hugetlbfs usages.
> > +
> > +The reason applications are running faster is because of two
> > +factors. The first factor is almost completely irrelevant and it's not
> > +of significant interest because it'll also have the downside of
> > +requiring larger clear-page copy-page in page faults which is a
> > +potentially negative effect. The first factor consists in taking a
> > +single page fault for each 2M virtual region touched by userland (so
> > +reducing the enter/exit kernel frequency by a 512 times factor). This
> > +only matters the first time the memory is accessed for the lifetime of
> > +a memory mapping. The second long lasting and much more important
> > +factor will affect all subsequent accesses to the memory for the whole
> > +runtime of the application. The second factor consist of two
> > +components: 1) the TLB miss will run faster (especially with
> > +virtualization using nested pagetables but also on bare metal without
> > +virtualization)
> 
> Careful on that first statement. It's not necessarily true for bare metal
> as some processors show that the TLB miss handler for huge pages is slower
> than base pages. Not sure why but it seemed to be the case on P4 anyway
> (at least the one I have). Maybe it was a measurement error but on chips
> with split TLBs for page sizes, there is no guarantee they are the same speed.
> 
> It's probably true for virtualisation though considering the vastly reduced
> number of cache lines required to translate an address.
>
> I'd weaken the language for bare metal to say "almost always" but it's
> not a big deal.

Agreed.

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -23,14 +24,14 @@ a memory mapping. The second long lastin
 factor will affect all subsequent accesses to the memory for the whole
 runtime of the application. The second factor consist of two
 components: 1) the TLB miss will run faster (especially with
-virtualization using nested pagetables but also on bare metal without
-virtualization) and 2) a single TLB entry will be mapping a much
-larger amount of virtual memory in turn reducing the number of TLB
-misses. With virtualization and nested pagetables the TLB can be
-mapped of larger size only if both KVM and the Linux guest are using
-hugepages but a significant speedup already happens if only one of the
-two is using hugepages just because of the fact the TLB miss is going
-to run faster.
+virtualization using nested pagetables but almost always also on bare
+metal without virtualization) and 2) a single TLB entry will be
+mapping a much larger amount of virtual memory in turn reducing the
+number of TLB misses. With virtualization and nested pagetables the
+TLB can be mapped of larger size only if both KVM and the Linux guest
+are using hugepages but a significant speedup already happens if only
+one of the two is using hugepages just because of the fact the TLB
+miss is going to run faster.
 
 == Design ==
 

> > +With virtualization and nested pagetables the TLB can be
> > +mapped of larger size only if both KVM and the Linux guest are using
> > +hugepages but a significant speedup already happens if only one of the
> > +two is using hugepages just because of the fact the TLB miss is going
> > +to run faster.
> > +
> > +== Design ==
> > +
> > +- "graceful fallback": mm components which don't have transparent
> > +  hugepage knownledge fall back to breaking a transparent hugepage and
> 
> %s/knownledge/knowledge/

corrected.

> > +  working on the regular pages and their respective regular pmd/pte
> > +  mappings
> > +
> > +- if an hugepage allocation fails because of memory fragmentation,
> 
> s/an/a/

I'll trust you on this one ;).

> > +  regular pages should be gracefully allocated instead and mixed in
> > +  the same vma without any failure or significant delay and generally
> > +  without userland noticing
> > +
> 
> why "generally"? At worst the application will see varying performance
> characteristics but that applies to a lot more than THP.

Removed "generally".

> > +- if some task quits and more hugepages become available (either
> > +  immediately in the buddy or through the VM), guest physical memory
> > +  backed by regular pages should be relocated on hugepages
> > +  automatically (with khugepaged)
> > +
> > +- it doesn't require boot-time memory reservation and in turn it uses
> 
> neither does hugetlbfs.
> 
> > +  hugepages whenever possible (the only possible reservation here is
> > +  kernelcore= to avoid unmovable pages to fragment all the memory but
> > +  such a tweak is not specific to transparent hugepage support and
> > +  it's a generic feature that applies to all dynamic high order
> > +  allocations in the kernel)
> > +
> > +- this initial support only offers the feature in the anonymous memory
> > +  regions but it'd be ideal to move it to tmpfs and the pagecache
> > +  later
> > +
> > +Transparent Hugepage Support maximizes the usefulness of free memory
> > +if compared to the reservation approach of hugetlbfs by allowing all
> > +unused memory to be used as cache or other movable (or even unmovable
> > +entities).
> 
> hugetlbfs with memory overcommit offers something similar, particularly
> in combination with libhugetlbfs with can automatically fall back to base
> pages. I've run benchmarks comparing hugetlbfs using a static hugepage
> pool with hugetlbfs dynamically allocating hugepages as required with no
> discernable performance difference. So this statement is not strictly accurate.

Ok, but without libhugetlbfs falling back to regular pages by
splitting the vma and creating a no-hugetlbfs one in the middle, it
really requires memory reservation to be _sure_ (libhugetlbfs runs
outside of hugetlbfs and I doubt it allows later collapsing like
khugepaged provides; furthermore I'm comparing kernel vs kernel, not
kernel vs kernel+userland wrapping)... So I think mentioning the fact
THP doesn't require memory reservation is more or less fair enough. I
removed "boot-time" though.

> > +It doesn't require reservation to prevent hugepage
> > +allocation failures to be noticeable from userland. It allows paging
> > +and all other advanced VM features to be available on the
> > +hugepages. It requires no modifications for applications to take
> > +advantage of it.
> > +
> > +Applications however can be further optimized to take advantage of
> > +this feature, like for example they've been optimized before to avoid
> > +a flood of mmap system calls for every malloc(4k). Optimizing userland
> > +is by far not mandatory and khugepaged already can take care of long
> > +lived page allocations even for hugepage unaware applications that
> > +deals with large amounts of memory.
> > +
> > +In certain cases when hugepages are enabled system wide, application
> > +may end up allocating more memory resources. An application may mmap a
> > +large region but only touch 1 byte of it, in that case a 2M page might
> > +be allocated instead of a 4k page for no good. This is why it's
> > +possible to disable hugepages system-wide and to only have them inside
> > +MADV_HUGEPAGE madvise regions.
> > +
> > +Embedded systems should enable hugepages only inside madvise regions
> > +to eliminate any risk of wasting any precious byte of memory and to
> > +only run faster.
> > +
> > +Applications that gets a lot of benefit from hugepages and that don't
> > +risk to lose memory by using hugepages, should use
> > +madvise(MADV_HUGEPAGE) on their critical mmapped regions.
> > +
> > +== sysfs ==
> > +
> > +Transparent Hugepage Support can be entirely disabled (mostly for
> > +debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
> > +avoid the risk of consuming more memory resources) or enabled system
> > +wide. This can be achieved with one of:
> > +
> > +echo always >/sys/kernel/mm/transparent_hugepage/enabled
> > +echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > +echo never >/sys/kernel/mm/transparent_hugepage/enabled
> > +
> > +It's also possible to limit defrag efforts in the VM to generate
> > +hugepages in case they're not immediately free to madvise regions or
> > +to never try to defrag memory and simply fallback to regular pages
> > +unless hugepages are immediately available.
> 
> This is the first mention of defrag but hey, it's not a paper :)

Not sure I get this one, is it too early or too late to mention
defrag? But yes this is not a paper so I guess I don't need to care ;)

> > +and how many milliseconds to wait in khugepaged between each pass (you
> > +can se this to 0 to run khugepaged at 100% utilization of one core):
> 
> s/se/set/

good eye ;)

> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> > +
> > +and how many milliseconds to wait in khugepaged if there's an hugepage
> > +allocation failure to throttle the next allocation attempt.
> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> > +
> > +The khugepaged progress can be seen in the number of pages collapsed:
> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
> > +
> > +for each pass:
> > +
> > +/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
> > +
> > +== Boot parameter ==
> > +
> > +You can change the sysfs boot time defaults of Transparent Hugepage
> > +Support by passing the parameter "transparent_hugepage=always" or
> > +"transparent_hugepage=madvise" or "transparent_hugepage=never"
> > +(without "") to the kernel command line.
> > +
> > +== Need of restart ==
> > +
> 
> Need of application restart?
> 
> A casual reader might otherwise interpret it as a system restart for
> some godawful reason of their own.

Agreed.

> > +The transparent_hugepage/enabled values only affect future
> > +behavior. So to make them effective you need to restart any
> 
> s/behavior/behaviour/

Funny, after this change my spell checker asks me to rename behaviour
back to behavior. I'll stick to the spell checker, they are synonymous
anyway.

> > +In case you can't handle compound pages if they're returned by
> > +follow_page, the FOLL_SPLIT bit can be specified as parameter to
> > +follow_page, so that it will split the hugepages before returning
> > +them. Migration for example passes FOLL_SPLIT as parameter to
> > +follow_page because it's not hugepage aware and in fact it can't work
> > +at all on hugetlbfs (but it instead works fine on transparent
> 
> hugetlbfs pages can now migrate although it's only used by hwpoison.

Yep. I'll need to teach migrate to avoid splitting the hugepage to
migrate it too... (especially for numa, not so much for hwpoison). And
the migration support for hugetlbfs now makes it more complicated to
migrate transhuge pages too with the same function, because that code
isn't inside a VM_HUGETLB check... Worst case we can check the hugepage
destructor to differentiate the two, I've yet to check that. Surely
it's feasible and mostly an implementation issue.

> > +Code walking pagetables but unware about huge pmds can simply call
> > +split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
> > +pmd_offset. It's trivial to make the code transparent hugepage aware
> > +by just grepping for "pmd_offset" and adding split_huge_page_pmd where
> > +missing after pmd_offset returns the pmd. Thanks to the graceful
> > +fallback design, with a one liner change, you can avoid to write
> > +hundred if not thousand of lines of complex code to make your code
> > +hugepage aware.
> > +
> 
> It'd be nice if you could point to a specific example but by no means
> mandatory.

Ok:

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -226,6 +226,20 @@ but you can't handle it natively in your
 calling split_huge_page(page). This is what the Linux VM does before
 it tries to swapout the hugepage for example.
 
+Example to make mremap.c transparent hugepage aware with a one liner
+change:
+
+diff --git a/mm/mremap.c b/mm/mremap.c
+--- a/mm/mremap.c
++++ b/mm/mremap.c
+@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
+ 		return NULL;
+ 
+ 	pmd = pmd_offset(pud, addr);
++	split_huge_page_pmd(mm, pmd);
+ 	if (pmd_none_or_clear_bad(pmd))
+ 		return NULL;
+
 == Locking in hugepage aware code ==
 
 We want as much code as possible hugepage aware, as calling


> Ok, I'll need to read the rest of the series to verify if this is
> correct but by and large it looks good. I think some of the language is
> stronger than it should be and some of the comparisons with libhugetlbfs
> are a bit off but I'd be naturally defensive on that topic. Make the
> suggested changes if you like but if you don't, it shouldn't affect the
> series.

I hope it looks better now, thanks!

---------
= Transparent Hugepage Support =

== Objective ==

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent Hugepage Support is an alternative means of
using huge pages for the backing of virtual memory with huge pages
that supports the automatic promotion and demotion of page sizes and
without the shortcomings of hugetlbfs.

Currently it only works for anonymous memory mappings but in the
future it can expand over the pagecache layer starting with tmpfs.

The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
of significant interest because it also has the downside of requiring
larger clear-page and copy-page operations in page faults, which is a
potentially negative effect. The first factor consists of taking a
single page fault for each 2M virtual region touched by userland (so
reducing the enter/exit kernel frequency by a factor of 512). This
only matters the first time the memory is accessed for the lifetime of
a memory mapping. The second, long-lasting and much more important
factor affects all subsequent accesses to the memory for the whole
runtime of the application. The second factor consists of two
components: 1) the TLB miss will run faster (especially with
virtualization using nested pagetables but almost always also on bare
metal without virtualization) and 2) a single TLB entry will be
mapping a much larger amount of virtual memory, in turn reducing the
number of TLB misses. With virtualization and nested pagetables the
TLB can map the larger page size only if both KVM and the Linux guest
are using hugepages, but a significant speedup already happens if only
one of the two is using hugepages, just because the TLB miss is going
to run faster.

== Design ==

- "graceful fallback": mm components which don't have transparent
  hugepage knowledge fall back to breaking a transparent hugepage and
  working on the regular pages and their respective regular pmd/pte
  mappings

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed in
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated on hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to prevent unmovable pages from fragmenting all the memory, but such
  a tweak is not specific to transparent hugepage support and it's a
  generic feature that applies to all dynamic high order allocations
  in the kernel)

- this initial support only offers the feature in the anonymous memory
  regions but it'd be ideal to move it to tmpfs and the pagecache
  later

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or as other movable (or even
unmovable) entities. It doesn't require reservation to prevent
hugepage allocation failures from being noticeable from userland. It
allows paging and all other advanced VM features to be available on
the hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, like for example they've been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by far not mandatory and khugepaged already can take care of
long-lived page allocations even for hugepage unaware applications
that deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap
a large region but only touch 1 byte of it; in that case a 2M page
might be allocated instead of a 4k page for no good reason. This is
why it's possible to disable hugepages system-wide and to only have
them inside MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

== sysfs ==

Transparent Hugepage Support can be entirely disabled (mostly for
debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
avoid the risk of consuming more memory resources) or enabled system
wide. This can be achieved with one of:

echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit the VM's defrag efforts to generate
hugepages (in case they're not immediately free) to madvise regions
only, or to never try to defrag memory and simply fall back to
regular pages unless hugepages are immediately available. Clearly if
we spend CPU time to defrag memory, we would expect to gain even more
later by using hugepages instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However, it's
also possible to disable defrag in khugepaged:

echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core):

/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure, to throttle the next allocation attempt:

/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed:

/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full scans performed (one per pass):

/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

== Boot parameter ==

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter "transparent_hugepage=always" or
"transparent_hugepage=madvise" or "transparent_hugepage=never"
(without "") to the kernel command line.

== Need of application restart ==

The transparent_hugepage/enabled values only affect future
behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to
the regions registered in khugepaged.

== get_user_pages and follow_page ==

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most gup users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice the fact the page is huge. But
if any driver is going to mangle the page structure of the tail page
(like checking page->mapping or other bits that are relevant for the
head page and not the tail page), it should be updated to check the
head page instead (while serializing properly against
split_huge_page() to prevent the head and tail pages from disappearing
from under it; see the futex code for an example of that, hugetlbfs
also needed special handling in the futex code for similar reasons).
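
A minimal sketch of the "check the head page" rule just described,
using a made-up helper name; the serialization against
split_huge_page() required for full correctness (see the futex code)
is intentionally omitted here:

static inline struct address_space *gup_page_mapping(struct page *page)
{
        struct page *head = compound_head(page);

        /* ->mapping is only meaningful on the head page, not on the tail */
        return head->mapping;
}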

NOTE: these aren't new constraints to the GUP API, and they match the
same constraints that apply to hugetlbfs too, so any driver capable
of handling GUP on hugetlbfs will also work fine on transparent
hugepage backed mappings.

In case you can't handle compound pages if they're returned by
follow_page, the FOLL_SPLIT bit can be specified as a parameter to
follow_page, so that it will split the hugepages before returning
them. Migration for example passes FOLL_SPLIT as a parameter to
follow_page because it's not hugepage aware and in fact it can't work
at all on hugetlbfs (but it instead works fine on transparent
hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
hugepages being returned (it's not only checking the pfn of the
page and pinning it during the copy, it also expects to migrate the
memory in regular page sizes and with regular pte/pmd mappings).
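
Purely as an illustration (gup_small_page is a made-up helper, not an
in-tree function), a caller that cannot handle compound pages could
combine FOLL_GET with FOLL_SPLIT like this:

/* caller must hold mmap_sem; returns a regular (already split) page
 * with a reference held, or NULL if nothing usable is mapped there */
static struct page *gup_small_page(struct vm_area_struct *vma,
                                   unsigned long addr)
{
        return follow_page(vma, addr, FOLL_GET | FOLL_SPLIT);
}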

== Optimizing the applications ==

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.
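
For example, a minimal userland sketch of that pattern (the 2M
hugepage size is an assumption here, MADV_HUGEPAGE must be provided
by the installed kernel headers, and error handling is kept to a
minimum):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE_SIZE      (2UL * 1024 * 1024)     /* assumed hugepage size */
#define NR_HPAGES       8

int main(void)
{
        void *buf;

        /* hugepage naturally aligned allocation */
        if (posix_memalign(&buf, HPAGE_SIZE, NR_HPAGES * HPAGE_SIZE))
                return 1;

        /* hint that this range should be backed by transparent hugepages */
        madvise(buf, NR_HPAGES * HPAGE_SIZE, MADV_HUGEPAGE);

        /* the first touch of each 2M region can now fault in a hugepage */
        memset(buf, 0, NR_HPAGES * HPAGE_SIZE);

        free(buf);
        return 0;
}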

== Hugetlbfs ==

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.

== Graceful fallback ==

Code walking pagetables but unaware about huge pmds can simply call
split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one-liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does before
it tries to swap out the hugepage, for example.

Example to make mremap.c transparent hugepage aware with a one-liner
change:

diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
+	split_huge_page_pmd(mm, pmd);
 	if (pmd_none_or_clear_bad(pmd))
 		return NULL;

== Locking in hugepage aware code ==

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_page_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_sem in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged collapse_huge_page
takes the mmap_sem in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
mm->page_table_lock and re-run pmd_trans_huge. Taking the
page_table_lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_page can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page_table_lock and fall back to the old code as
before. Otherwise you should run pmd_trans_splitting on the pmd. In
case pmd_trans_splitting returns true, it means split_huge_page is
already in the middle of splitting the page. So if pmd_trans_splitting
returns true it's enough to drop the page_table_lock and call
wait_split_huge_page and then fall back to the old code paths. You are
guaranteed that by the time wait_split_huge_page returns, the pmd
isn't huge anymore. If pmd_trans_splitting returns false, you can
proceed to process the huge pmd and the hugepage natively. Once
finished you can drop the page_table_lock.
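
Putting those rules together, a huge pmd aware walk could look like
the following sketch (walk_huge_pmd is a made-up name and the
wait_split_huge_page arguments are an assumption, shown only to
illustrate the locking order described above):

/* returns 1 if the huge pmd was handled natively, 0 to use the pte path */
static int walk_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
                         pmd_t *pmd)
{
        /* caller holds mmap_sem in read or write mode */
        if (!pmd_trans_huge(*pmd))
                return 0;

        spin_lock(&mm->page_table_lock);
        if (unlikely(!pmd_trans_huge(*pmd))) {
                /* converted to a regular pmd from under us */
                spin_unlock(&mm->page_table_lock);
                return 0;
        }
        if (unlikely(pmd_trans_splitting(*pmd))) {
                /* split_huge_page is in the middle of splitting the page */
                spin_unlock(&mm->page_table_lock);
                wait_split_huge_page(vma->anon_vma, pmd);
                /* the pmd is guaranteed not to be huge anymore */
                return 0;
        }
        /* ... process the huge pmd and the hugepage natively here ... */
        spin_unlock(&mm->page_table_lock);
        return 1;
}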

== compound_lock, get_user_pages and put_page ==

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the
page structures. It can do that easily for refcounts taken by huge pmd
mappings. But the GUP API as created by hugetlbfs (which returns head
and tail pages if running get_user_pages on an address backed by any
hugepage) requires the refcount to be accounted on the tail pages and
not only on the head pages, if we want to be able to run
split_huge_page while there are gup pins established on any tail
page. Not being able to run split_huge_page if there's any gup pin on
any tail page would mean having to split all hugepages upfront in
get_user_pages, which is unacceptable as too many gup users are
performance critical and they must work natively on hugepages like
they work natively on hugetlbfs already (hugetlbfs is simpler because
hugetlbfs pages cannot be split, so there is no requirement to account
the pins on the tail pages for hugetlbfs). If we didn't account the
gup refcounts on the tail pages during gup, we wouldn't know anymore
which tail page is pinned by gup and which is not while we run
split_huge_page. But we still have to add the gup pin to the head page
too, to know when we can free the compound page in case it's never
split during its lifetime. That requires changing not just
get_page, but put_page as well so that when put_page runs on a tail
page (and only on a tail page) it will find its respective head page,
and then it will decrease the head page refcount in addition to the
tail page refcount. To obtain a head page reliably and to decrease its
refcount without race conditions, put_page has to serialize against
__split_huge_page_refcount using a special per-page lock called
compound_lock.


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 05 of 66] compound_lock
  2010-11-18 17:28       ` Linus Torvalds
@ 2010-11-25 16:21         ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 16:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 09:28:27AM -0800, Linus Torvalds wrote:
> On Thu, Nov 18, 2010 at 3:49 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> +
> >> +static inline void compound_lock_irqsave(struct page *page,
> >> +                                      unsigned long *flagsp)
> >> +{
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +     unsigned long flags;
> >> +     local_irq_save(flags);
> >> +     compound_lock(page);
> >> +     *flagsp = flags;
> >> +#endif
> >> +}
> >> +
> >
> > The pattern for spinlock irqsave passes in unsigned long, not unsigned
> > long *. It'd be nice if they matched.
> 
> Indeed. Just make the thing return the flags the way the normal
> spin_lock_irqsave() function does.

Done.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -320,15 +320,14 @@ static inline void compound_unlock(struc
 #endif
 }
 
-static inline void compound_lock_irqsave(struct page *page,
-					 unsigned long *flagsp)
+static inline unsigned long compound_lock_irqsave(struct page *page)
 {
+	unsigned long uninitialized_var(flags);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	unsigned long flags;
 	local_irq_save(flags);
 	compound_lock(page);
-	*flagsp = flags;
 #endif
+	return flags;
 }
 
 static inline void compound_unlock_irqrestore(struct page *page,


diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -94,7 +94,7 @@ static void put_compound_page(struct pag
 			 */
 			smp_mb();
 			/* page_head wasn't a dangling pointer */
-			compound_lock_irqsave(page_head, &flags);
+			flags = compound_lock_irqsave(page_head);
 			if (unlikely(!PageTail(page))) {
 				/* __split_huge_page_refcount run before us */
 				compound_unlock_irqrestore(page_head, flags);


Thanks,
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread


* Re: [PATCH 06 of 66] alter compound get_page/put_page
  2010-11-18 12:37     ` Mel Gorman
@ 2010-11-25 16:49       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 16:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 12:37:05PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:27:41PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Alter compound get_page/put_page to keep references on subpages too, in order
> > to allow __split_huge_page_refcount to split an hugepage even while subpages
> > have been pinned by one of the get_user_pages() variants.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > 
> > diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
> > --- a/arch/powerpc/mm/gup.c
> > +++ b/arch/powerpc/mm/gup.c
> > @@ -16,6 +16,16 @@
> >  
> >  #ifdef __HAVE_ARCH_PTE_SPECIAL
> >  
> > +static inline void pin_huge_page_tail(struct page *page)
> > +{
> 
> Minor nit, but get_huge_page_tail?
> 
> Even though "pin" is what it does, pin isn't used elsewhere in naming.

Agreed.

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,7 +16,7 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
-static inline void pin_huge_page_tail(struct page *page)
+static inline void get_huge_page_tail(struct page *page)
 {
 	/*
 	 * __split_huge_page_refcount() cannot run
@@ -58,7 +58,7 @@ static noinline int gup_pte_range(pmd_t 
 			return 0;
 		}
 		if (PageTail(page))
-			pin_huge_page_tail(page);
+			get_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -105,7 +105,7 @@ static inline void get_head_page_multipl
 	atomic_add(nr, &page->_count);
 }
 
-static inline void pin_huge_page_tail(struct page *page)
+static inline void get_huge_page_tail(struct page *page)
 {
 	/*
 	 * __split_huge_page_refcount() cannot run
@@ -139,7 +139,7 @@ static noinline int gup_huge_pmd(pmd_t p
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
 		if (PageTail(page))
-			pin_huge_page_tail(page);
+			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;


> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -105,6 +105,16 @@ static inline void get_head_page_multipl
> >  	atomic_add(nr, &page->_count);
> >  }
> >  
> > +static inline void pin_huge_page_tail(struct page *page)
> > +{
> > +	/*
> > +	 * __split_huge_page_refcount() cannot run
> > +	 * from under us.
> > +	 */
> > +	VM_BUG_ON(atomic_read(&page->_count) < 0);
> > +	atomic_inc(&page->_count);
> > +}
> > +
> 
> This is identical to the x86 implementation. Any possibility they can be
> shared?

There is no place for me today to put gup_fast "equal" bits, so I'd
need to create it, and just doing it for a single inline function
sounds overkill. I could add an asm-generic/gup_fast.h, move that
function there and include the asm-generic/gup_fast.h from
asm*/include/gup_fast.h; is that what we want just for one function?
With all #ifdefs included I would end up writing more code than the
function itself. No problem with me in doing it though.
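
For reference, the shared header being discussed could look something
like this (a hypothetical layout, not something posted in the series),
with the helper taken verbatim from the patch above:

#ifndef __ASM_GENERIC_GUP_FAST_H
#define __ASM_GENERIC_GUP_FAST_H

/* shared by the x86 and powerpc gup_fast implementations */
static inline void get_huge_page_tail(struct page *page)
{
        /*
         * __split_huge_page_refcount() cannot run
         * from under us.
         */
        VM_BUG_ON(atomic_read(&page->_count) < 0);
        atomic_inc(&page->_count);
}

#endif /* __ASM_GENERIC_GUP_FAST_H */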

> >  static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> >  		unsigned long end, int write, struct page **pages, int *nr)
> >  {
> > @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
> >  	do {
> >  		VM_BUG_ON(compound_head(page) != head);
> >  		pages[*nr] = page;
> > +		if (PageTail(page))
> > +			pin_huge_page_tail(page);
> >  		(*nr)++;
> >  		page++;
> >  		refs++;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -351,9 +351,17 @@ static inline int page_count(struct page
> >  
> >  static inline void get_page(struct page *page)
> >  {
> > -	page = compound_head(page);
> > -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> > +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> 
> Oof, this might need a comment. It's saying that getting a normal page or the
> head of a compound page must already have an elevated reference count. If
> we are getting a tail page, the reference count is stored in both the head
> and the tail so the BUG check does not apply.

Ok.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -353,8 +353,20 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires to already have an elevated page->_count. Only if
+	 * we're getting a tail page, the elevated page->_count is
+	 * required only in the head page, so for tail pages the
+	 * bugcheck only verifies that the page->_count isn't
+	 * negative.
+	 */
 	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	/*
+	 * Getting a tail page will elevate both the head and tail
+	 * page->_count(s).
+	 */
 	if (unlikely(PageTail(page))) {
 		/*
 		 * This is safe only because

> > diff --git a/mm/swap.c b/mm/swap.c
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -56,17 +56,83 @@ static void __page_cache_release(struct 
> >  		del_page_from_lru(zone, page);
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  	}
> > +}
> > +
> > +static void __put_single_page(struct page *page)
> > +{
> > +	__page_cache_release(page);
> >  	free_hot_cold_page(page, 0);
> >  }
> >  
> > +static void __put_compound_page(struct page *page)
> > +{
> > +	compound_page_dtor *dtor;
> > +
> > +	__page_cache_release(page);
> > +	dtor = get_compound_page_dtor(page);
> > +	(*dtor)(page);
> > +}
> > +
> >  static void put_compound_page(struct page *page)
> >  {
> > -	page = compound_head(page);
> > -	if (put_page_testzero(page)) {
> > -		compound_page_dtor *dtor;
> > -
> > -		dtor = get_compound_page_dtor(page);
> > -		(*dtor)(page);
> > +	if (unlikely(PageTail(page))) {
> > +		/* __split_huge_page_refcount can run under us */
> 
> So what? The fact you check PageTail twice is a hint as to what is
> happening and that we are depending on the order of when the head and
> tails bits get cleared but it's hard to be certain of that.

This is correct, we depend on the split_huge_page_refcount to clear
PageTail before overwriting first_page.

+	 	/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();

The first PageTail check is just to define the fast path and skip
reading page->first_page and doing smp_rmb() for the much more common
head pages (we're in an unlikely branch).

So then we read first_page, we smp_rmb(), and if PageTail is still set
after that, we can be sure first_page isn't a dangling pointer.

> 
> > +		struct page *page_head = page->first_page;
> > +		smp_rmb();
> > +		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
> > +			unsigned long flags;
> > +			if (unlikely(!PageHead(page_head))) {
> > +				/* PageHead is cleared after PageTail */
> > +				smp_rmb();
> > +				VM_BUG_ON(PageTail(page));
> > +				goto out_put_head;
> > +			}
> > +			/*
> > +			 * Only run compound_lock on a valid PageHead,
> > +			 * after having it pinned with
> > +			 * get_page_unless_zero() above.
> > +			 */
> > +			smp_mb();
> > +			/* page_head wasn't a dangling pointer */
> > +			compound_lock_irqsave(page_head, &flags);
> > +			if (unlikely(!PageTail(page))) {
> > +				/* __split_huge_page_refcount run before us */
> > +				compound_unlock_irqrestore(page_head, flags);
> > +				VM_BUG_ON(PageHead(page_head));
> > +			out_put_head:
> > +				if (put_page_testzero(page_head))
> > +					__put_single_page(page_head);
> > +			out_put_single:
> > +				if (put_page_testzero(page))
> > +					__put_single_page(page);
> > +				return;
> > +			}
> > +			VM_BUG_ON(page_head != page->first_page);
> > +			/*
> > +			 * We can release the refcount taken by
> > +			 * get_page_unless_zero now that
> > +			 * split_huge_page_refcount is blocked on the
> > +			 * compound_lock.
> > +			 */
> > +			if (put_page_testzero(page_head))
> > +				VM_BUG_ON(1);
> > +			/* __split_huge_page_refcount will wait now */
> > +			VM_BUG_ON(atomic_read(&page->_count) <= 0);
> > +			atomic_dec(&page->_count);
> > +			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> > +			compound_unlock_irqrestore(page_head, flags);
> > +			if (put_page_testzero(page_head))
> > +				__put_compound_page(page_head);
> > +		} else {
> > +			/* page_head is a dangling pointer */
> > +			VM_BUG_ON(PageTail(page));
> > +			goto out_put_single;
> > +		}
> > +	} else if (put_page_testzero(page)) {
> > +		if (PageHead(page))
> > +			__put_compound_page(page);
> > +		else
> > +			__put_single_page(page);
> >  	}
> >  }
> >  
> > @@ -75,7 +141,7 @@ void put_page(struct page *page)
> >  	if (unlikely(PageCompound(page)))
> >  		put_compound_page(page);
> >  	else if (put_page_testzero(page))
> > -		__page_cache_release(page);
> > +		__put_single_page(page);
> >  }
> >  EXPORT_SYMBOL(put_page);
> >  
> 
> Functionally, I don't see a problem so
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> 
> but some expansion on the leader and the comment, even if done as a
> follow-on patch, would be nice.

diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,8 +79,18 @@ static void put_compound_page(struct pag
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = page->first_page;
 		smp_rmb();
+		/*
+		 * If PageTail is still set after smp_rmb() we can be sure
+		 * that the page->first_page we read wasn't a dangling pointer.
+		 * See __split_huge_page_refcount() smp_wmb().
+		 */
 		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
 			unsigned long flags;
+			/*
+			 * Verify that our page_head wasn't converted
+			 * to a a regular page before we got a
+			 * reference on it.
+			 */
 			if (unlikely(!PageHead(page_head))) {
 				/* PageHead is cleared after PageTail */
 				smp_rmb();

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 06 of 66] alter compound get_page/put_page
@ 2010-11-25 16:49       ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 16:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 12:37:05PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:27:41PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Alter compound get_page/put_page to keep references on subpages too, in order
> > to allow __split_huge_page_refcount to split an hugepage even while subpages
> > have been pinned by one of the get_user_pages() variants.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > 
> > diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
> > --- a/arch/powerpc/mm/gup.c
> > +++ b/arch/powerpc/mm/gup.c
> > @@ -16,6 +16,16 @@
> >  
> >  #ifdef __HAVE_ARCH_PTE_SPECIAL
> >  
> > +static inline void pin_huge_page_tail(struct page *page)
> > +{
> 
> Minor nit, but get_huge_page_tail?
> 
> Even though "pin" is what it does, pin isn't used elsewhere in naming.

Agreed.

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -16,7 +16,7 @@
 
 #ifdef __HAVE_ARCH_PTE_SPECIAL
 
-static inline void pin_huge_page_tail(struct page *page)
+static inline void get_huge_page_tail(struct page *page)
 {
 	/*
 	 * __split_huge_page_refcount() cannot run
@@ -58,7 +58,7 @@ static noinline int gup_pte_range(pmd_t 
 			return 0;
 		}
 		if (PageTail(page))
-			pin_huge_page_tail(page);
+			get_huge_page_tail(page);
 		pages[*nr] = page;
 		(*nr)++;
 
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -105,7 +105,7 @@ static inline void get_head_page_multipl
 	atomic_add(nr, &page->_count);
 }
 
-static inline void pin_huge_page_tail(struct page *page)
+static inline void get_huge_page_tail(struct page *page)
 {
 	/*
 	 * __split_huge_page_refcount() cannot run
@@ -139,7 +139,7 @@ static noinline int gup_huge_pmd(pmd_t p
 		VM_BUG_ON(compound_head(page) != head);
 		pages[*nr] = page;
 		if (PageTail(page))
-			pin_huge_page_tail(page);
+			get_huge_page_tail(page);
 		(*nr)++;
 		page++;
 		refs++;


> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -105,6 +105,16 @@ static inline void get_head_page_multipl
> >  	atomic_add(nr, &page->_count);
> >  }
> >  
> > +static inline void pin_huge_page_tail(struct page *page)
> > +{
> > +	/*
> > +	 * __split_huge_page_refcount() cannot run
> > +	 * from under us.
> > +	 */
> > +	VM_BUG_ON(atomic_read(&page->_count) < 0);
> > +	atomic_inc(&page->_count);
> > +}
> > +
> 
> This is identical to the x86 implementation. Any possibility they can be
> shared?

There is currently no place to put gup_fast "equal" bits, so I'd need
to create one, and doing that just for a single inline function sounds
like overkill. I could add an asm-generic/gup_fast.h, move that
function there, and include asm-generic/gup_fast.h from
asm*/include/gup_fast.h, but is that what we want just for one
function? With all the #ifdefs included I would end up writing more
code than the function itself. No problem with doing it, though.
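
To make the trade-off concrete, the plumbing would look roughly like the
sketch below (the file names and include guards are only the hypothetical
ones discussed above, nothing that exists in the tree; the function body is
the one from the patch):

/* include/asm-generic/gup_fast.h -- hypothetical */
#ifndef _ASM_GENERIC_GUP_FAST_H
#define _ASM_GENERIC_GUP_FAST_H

#include <linux/mm_types.h>	/* plus whatever VM_BUG_ON()/atomic_*() need */

static inline void get_huge_page_tail(struct page *page)
{
	/*
	 * __split_huge_page_refcount() cannot run
	 * from under us.
	 */
	VM_BUG_ON(atomic_read(&page->_count) < 0);
	atomic_inc(&page->_count);
}

#endif /* _ASM_GENERIC_GUP_FAST_H */

/* arch/x86/include/asm/gup_fast.h -- hypothetical -- would then be just: */
#ifndef _ASM_X86_GUP_FAST_H
#define _ASM_X86_GUP_FAST_H
#include <asm-generic/gup_fast.h>
#endif /* _ASM_X86_GUP_FAST_H */

i.e. roughly as much boilerplate as actual code, which is the point above.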

> >  static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> >  		unsigned long end, int write, struct page **pages, int *nr)
> >  {
> > @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
> >  	do {
> >  		VM_BUG_ON(compound_head(page) != head);
> >  		pages[*nr] = page;
> > +		if (PageTail(page))
> > +			pin_huge_page_tail(page);
> >  		(*nr)++;
> >  		page++;
> >  		refs++;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -351,9 +351,17 @@ static inline int page_count(struct page
> >  
> >  static inline void get_page(struct page *page)
> >  {
> > -	page = compound_head(page);
> > -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> > +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> 
> Oof, this might need a comment. It's saying that getting a normal page or the
> head of a compound page must already have an elevated reference count. If
> we are getting a tail page, the reference count is stored in both the head
> and the tail so the BUG check does not apply.

Ok.

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -353,8 +353,20 @@ static inline int page_count(struct page
 
 static inline void get_page(struct page *page)
 {
+	/*
+	 * Getting a normal page or the head of a compound page
+	 * requires an already elevated page->_count. When getting
+	 * a tail page instead, the elevated page->_count is
+	 * required only in the head page, so for tail pages the
+	 * bugcheck only verifies that the page->_count isn't
+	 * negative.
+	 */
 	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
 	atomic_inc(&page->_count);
+	/*
+	 * Getting a tail page will elevate both the head and tail
+	 * page->_count(s).
+	 */
 	if (unlikely(PageTail(page))) {
 		/*
 		 * This is safe only because

> > diff --git a/mm/swap.c b/mm/swap.c
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -56,17 +56,83 @@ static void __page_cache_release(struct 
> >  		del_page_from_lru(zone, page);
> >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> >  	}
> > +}
> > +
> > +static void __put_single_page(struct page *page)
> > +{
> > +	__page_cache_release(page);
> >  	free_hot_cold_page(page, 0);
> >  }
> >  
> > +static void __put_compound_page(struct page *page)
> > +{
> > +	compound_page_dtor *dtor;
> > +
> > +	__page_cache_release(page);
> > +	dtor = get_compound_page_dtor(page);
> > +	(*dtor)(page);
> > +}
> > +
> >  static void put_compound_page(struct page *page)
> >  {
> > -	page = compound_head(page);
> > -	if (put_page_testzero(page)) {
> > -		compound_page_dtor *dtor;
> > -
> > -		dtor = get_compound_page_dtor(page);
> > -		(*dtor)(page);
> > +	if (unlikely(PageTail(page))) {
> > +		/* __split_huge_page_refcount can run under us */
> 
> So what? The fact you check PageTail twice is a hint as to what is
> happening and that we are depending on the order of when the head and
> tail bits get cleared, but it's hard to be certain of that.

This is correct, we depend on the split_huge_page_refcount to clear
PageTail before overwriting first_page.

+	 	/*
+		 * 1) clear PageTail before overwriting first_page
+		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
+		 */
+		smp_wmb();

The first PageTail check just defines the fast path, skipping the read
of page->first_page and the smp_rmb() for the much more common head
pages (we're in an unlikely branch).

So then we read first_page, we smp_rmb(), and if PageTail is still set
after that, we can be sure first_page isn't a dangling pointer.
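
For what it's worth, the ordering argument can also be modelled outside the
kernel as a standalone C11 sketch (an approximation only: the default seq_cst
atomics below stand in for the smp_wmb()/smp_rmb() pairing, and model_page is
obviously not struct page):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	atomic_bool tail;			/* models PageTail         */
	struct model_page *_Atomic first_page;	/* models page->first_page */
};

/* __split_huge_page_refcount() side: clear the flag, then overwrite. */
static void split_side(struct model_page *p)
{
	atomic_store(&p->tail, false);		/* 1) clear PageTail       */
	atomic_store(&p->first_page, NULL);	/* 2) only then overwrite  */
}

/* put_compound_page() side: read the pointer, then re-check the flag. */
static struct model_page *put_side(struct model_page *p)
{
	struct model_page *head = atomic_load(&p->first_page);

	if (atomic_load(&p->tail))
		return head;	/* flag still set: head is not dangling    */
	return NULL;		/* the split ran first: take the slow path */
}

If put_side() still observes the flag set after reading first_page, the
overwrite in split_side() cannot have been visible to that earlier read,
which is exactly what the new comment relies on.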

> 
> > +		struct page *page_head = page->first_page;
> > +		smp_rmb();
> > +		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
> > +			unsigned long flags;
> > +			if (unlikely(!PageHead(page_head))) {
> > +				/* PageHead is cleared after PageTail */
> > +				smp_rmb();
> > +				VM_BUG_ON(PageTail(page));
> > +				goto out_put_head;
> > +			}
> > +			/*
> > +			 * Only run compound_lock on a valid PageHead,
> > +			 * after having it pinned with
> > +			 * get_page_unless_zero() above.
> > +			 */
> > +			smp_mb();
> > +			/* page_head wasn't a dangling pointer */
> > +			compound_lock_irqsave(page_head, &flags);
> > +			if (unlikely(!PageTail(page))) {
> > +				/* __split_huge_page_refcount run before us */
> > +				compound_unlock_irqrestore(page_head, flags);
> > +				VM_BUG_ON(PageHead(page_head));
> > +			out_put_head:
> > +				if (put_page_testzero(page_head))
> > +					__put_single_page(page_head);
> > +			out_put_single:
> > +				if (put_page_testzero(page))
> > +					__put_single_page(page);
> > +				return;
> > +			}
> > +			VM_BUG_ON(page_head != page->first_page);
> > +			/*
> > +			 * We can release the refcount taken by
> > +			 * get_page_unless_zero now that
> > +			 * split_huge_page_refcount is blocked on the
> > +			 * compound_lock.
> > +			 */
> > +			if (put_page_testzero(page_head))
> > +				VM_BUG_ON(1);
> > +			/* __split_huge_page_refcount will wait now */
> > +			VM_BUG_ON(atomic_read(&page->_count) <= 0);
> > +			atomic_dec(&page->_count);
> > +			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> > +			compound_unlock_irqrestore(page_head, flags);
> > +			if (put_page_testzero(page_head))
> > +				__put_compound_page(page_head);
> > +		} else {
> > +			/* page_head is a dangling pointer */
> > +			VM_BUG_ON(PageTail(page));
> > +			goto out_put_single;
> > +		}
> > +	} else if (put_page_testzero(page)) {
> > +		if (PageHead(page))
> > +			__put_compound_page(page);
> > +		else
> > +			__put_single_page(page);
> >  	}
> >  }
> >  
> > @@ -75,7 +141,7 @@ void put_page(struct page *page)
> >  	if (unlikely(PageCompound(page)))
> >  		put_compound_page(page);
> >  	else if (put_page_testzero(page))
> > -		__page_cache_release(page);
> > +		__put_single_page(page);
> >  }
> >  EXPORT_SYMBOL(put_page);
> >  
> 
> Functionally, I don't see a problem so
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> 
> but some expansion on the leader and the comment, even if done as a
> follow-on patch, would be nice.

diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -79,8 +79,18 @@ static void put_compound_page(struct pag
 		/* __split_huge_page_refcount can run under us */
 		struct page *page_head = page->first_page;
 		smp_rmb();
+		/*
+		 * If PageTail is still set after smp_rmb() we can be sure
+		 * that the page->first_page we read wasn't a dangling pointer.
+		 * See __split_huge_page_refcount() smp_wmb().
+		 */
 		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
 			unsigned long flags;
+			/*
+			 * Verify that our page_head wasn't converted
+			 * to a regular page before we got a
+			 * reference on it.
+			 */
 			if (unlikely(!PageHead(page_head))) {
 				/* PageHead is cleared after PageTail */
 				smp_rmb();

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 16 of 66] special pmd_trans_* functions
  2010-11-18 12:51     ` Mel Gorman
@ 2010-11-25 17:10       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 17:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 12:51:12PM +0000, Mel Gorman wrote:
> Usually it is insisted upon that this looks like
> 
> static inline int pmd_trans_huge(pmd) {
> 	return 0;
> }
> 
> I understand it's to avoid any possibility of side-effects though to have type
> checking and I am 99% certain the compiler still does the right thing. Still,
> with no obvious side-effects here;
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>

It doesn't seem to break the build on x86-64 or x86, so it should build
on all other archs too. I'm keeping this incremental at the end just
in case.

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -471,10 +471,20 @@ extern void untrack_pfn_vma(struct vm_ar
 #endif
 
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_trans_huge(pmd) 0
-#define pmd_trans_splitting(pmd) 0
+static inline int pmd_trans_huge(pmd_t pmd)
+{
+	return 0;
+}
+static inline int pmd_trans_splitting(pmd_t pmd)
+{
+	return 0;
+}
 #ifndef __HAVE_ARCH_PMD_WRITE
-#define pmd_write(pmd)	({ BUG(); 0; })
+static inline int pmd_write(pmd_t pmd)
+{
+	BUG();
+	return 0;
+}
 #endif /* __HAVE_ARCH_PMD_WRITE */
 #endif
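
As an aside, the type checking Mel refers to is easy to see in a standalone
sketch (hypothetical stand-in pmd_t and helper names, nothing from the tree):

typedef struct { unsigned long pmd; } pmd_t;	/* stand-in, not the real one */

#define pmd_trans_huge_macro(pmd) 0		/* accepts any argument */

static inline int pmd_trans_huge(pmd_t pmd)	/* argument is type checked */
{
	return 0;
}

int example(pmd_t *pmdp)
{
	int a = pmd_trans_huge_macro(pmdp);	/* wrong type, compiles silently */
	int b = pmd_trans_huge(*pmdp);		/* pmd_trans_huge(pmdp) would be
						   rejected by the compiler */
	return a + b;
}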
 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 17 of 66] add pmd mangling generic functions
  2010-11-18 17:32       ` Linus Torvalds
@ 2010-11-25 17:35         ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-25 17:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 09:32:36AM -0800, Linus Torvalds wrote:
> I dunno. Those macros are _way_ too big and heavy to be macros or
> inline functions. Why aren't pmdp_splitting_flush() etc just
> functions?

That's because ptep_clear_flush() and everything else in that file
named ptep_* that does expensive TLB flushes was already a macro.

> 
> There is no performance advantage to inlining them - the TLB flush is
> going to be expensive enough that there's no point in avoiding a
> function call. And that header file really does end up being _really_
> ugly.

I agree, but to me it looks like your complaint applies to the current
include/asm-generic/pgtable.h. My changes simply mirror the style of
that file closely, to avoid altering the coding style just for the new
stuff. That is compact code that I'm not even sure anybody is using;
x86 certainly isn't using it, so the .text bloat isn't a practical
concern.

Would you like me to do a cleanup of asm-generic/pgtable.h and move
the TLB flushes to lib/pgtable.c? (Ideally the pgtable.o file should
be empty after the preprocessor runs on x86*.)
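
The direction would be something like the sketch below (only to show the
shape; the signature is modelled on the other pmdp_*/ptep_* helpers and the
body on the generic pmdp_splitting_flush in this series, so treat the names
as assumptions rather than a tested patch):

/* lib/pgtable.c -- sketch */
#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
void pmdp_splitting_flush(struct vm_area_struct *vma,
			  unsigned long address, pmd_t *pmdp)
{
	pmd_t pmd = pmd_mksplitting(*pmdp);

	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
	/* the tlb flush only serializes against gup-fast */
	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
}
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

x86 would keep its own __HAVE_ARCH_* implementations, so the generic copies
would compile away there and pgtable.o would stay empty.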

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 03 of 66] transparent hugepage support documentation
  2010-11-25 14:35       ` Andrea Arcangeli
@ 2010-11-26 11:40         ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-26 11:40 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 25, 2010 at 03:35:20PM +0100, Andrea Arcangeli wrote:
> > <SNIP>

I agree with all the changes you made up to this point.

> > > +- if some task quits and more hugepages become available (either
> > > +  immediately in the buddy or through the VM), guest physical memory
> > > +  backed by regular pages should be relocated on hugepages
> > > +  automatically (with khugepaged)
> > > +
> > > +- it doesn't require boot-time memory reservation and in turn it uses
> > 
> > neither does hugetlbfs.
> > 
> > > +  hugepages whenever possible (the only possible reservation here is
> > > +  kernelcore= to avoid unmovable pages to fragment all the memory but
> > > +  such a tweak is not specific to transparent hugepage support and
> > > +  it's a generic feature that applies to all dynamic high order
> > > +  allocations in the kernel)
> > > +
> > > +- this initial support only offers the feature in the anonymous memory
> > > +  regions but it'd be ideal to move it to tmpfs and the pagecache
> > > +  later
> > > +
> > > +Transparent Hugepage Support maximizes the usefulness of free memory
> > > +if compared to the reservation approach of hugetlbfs by allowing all
> > > +unused memory to be used as cache or other movable (or even unmovable
> > > +entities).
> > 
> > hugetlbfs with memory overcommit offers something similar, particularly
> > in combination with libhugetlbfs which can automatically fall back to base
> > pages. I've run benchmarks comparing hugetlbfs using a static hugepage
> > pool with hugetlbfs dynamically allocating hugepages as required with no
> > discernable performance difference. So this statement is not strictly accurate.
> 
> Ok, but without libhugetlbfs fallback to regular pages splitting the
> vma and creating a no-hugetlbfs one in the middle, it really requires
> memory reservation to be _sure_ (libhugetlbfs runs outside of hugetlbfs
> and I doubt it allows later collapsing like khugepaged provides,
> furthermore I'm comparing kernel vs kernel not kernel vs
> kernel+userland wrapping)... So I think mentioning the fact that THP doesn't
> require memory reservation is more or less fair enough.

Ok, what you say is true. There is no collapsing of huge pages in
hugetlbfs and reservations are required to be sure.

> I removed
> "boot-time" though.
> 
> > > +It doesn't require reservation to prevent hugepage
> > > +allocation failures to be noticeable from userland. It allows paging
> > > +and all other advanced VM features to be available on the
> > > +hugepages. It requires no modifications for applications to take
> > > +advantage of it.
> > > +
> > > +Applications however can be further optimized to take advantage of
> > > +this feature, like for example they've been optimized before to avoid
> > > +a flood of mmap system calls for every malloc(4k). Optimizing userland
> > > +is by far not mandatory and khugepaged already can take care of long
> > > +lived page allocations even for hugepage unaware applications that
> > > +deals with large amounts of memory.
> > > +
> > > +In certain cases when hugepages are enabled system wide, application
> > > +may end up allocating more memory resources. An application may mmap a
> > > +large region but only touch 1 byte of it, in that case a 2M page might
> > > +be allocated instead of a 4k page for no good. This is why it's
> > > +possible to disable hugepages system-wide and to only have them inside
> > > +MADV_HUGEPAGE madvise regions.
> > > +
> > > +Embedded systems should enable hugepages only inside madvise regions
> > > +to eliminate any risk of wasting any precious byte of memory and to
> > > +only run faster.
> > > +
> > > +Applications that gets a lot of benefit from hugepages and that don't
> > > +risk to lose memory by using hugepages, should use
> > > +madvise(MADV_HUGEPAGE) on their critical mmapped regions.
> > > +
> > > +== sysfs ==
> > > +
> > > +Transparent Hugepage Support can be entirely disabled (mostly for
> > > +debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
> > > +avoid the risk of consuming more memory resources) or enabled system
> > > +wide. This can be achieved with one of:
> > > +
> > > +echo always >/sys/kernel/mm/transparent_hugepage/enabled
> > > +echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > > +echo never >/sys/kernel/mm/transparent_hugepage/enabled
> > > +
> > > +It's also possible to limit defrag efforts in the VM to generate
> > > +hugepages in case they're not immediately free to madvise regions or
> > > +to never try to defrag memory and simply fallback to regular pages
> > > +unless hugepages are immediately available.
> > 
> > This is the first mention of defrag but hey, it's not a paper :)
> 
> Not sure I get this; is it too early or too late to mention
> defrag? But yes this is not a paper so I guess I don't need to care ;)
> 

You don't need to care. The reader just has to know what defrag means
here. It's not worth sweating over.

> > > <SNIP>

I agree with all the changes.

> 
> > > +The transparent_hugepage/enabled values only affect future
> > > +behavior. So to make them effective you need to restart any
> > 
> > s/behavior/behaviour/
> 
> Funny, after this change my spell checker asks me to rename behaviour
> back to behavior. I'll stick to the spell checker, they are synonymous
> anyway.
> 

They are. Might be a difference in UK and US spelling again.

> > > +In case you can't handle compound pages if they're returned by
> > > +follow_page, the FOLL_SPLIT bit can be specified as parameter to
> > > +follow_page, so that it will split the hugepages before returning
> > > +them. Migration for example passes FOLL_SPLIT as parameter to
> > > +follow_page because it's not hugepage aware and in fact it can't work
> > > +at all on hugetlbfs (but it instead works fine on transparent
> > 
> > hugetlbfs pages can now migrate although it's only used by hwpoison.
> 
> Yep. I'll need to teach migrate to avoid splitting the hugepage to
> migrate it too... (especially for numa, not much for hwpoison). And
> the migration support for hugetlbfs now makes it more complicated to
> migrate transhuge pages too with the same function because that code
> isn't inside a VM_HUGETLB check... Worst case we can check the hugepage
> destructor to differentiate the two, I've yet to check that. Surely
> it's feasible and mostly an implementation issue.
> 

It should be feasible.

> > > +Code walking pagetables but unware about huge pmds can simply call
> > > +split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
> > > +pmd_offset. It's trivial to make the code transparent hugepage aware
> > > +by just grepping for "pmd_offset" and adding split_huge_page_pmd where
> > > +missing after pmd_offset returns the pmd. Thanks to the graceful
> > > +fallback design, with a one liner change, you can avoid to write
> > > +hundred if not thousand of lines of complex code to make your code
> > > +hugepage aware.
> > > +
> > 
> > It'd be nice if you could point to a specific example but by no means
> > mandatory.
> 
> Ok:
> 
> diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> --- a/Documentation/vm/transhuge.txt
> +++ b/Documentation/vm/transhuge.txt
> @@ -226,6 +226,20 @@ but you can't handle it natively in your
>  calling split_huge_page(page). This is what the Linux VM does before
>  it tries to swapout the hugepage for example.
>  
> +Example to make mremap.c transparent hugepage aware with a one liner
> +change:
> +
> +diff --git a/mm/mremap.c b/mm/mremap.c
> +--- a/mm/mremap.c
> ++++ b/mm/mremap.c
> +@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
> + 		return NULL;
> + 
> + 	pmd = pmd_offset(pud, addr);
> ++	split_huge_page_pmd(mm, pmd);
> + 	if (pmd_none_or_clear_bad(pmd))
> + 		return NULL;
> +
>  == Locking in hugepage aware code ==
>  

Perfect.

>  We want as much code as possible hugepage aware, as calling
> 
> 
> > Ok, I'll need to read the rest of the series to verify if this is
> > correct but by and large it looks good. I think some of the language is
> > stronger than it should be and some of the comparisons with libhugetlbfs
> > are a bit off but I'd be naturally defensive on that topic. Make the
> > suggested changes if you like but if you don't, it shouldn't affect the
> > series.
> 
> I hope it looks better now, thanks!
> 

It does, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 06 of 66] alter compound get_page/put_page
  2010-11-25 16:49       ` Andrea Arcangeli
@ 2010-11-26 11:46         ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-26 11:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 25, 2010 at 05:49:16PM +0100, Andrea Arcangeli wrote:
> > > <SNIP>

The updates are fine.

> > > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > > --- a/arch/x86/mm/gup.c
> > > +++ b/arch/x86/mm/gup.c
> > > @@ -105,6 +105,16 @@ static inline void get_head_page_multipl
> > >  	atomic_add(nr, &page->_count);
> > >  }
> > >  
> > > +static inline void pin_huge_page_tail(struct page *page)
> > > +{
> > > +	/*
> > > +	 * __split_huge_page_refcount() cannot run
> > > +	 * from under us.
> > > +	 */
> > > +	VM_BUG_ON(atomic_read(&page->_count) < 0);
> > > +	atomic_inc(&page->_count);
> > > +}
> > > +
> > 
> > This is identical to the x86 implementation. Any possibility they can be
> > shared?
> 
> There is currently no place to put gup_fast "equal" bits, so I'd need
> to create one, and doing that just for a single inline function sounds
> like overkill. I could add an asm-generic/gup_fast.h, move that
> function there, and include asm-generic/gup_fast.h from
> asm*/include/gup_fast.h, but is that what we want just for one
> function? With all the #ifdefs included I would end up writing more
> code than the function itself. No problem with doing it, though.
> 

Now that you've explained what is necessary, it's not worth it as part of the
series. It's sizable enough as it is for review. Consider it as a follow-up
patch maybe.

> > >  static noinline int gup_huge_pmd(pmd_t pmd, unsigned long addr,
> > >  		unsigned long end, int write, struct page **pages, int *nr)
> > >  {
> > > @@ -128,6 +138,8 @@ static noinline int gup_huge_pmd(pmd_t p
> > >  	do {
> > >  		VM_BUG_ON(compound_head(page) != head);
> > >  		pages[*nr] = page;
> > > +		if (PageTail(page))
> > > +			pin_huge_page_tail(page);
> > >  		(*nr)++;
> > >  		page++;
> > >  		refs++;
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -351,9 +351,17 @@ static inline int page_count(struct page
> > >  
> > >  static inline void get_page(struct page *page)
> > >  {
> > > -	page = compound_head(page);
> > > -	VM_BUG_ON(atomic_read(&page->_count) == 0);
> > > +	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
> > 
> > Oof, this might need a comment. It's saying that getting a normal page or the
> > head of a compound page must already have an elevated reference count. If
> > we are getting a tail page, the reference count is stored in both the head
> > and the tail so the BUG check does not apply.
> 
> Ok.
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -353,8 +353,20 @@ static inline int page_count(struct page
>  
>  static inline void get_page(struct page *page)
>  {
> +	/*
> +	 * Getting a normal page or the head of a compound page
> +	 * requires an already elevated page->_count. When getting
> +	 * a tail page instead, the elevated page->_count is
> +	 * required only in the head page, so for tail pages the
> +	 * bugcheck only verifies that the page->_count isn't
> +	 * negative.
> +	 */
>  	VM_BUG_ON(atomic_read(&page->_count) < !PageTail(page));
>  	atomic_inc(&page->_count);
> +	/*
> +	 * Getting a tail page will elevate both the head and tail
> +	 * page->_count(s).
> +	 */
>  	if (unlikely(PageTail(page))) {
>  		/*
>  		 * This is safe only because

Much better.

> 
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -56,17 +56,83 @@ static void __page_cache_release(struct 
> > >  		del_page_from_lru(zone, page);
> > >  		spin_unlock_irqrestore(&zone->lru_lock, flags);
> > >  	}
> > > +}
> > > +
> > > +static void __put_single_page(struct page *page)
> > > +{
> > > +	__page_cache_release(page);
> > >  	free_hot_cold_page(page, 0);
> > >  }
> > >  
> > > +static void __put_compound_page(struct page *page)
> > > +{
> > > +	compound_page_dtor *dtor;
> > > +
> > > +	__page_cache_release(page);
> > > +	dtor = get_compound_page_dtor(page);
> > > +	(*dtor)(page);
> > > +}
> > > +
> > >  static void put_compound_page(struct page *page)
> > >  {
> > > -	page = compound_head(page);
> > > -	if (put_page_testzero(page)) {
> > > -		compound_page_dtor *dtor;
> > > -
> > > -		dtor = get_compound_page_dtor(page);
> > > -		(*dtor)(page);
> > > +	if (unlikely(PageTail(page))) {
> > > +		/* __split_huge_page_refcount can run under us */
> > 
> > So what? The fact you check PageTail twice is a hint as to what is
> > happening and that we are depending on the order of when the head and
> > tail bits get cleared, but it's hard to be certain of that.
> 
> This is correct, we depend on the split_huge_page_refcount to clear
> PageTail before overwriting first_page.
> 
> +	 	/*
> +		 * 1) clear PageTail before overwriting first_page
> +		 * 2) clear PageTail before clearing PageHead for VM_BUG_ON
> +		 */
> +		smp_wmb();
> 

This helps. Less guesswork involved :)

> The first PageTail check just defines the fast path, skipping the read
> of page->first_page and the smp_rmb() for the much more common head
> pages (we're in an unlikely branch).
> 
> So then we read first_page, we smp_rmb(), and if PageTail is still set
> after that, we can be sure first_page isn't a dangling pointer.
> 
> > 
> > > +		struct page *page_head = page->first_page;
> > > +		smp_rmb();
> > > +		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
> > > +			unsigned long flags;
> > > +			if (unlikely(!PageHead(page_head))) {
> > > +				/* PageHead is cleared after PageTail */
> > > +				smp_rmb();
> > > +				VM_BUG_ON(PageTail(page));
> > > +				goto out_put_head;
> > > +			}
> > > +			/*
> > > +			 * Only run compound_lock on a valid PageHead,
> > > +			 * after having it pinned with
> > > +			 * get_page_unless_zero() above.
> > > +			 */
> > > +			smp_mb();
> > > +			/* page_head wasn't a dangling pointer */
> > > +			compound_lock_irqsave(page_head, &flags);
> > > +			if (unlikely(!PageTail(page))) {
> > > +				/* __split_huge_page_refcount run before us */
> > > +				compound_unlock_irqrestore(page_head, flags);
> > > +				VM_BUG_ON(PageHead(page_head));
> > > +			out_put_head:
> > > +				if (put_page_testzero(page_head))
> > > +					__put_single_page(page_head);
> > > +			out_put_single:
> > > +				if (put_page_testzero(page))
> > > +					__put_single_page(page);
> > > +				return;
> > > +			}
> > > +			VM_BUG_ON(page_head != page->first_page);
> > > +			/*
> > > +			 * We can release the refcount taken by
> > > +			 * get_page_unless_zero now that
> > > +			 * split_huge_page_refcount is blocked on the
> > > +			 * compound_lock.
> > > +			 */
> > > +			if (put_page_testzero(page_head))
> > > +				VM_BUG_ON(1);
> > > +			/* __split_huge_page_refcount will wait now */
> > > +			VM_BUG_ON(atomic_read(&page->_count) <= 0);
> > > +			atomic_dec(&page->_count);
> > > +			VM_BUG_ON(atomic_read(&page_head->_count) <= 0);
> > > +			compound_unlock_irqrestore(page_head, flags);
> > > +			if (put_page_testzero(page_head))
> > > +				__put_compound_page(page_head);
> > > +		} else {
> > > +			/* page_head is a dangling pointer */
> > > +			VM_BUG_ON(PageTail(page));
> > > +			goto out_put_single;
> > > +		}
> > > +	} else if (put_page_testzero(page)) {
> > > +		if (PageHead(page))
> > > +			__put_compound_page(page);
> > > +		else
> > > +			__put_single_page(page);
> > >  	}
> > >  }
> > >  
> > > @@ -75,7 +141,7 @@ void put_page(struct page *page)
> > >  	if (unlikely(PageCompound(page)))
> > >  		put_compound_page(page);
> > >  	else if (put_page_testzero(page))
> > > -		__page_cache_release(page);
> > > +		__put_single_page(page);
> > >  }
> > >  EXPORT_SYMBOL(put_page);
> > >  
> > 
> > Functionally, I don't see a problem so
> > 
> > Acked-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > but some expansion on the leader and the comment, even if done as a
> > follow-on patch, would be nice.
> 
> diff --git a/mm/swap.c b/mm/swap.c
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -79,8 +79,18 @@ static void put_compound_page(struct pag
>  		/* __split_huge_page_refcount can run under us */
>  		struct page *page_head = page->first_page;
>  		smp_rmb();
> +		/*
> +		 * If PageTail is still set after smp_rmb() we can be sure
> +		 * that the page->first_page we read wasn't a dangling pointer.
> +		 * See __split_huge_page_refcount() smp_wmb().
> +		 */
>  		if (likely(PageTail(page) && get_page_unless_zero(page_head))) {
>  			unsigned long flags;
> +			/*
> +			 * Verify that our page_head wasn't converted
> +			 * to a regular page before we got a
> +			 * reference on it.
> +			 */
>  			if (unlikely(!PageHead(page_head))) {
>  				/* PageHead is cleared after PageTail */
>  				smp_rmb();
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 18 of 66] add pmd mangling functions to x86
  2010-11-18 13:04     ` Mel Gorman
@ 2010-11-26 17:57       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-26 17:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 01:04:46PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:27:53PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Add needed pmd mangling functions with simmetry with their pte counterparts.
> 
> symmetry

Fixed.

> 
> > pmdp_freeze_flush is the only exception only present on the pmd side and it's
> > needed to serialize the VM against split_huge_page, it simply atomically clears
> > the present bit in the same way pmdp_clear_flush_young atomically clears the
> > accessed bit (and both need to flush the tlb to make it effective, which is
> > mandatory to happen synchronously for pmdp_freeze_flush).
> 
> I don't see a pmdp_freeze_flush defined in the patch. Did you mean
> pmdp_splitting_flush? Even if it is, it's the splitting bit you are
> dealing with which isn't the same as the present bit. I'm missing
> something.

Well the comment went out of sync with the code sorry. I updated it:

=======
Add needed pmd mangling functions in symmetry with their pte counterparts.
pmdp_splitting_flush() is the only new addition among the pmd_ methods and it's
needed to serialize the VM against split_huge_page. It simply atomically sets
the splitting bit, in a similar way to how pmdp_clear_flush_young atomically
clears the accessed bit. pmdp_splitting_flush() also flushes the tlb to make it
effective against gup_fast, even though strictly speaking no tlb flush would be
required; the tlb flush is simply the simplest operation we can invoke to
serialize pmdp_splitting_flush() against gup_fast.
=======
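
To make the serialization concrete, here is a minimal sketch of the gup_fast
side of this (modeled loosely on the x86 gup.c pmd walk; the helper names and
signatures here are illustrative assumptions, not the exact kernel code):

	/*
	 * gup_fast runs with local irqs disabled, so it cannot service the
	 * IPI of an in-flight TLB flush.  Either it observes the splitting
	 * bit set by pmdp_splitting_flush() and bails out to the slow gup
	 * path, or it finishes walking this pmd before the flush (and hence
	 * before the split) can complete.
	 */
	static int gup_pmd_range_sketch(pmd_t pmd, unsigned long addr,
					unsigned long end, int write,
					struct page **pages, int *nr)
	{
		if (pmd_none(pmd) || pmd_trans_splitting(pmd))
			return 0;	/* fall back to the slow gup path */

		if (pmd_trans_huge(pmd))
			return gup_huge_pmd(pmd, addr, end, write, pages, nr);

		return gup_pte_range(pmd, addr, end, write, pages, nr);
	}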

> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > 
> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -302,15 +302,15 @@ pmd_t *populate_extra_pmd(unsigned long 
> >  pte_t *populate_extra_pte(unsigned long vaddr);
> >  #endif	/* __ASSEMBLY__ */
> >  
> > +#ifndef __ASSEMBLY__
> > +#include <linux/mm_types.h>
> > +
> >  #ifdef CONFIG_X86_32
> >  # include "pgtable_32.h"
> >  #else
> >  # include "pgtable_64.h"
> >  #endif
> >  
> > -#ifndef __ASSEMBLY__
> > -#include <linux/mm_types.h>
> > -
> 
> Stupid question: Why is this move necessary?

That's not a stupid question, it seems to build in all configurations
even with this part backed out. I'll try to revert this one in the
hope that it won't break the build. I suppose some earlier version of the
patchset required this to build (I would never make a gratuitous
change like this if it wasn't needed at some point) but it seems not
to be required anymore according to my build tests. If I'm wrong and some
build breaks I'll reintroduce it later.

> >  static inline int pte_none(pte_t pte)
> >  {
> >  	return !pte.pte;
> > @@ -353,7 +353,7 @@ static inline unsigned long pmd_page_vad
> >   * Currently stuck as a macro due to indirect forward reference to
> >   * linux/mmzone.h's __section_mem_map_addr() definition:
> >   */
> > -#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> > +#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
> >  
> 
> Why is it now necessary to use PTE_PFN_MASK?

Just for the NX bit, that couldn't be set before the pmd could be
marked PSE.

> The implementations look fine but I'm having trouble reconciling what
> the leader says with the patch :(

Yes because it was out of sync, the new version is above.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 11 of 66] add pmd paravirt ops
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2010-11-26 18:01     ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-26 18:01 UTC (permalink / raw)
  To: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel
  Cc: Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

On Wed, Nov 03, 2010 at 04:27:46PM +0100, Andrea Arcangeli wrote:
> @@ -543,6 +554,18 @@ static inline void set_pte_at(struct mm_
>  		PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
>  }
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> +			      pmd_t *pmdp, pmd_t pmd)
> +{
> +	if (sizeof(pmdval_t) > sizeof(long))
> +		/* 5 arg words */
> +		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
> +	else
> +		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
> +}
> +#endif
> +
>  static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  {
>  	pmdval_t val = native_pmd_val(pmd);

btw, I noticed the 32bit build with HIGHMEM off, paravirt on, and THP
on fails without the change below (that configuration clearly has
never been build tested), so I'm merging it into the above quoted
patch.

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -558,11 +558,13 @@ static inline void set_pte_at(struct mm_
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 			      pmd_t *pmdp, pmd_t pmd)
 {
+#if PAGETABLE_LEVELS >= 3
 	if (sizeof(pmdval_t) > sizeof(long))
 		/* 5 arg words */
 		pv_mmu_ops.set_pmd_at(mm, addr, pmdp, pmd);
 	else
 		PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp, pmd.pmd);
+#endif
 }
 #endif
 
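As a rough aside on the "5 arg words" comment above, a back-of-the-envelope
count (a standalone illustration assuming a 32-bit PAE configuration where
pmdval_t is 64-bit; not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	typedef uint64_t pmdval_t;	/* PAE: pmd entries are 64-bit */
	typedef uint32_t word_t;	/* 32-bit kernel: long is 32-bit */

	int main(void)
	{
		/* mm pointer + addr + pmdp pointer = 3 words, pmd value = 2 */
		unsigned int args = 3 + sizeof(pmdval_t) / sizeof(word_t);

		printf("set_pmd_at() argument words: %u\n", args);
		printf("fits the 4-word PVOP_VCALL4 fast path? %s\n",
		       sizeof(pmdval_t) > sizeof(word_t) ? "no" : "yes");
		return 0;
	}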

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 17 of 66] add pmd mangling generic functions
  2010-11-25 17:35         ` Andrea Arcangeli
@ 2010-11-26 22:24           ` Linus Torvalds
  -1 siblings, 0 replies; 331+ messages in thread
From: Linus Torvalds @ 2010-11-26 22:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Fri, Nov 26, 2010 at 2:35 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Thu, Nov 18, 2010 at 09:32:36AM -0800, Linus Torvalds wrote:
>> I dunno. Those macros are _way_ too big and heavy to be macros or
>> inline functions. Why aren't pmdp_splitting_flush() etc just
>> functions?
>
> That's because ptep_clear_flush and everything else in that file named
> with ptep_* and doing expensive tlb flushes was a macro.

That may be, and you needn't necessarily clean up old use (although
that might be nice as a separate thing), but I wish we didn't make
what is already messy bigger and messier.

         Linus

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 53 of 66] add numa awareness to hugepage allocations
  2010-11-03 15:28   ` Andrea Arcangeli
@ 2010-11-29  5:38     ` Daisuke Nishimura
  -1 siblings, 0 replies; 331+ messages in thread
From: Daisuke Nishimura @ 2010-11-29  5:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner, Chris Mason,
	Borislav Petkov, Daisuke Nishimura

> @@ -1655,7 +1672,11 @@ static void collapse_huge_page(struct mm
>  	unsigned long hstart, hend;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +#ifndef CONFIG_NUMA
>  	VM_BUG_ON(!*hpage);
> +#else
> +	VM_BUG_ON(*hpage);
> +#endif
>  
>  	/*
>  	 * Prevent all access to pagetables with the exception of
> @@ -1693,7 +1714,15 @@ static void collapse_huge_page(struct mm
>  	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
>  		goto out;
>  
> +#ifndef CONFIG_NUMA
>  	new_page = *hpage;
> +#else
> +	new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address);
> +	if (unlikely(!new_page)) {
> +		*hpage = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +#endif
>  	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
>  		goto out;
>  
I think this should be:

	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
#ifdef CONFIG_NUMA
		put_page(new_page);
#endif
		goto out;
	}

Thanks,
Daisuke Nishimura.

> @@ -1724,6 +1753,9 @@ static void collapse_huge_page(struct mm
>  		spin_unlock(&mm->page_table_lock);
>  		anon_vma_unlock(vma->anon_vma);
>  		mem_cgroup_uncharge_page(new_page);
> +#ifdef CONFIG_NUMA
> +		put_page(new_page);
> +#endif
>  		goto out;
>  	}
>  
> @@ -1759,7 +1791,9 @@ static void collapse_huge_page(struct mm
>  	mm->nr_ptes--;
>  	spin_unlock(&mm->page_table_lock);
>  
> +#ifndef CONFIG_NUMA
>  	*hpage = NULL;
> +#endif
>  	khugepaged_pages_collapsed++;
>  out:
>  	up_write(&mm->mmap_sem);

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 18 of 66] add pmd mangling functions to x86
  2010-11-26 17:57       ` Andrea Arcangeli
@ 2010-11-29 10:23         ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-11-29 10:23 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Fri, Nov 26, 2010 at 06:57:51PM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 18, 2010 at 01:04:46PM +0000, Mel Gorman wrote:
> > On Wed, Nov 03, 2010 at 04:27:53PM +0100, Andrea Arcangeli wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > Add needed pmd mangling functions with simmetry with their pte counterparts.
> > 
> > symmetry
> 
> Fixed.
> 
> > 
> > > pmdp_freeze_flush is the only exception only present on the pmd side and it's
> > > needed to serialize the VM against split_huge_page, it simply atomically clears
> > > the present bit in the same way pmdp_clear_flush_young atomically clears the
> > > accessed bit (and both need to flush the tlb to make it effective, which is
> > > mandatory to happen synchronously for pmdp_freeze_flush).
> > 
> > I don't see a pmdp_freeze_flush defined in the patch. Did you mean
> > pmdp_splitting_flush? Even if it is, it's the splitting bit you are
> > dealing with which isn't the same as the present bit. I'm missing
> > something.
> 
> Well the comment went out of sync with the code sorry. I updated it:
> 
> =======
> Add needed pmd mangling functions in symmetry with their pte counterparts.
> pmdp_splitting_flush() is the only new addition among the pmd_ methods and it's
> needed to serialize the VM against split_huge_page. It simply atomically sets
> the splitting bit, in a similar way to how pmdp_clear_flush_young atomically
> clears the accessed bit. pmdp_splitting_flush() also flushes the tlb to make it
> effective against gup_fast, even though strictly speaking no tlb flush would be
> required; the tlb flush is simply the simplest operation we can invoke to
> serialize pmdp_splitting_flush() against gup_fast.
> =======
> 

Much clearer, thanks.

> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> > > ---
> > > 
> > > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> > > --- a/arch/x86/include/asm/pgtable.h
> > > +++ b/arch/x86/include/asm/pgtable.h
> > > @@ -302,15 +302,15 @@ pmd_t *populate_extra_pmd(unsigned long 
> > >  pte_t *populate_extra_pte(unsigned long vaddr);
> > >  #endif	/* __ASSEMBLY__ */
> > >  
> > > +#ifndef __ASSEMBLY__
> > > +#include <linux/mm_types.h>
> > > +
> > >  #ifdef CONFIG_X86_32
> > >  # include "pgtable_32.h"
> > >  #else
> > >  # include "pgtable_64.h"
> > >  #endif
> > >  
> > > -#ifndef __ASSEMBLY__
> > > -#include <linux/mm_types.h>
> > > -
> > 
> > Stupid question: Why is this move necessary?
> 
> That's not a stupid question, it seems to build in all configurations
> even with this part backed out. I'll try to revert this one in the
> hope that it won't break the build. I suppose some earlier version of the
> patchset required this to build (I would never make a gratuitous
> change like this if it wasn't needed at some point) but it seems not
> to be required anymore according to my build tests. If I'm wrong and some
> build breaks I'll reintroduce it later.
> 

Ok.

> > >  static inline int pte_none(pte_t pte)
> > >  {
> > >  	return !pte.pte;
> > > @@ -353,7 +353,7 @@ static inline unsigned long pmd_page_vad
> > >   * Currently stuck as a macro due to indirect forward reference to
> > >   * linux/mmzone.h's __section_mem_map_addr() definition:
> > >   */
> > > -#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> > > +#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
> > >  
> > 
> > Why is it now necessary to use PTE_PFN_MASK?
> 
> Just for the NX bit, that couldn't be set before the pmd could be
> marked PSE.
> 

Sorry, I still am missing something. PTE_PFN_MASK is this

#define PTE_PFN_MASK            ((pteval_t)PHYSICAL_PAGE_MASK)
#define PHYSICAL_PAGE_MASK      (((signed long)PAGE_MASK) & __PHYSICAL_MASK)

I'm not seeing how PTE_PFN_MASK affects the NX bit (bit 63).

> > The implementations look fine but I'm having trouble reconciling what
> > the leader says with the patch :(
> 
> Yes because it was out of sync, the new version is above.
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 53 of 66] add numa awareness to hugepage allocations
  2010-11-29  5:38     ` Daisuke Nishimura
@ 2010-11-29 16:11       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-29 16:11 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner, Chris Mason,
	Borislav Petkov

On Mon, Nov 29, 2010 at 02:38:01PM +0900, Daisuke Nishimura wrote:
> I think this should be:
> 
> 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> #ifdef CONFIG_NUMA
> 		put_page(new_page);
> #endif
> 		goto out;
> 	}

Hmm no, the change you suggest would generate memory corruption with
use after free.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 18 of 66] add pmd mangling functions to x86
  2010-11-29 10:23         ` Mel Gorman
@ 2010-11-29 16:59           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-29 16:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Mon, Nov 29, 2010 at 10:23:11AM +0000, Mel Gorman wrote:
> > > > @@ -353,7 +353,7 @@ static inline unsigned long pmd_page_vad
> > > >   * Currently stuck as a macro due to indirect forward reference to
> > > >   * linux/mmzone.h's __section_mem_map_addr() definition:
> > > >   */
> > > > -#define pmd_page(pmd)	pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT)
> > > > +#define pmd_page(pmd)	pfn_to_page((pmd_val(pmd) & PTE_PFN_MASK) >> PAGE_SHIFT)
> > > >  
> > > 
> > > Why is it now necessary to use PTE_PFN_MASK?
> > 
> > Just for the NX bit, that couldn't be set before the pmd could be
> > marked PSE.
> > 
> 
> Sorry, I still am missing something. PTE_PFN_MASK is this
> 
> #define PTE_PFN_MASK            ((pteval_t)PHYSICAL_PAGE_MASK)
> #define PHYSICAL_PAGE_MASK      (((signed long)PAGE_MASK) & __PHYSICAL_MASK)
> 
> I'm not seeing how PTE_PFN_MASK affects the NX bit (bit 63).

It simply clears it by ANDing with the mask: otherwise bit 51 (the NX
bit 63 of the pmd, shifted right by PAGE_SHIFT) would remain
erroneously set in the pfn passed to pfn_to_page.

Clearing bit 63 wasn't needed before because bit 63 couldn't be set on
a non-huge pmd.
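
As a worked illustration of that bit arithmetic (standalone and simplified;
the mask value and bit positions below are assumptions matching x86-64 with
4k pages, not the kernel's exact constants):

	#include <stdio.h>
	#include <stdint.h>

	#define PAGE_SHIFT	12
	#define NX_BIT		(1ULL << 63)	/* _PAGE_NX */
	#define PSE_BIT		(1ULL << 7)	/* _PAGE_PSE */
	/* simplified stand-in for PTE_PFN_MASK: keep only the pfn bits */
	#define PFN_MASK	0x000ffffffffff000ULL

	int main(void)
	{
		/* a huge pmd: 2M-aligned physical address, PSE and NX set */
		uint64_t pmd = 0x200000ULL | PSE_BIT | NX_BIT | 0x63;

		/* old pmd_page(): NX leaks in as bit 51 of the "pfn" */
		printf("unmasked pfn: 0x%llx\n",
		       (unsigned long long)(pmd >> PAGE_SHIFT));
		/* new pmd_page(): mask first, so the pfn comes out clean */
		printf("masked pfn:   0x%llx\n",
		       (unsigned long long)((pmd & PFN_MASK) >> PAGE_SHIFT));
		return 0;
	}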

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 28 of 66] _GFP_NO_KSWAPD
  2010-11-18 13:18     ` Mel Gorman
@ 2010-11-29 19:03       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-29 19:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 01:18:39PM +0000, Mel Gorman wrote:
> This is not an exact merge with what's currently in mm. Look at the top
> of gfp.h and see "Plain integer GFP bitmasks. Do not use this
> directly.". The 0x400000u definition needs to go there and this becomes
> 
> #define __GFP_NO_KSWAPD		((__force_gfp_t)____0x400000u)
> 
> What you have just generates sparse warnings (I believe) so it's
> harmless.

Agreed.

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -34,6 +34,7 @@ struct vm_area_struct;
 #else
 #define ___GFP_NOTRACK		0
 #endif
+#define ___GFP_NO_KSWAPD	0x400000u
 
 /*
  * GFP bitmasks..
@@ -81,7 +82,7 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
-#define __GFP_NO_KSWAPD	((__force gfp_t)0x400000u)
+#define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.


> Other than needing to define ____GFP_NO_KSWAPD

3 underscores.

> Acked-by: Mel Gorman <mel@csn.ul.ie>

Added, thanks!
Andrea
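
For context, a hedged sketch of how the flag ends up being consumed on the
allocation slow path (simplified and illustrative; the function shape below is
an assumption, not the exact hunk from patch 28):

	/*
	 * The slow path skips waking kswapd for allocations that pass
	 * __GFP_NO_KSWAPD, so a failed transparent hugepage allocation
	 * doesn't leave kswapd churning on its behalf.
	 */
	static struct page *alloc_pages_slowpath_sketch(gfp_t gfp_mask,
							unsigned int order,
							struct zonelist *zonelist,
							enum zone_type high_zoneidx)
	{
		if (!(gfp_mask & __GFP_NO_KSWAPD))
			wake_all_kswapd(order, zonelist, high_zoneidx);

		/* ... direct reclaim, compaction and retries as usual ... */
		return NULL;
	}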

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 53 of 66] add numa awareness to hugepage allocations
  2010-11-29 16:11       ` Andrea Arcangeli
@ 2010-11-30  0:38         ` Daisuke Nishimura
  -1 siblings, 0 replies; 331+ messages in thread
From: Daisuke Nishimura @ 2010-11-30  0:38 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner, Chris Mason,
	Borislav Petkov, Daisuke Nishimura

On Mon, 29 Nov 2010 17:11:03 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Mon, Nov 29, 2010 at 02:38:01PM +0900, Daisuke Nishimura wrote:
> > I think this should be:
> > 
> > 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
> > #ifdef CONFIG_NUMA
> > 		put_page(new_page);
> > #endif
> > 		goto out;
> > 	}
> 
> Hmm no, the change you suggest would generate memory corruption with
> use after free.

I'm sorry if I'm missing something: "new_page" will be reused in the
!CONFIG_NUMA case as you say, but in the CONFIG_NUMA case it is allocated in
this function (collapse_huge_page()) by alloc_hugepage_vma(), and is not freed
when the memcg charge fails.
Actually, we already do this in collapse_huge_page():
	if (unlikely(!isolated)) {
		...
#ifdef CONFIG_NUMA
		put_page(new_page);
#endif
		goto out;
	}
later. I think we need similar logic in the memcg failure path too.

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 53 of 66] add numa awareness to hugepage allocations
  2010-11-30  0:38         ` Daisuke Nishimura
@ 2010-11-30 19:01           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-11-30 19:01 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner, Chris Mason,
	Borislav Petkov

On Tue, Nov 30, 2010 at 09:38:04AM +0900, Daisuke Nishimura wrote:
> I'm sorry if I'm missing something: "new_page" will be reused in the
> !CONFIG_NUMA case as you say, but in the CONFIG_NUMA case it is allocated in
> this function (collapse_huge_page()) by alloc_hugepage_vma(), and is not freed
> when the memcg charge fails.
> Actually, we already do this in collapse_huge_page():
> 	if (unlikely(!isolated)) {
> 		...
> #ifdef CONFIG_NUMA
> 		put_page(new_page);
> #endif
> 		goto out;
> 	}
> later. I think we need similar logic in the memcg failure path too.

Apologies, you really did find a minor memleak in the case of a memcg
accounting failure.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1726,7 +1726,7 @@ static void collapse_huge_page(struct mm
 	}
 #endif
 	if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)))
-		goto out;
+		goto out_put_page;
 
 	anon_vma_lock(vma->anon_vma);
 
@@ -1755,10 +1755,7 @@ static void collapse_huge_page(struct mm
 		spin_unlock(&mm->page_table_lock);
 		anon_vma_unlock(vma->anon_vma);
 		mem_cgroup_uncharge_page(new_page);
-#ifdef CONFIG_NUMA
-		put_page(new_page);
-#endif
-		goto out;
+		goto out_put_page;
 	}
 
 	/*
@@ -1799,6 +1796,13 @@ static void collapse_huge_page(struct mm
 	khugepaged_pages_collapsed++;
 out:
 	up_write(&mm->mmap_sem);
+	return;
+
+out_put_page:
+#ifdef CONFIG_NUMA
+	put_page(new_page);
+#endif
+	goto out;
 }
 
 static int khugepaged_scan_pmd(struct mm_struct *mm,



I was too optimistic in assuming there wasn't really a bug; I thought it
was just some confusion about the hpage usage, which differs between the
NUMA and non-NUMA cases.

On a side note, the CONFIG_NUMA case will later change further to move
the allocation outside of the mmap_sem write mode, to make filesystems
that submit I/O from userland and do memory allocations in the I/O
paths happier.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 17 of 66] add pmd mangling generic functions
  2010-11-26 22:24           ` Linus Torvalds
@ 2010-12-02 17:50             ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-02 17:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, linux-mm, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

Hello,

On Sat, Nov 27, 2010 at 07:24:37AM +0900, Linus Torvalds wrote:
> That may be, and you needn't necessarily clean up old use (although
> that might be nice as a separate thing), but I wish we didn't make
> what is already messy bigger and messier.

Ok, I cleaned up most of the old code in asm-generic/pgtable.h too. Let
me know if you'd like further changes in this area. I embedded the
cleanups of the pmd_trans_huge/pmd_trans_splitting inside the previous
patch (16) too (not at the end anymore).

Thanks,
Andrea

=====
Subject: add pmd mangling generic functions

From: Andrea Arcangeli <aarcange@redhat.com>

Some are needed to build but not actually used on archs not supporting
transparent hugepages. Others like pmdp_clear_flush are used by x86 too.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
---

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -5,67 +5,108 @@
 #ifdef CONFIG_MMU
 
 #ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
-/*
- * Largely same as above, but only sets the access flags (dirty,
- * accessed, and writable). Furthermore, we know it always gets set
- * to a "more permissive" setting, which allows most architectures
- * to optimize this. We return whether the PTE actually changed, which
- * in turn instructs the caller to do things like update__mmu_cache.
- * This used to be done in the caller, but sparc needs minor faults to
- * force that call on sun4c so we changed this macro slightly
- */
-#define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
-({									  \
-	int __changed = !pte_same(*(__ptep), __entry);			  \
-	if (__changed) {						  \
-		set_pte_at((__vma)->vm_mm, (__address), __ptep, __entry); \
-		flush_tlb_page(__vma, __address);			  \
-	}								  \
-	__changed;							  \
-})
+extern int ptep_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pte_t *ptep,
+				 pte_t entry, int dirty);
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+extern int pmdp_set_access_flags(struct vm_area_struct *vma,
+				 unsigned long address, pmd_t *pmdp,
+				 pmd_t entry, int dirty);
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
-#define ptep_test_and_clear_young(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte = *(__ptep);					\
-	int r = 1;							\
-	if (!pte_young(__pte))						\
-		r = 0;							\
-	else								\
-		set_pte_at((__vma)->vm_mm, (__address),			\
-			   (__ptep), pte_mkold(__pte));			\
-	r;								\
-})
+static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pte_t *ptep)
+{
+	pte_t pte = *ptep;
+	int r = 1;
+	if (!pte_young(pte))
+		r = 0;
+	else
+		set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));
+	return r;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pmd_t *pmdp)
+{
+	pmd_t pmd = *pmdp;
+	int r = 1;
+	if (!pmd_young(pmd))
+		r = 0;
+	else
+		set_pmd_at(vma->vm_mm, address, pmdp, pmd_mkold(pmd));
+	return r;
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
+					    unsigned long address,
+					    pmd_t *pmdp)
+{
+	BUG();
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
-#define ptep_clear_flush_young(__vma, __address, __ptep)		\
-({									\
-	int __young;							\
-	__young = ptep_test_and_clear_young(__vma, __address, __ptep);	\
-	if (__young)							\
-		flush_tlb_page(__vma, __address);			\
-	__young;							\
-})
+int ptep_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pte_t *ptep);
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp);
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
-#define ptep_get_and_clear(__mm, __address, __ptep)			\
-({									\
-	pte_t __pte = *(__ptep);					\
-	pte_clear((__mm), (__address), (__ptep));			\
-	__pte;								\
+static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
+				       unsigned long address,
+				       pte_t *ptep)
+{
+	pte_t pte = *ptep;
+	pte_clear(mm, address, ptep);
+	return pte;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_GET_AND_CLEAR
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm,
+				       unsigned long address,
+				       pmd_t *pmdp)
+{
+	pmd_t pmd = *pmdp;
+	pmd_clear(mm, address, pmdp);
+	return pmd;
 })
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline pmd_t pmdp_get_and_clear(struct mm_struct *mm,
+				       unsigned long address,
+				       pmd_t *pmdp)
+{
+	BUG();
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
-#define ptep_get_and_clear_full(__mm, __address, __ptep, __full)	\
-({									\
-	pte_t __pte;							\
-	__pte = ptep_get_and_clear((__mm), (__address), (__ptep));	\
-	__pte;								\
-})
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+					    unsigned long address, pte_t *ptep,
+					    int full)
+{
+	pte_t pte;
+	pte = ptep_get_and_clear(mm, address, ptep);
+	return pte;
+}
 #endif
 
 /*
@@ -74,20 +115,25 @@
  * not present, or in the process of an address space destruction.
  */
 #ifndef __HAVE_ARCH_PTE_CLEAR_NOT_PRESENT_FULL
-#define pte_clear_not_present_full(__mm, __address, __ptep, __full)	\
-do {									\
-	pte_clear((__mm), (__address), (__ptep));			\
-} while (0)
+static inline void pte_clear_not_present_full(struct mm_struct *mm,
+					      unsigned long address,
+					      pte_t *ptep,
+					      int full)
+{
+	pte_clear(mm, address, ptep);
+}
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
-#define ptep_clear_flush(__vma, __address, __ptep)			\
-({									\
-	pte_t __pte;							\
-	__pte = ptep_get_and_clear((__vma)->vm_mm, __address, __ptep);	\
-	flush_tlb_page(__vma, __address);				\
-	__pte;								\
-})
+extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
+			      unsigned long address,
+			      pte_t *ptep);
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH
+extern pmd_t pmdp_clear_flush(struct vm_area_struct *vma,
+			      unsigned long address,
+			      pmd_t *pmdp);
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
@@ -99,8 +145,49 @@ static inline void ptep_set_wrprotect(st
 }
 #endif
 
+#ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long address, pmd_t *pmdp)
+{
+	pmd_t old_pmd = *pmdp;
+	set_pmd_at(mm, address, pmdp, pmd_wrprotect(old_pmd));
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline void pmdp_set_wrprotect(struct mm_struct *mm,
+				      unsigned long address, pmd_t *pmdp)
+{
+	BUG();
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+extern void pmdp_splitting_flush(struct vm_area_struct *vma,
+				 unsigned long address,
+				 pmd_t *pmdp);
+#endif
+
 #ifndef __HAVE_ARCH_PTE_SAME
-#define pte_same(A,B)	(pte_val(A) == pte_val(B))
+static inline int pte_same(pte_t pte_a, pte_t pte_b)
+{
+	return pte_val(pte_a) == pte_val(pte_b);
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMD_SAME
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
+{
+	return pmd_val(pmd_a) == pmd_val(pmd_b);
+}
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
+{
+	BUG();
+	return 0;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
 #ifndef __HAVE_ARCH_PAGE_TEST_DIRTY
@@ -357,6 +444,13 @@ static inline int pmd_trans_splitting(pm
 {
 	return 0;
 }
+#ifndef __HAVE_ARCH_PMD_WRITE
+static inline int pmd_write(pmd_t pmd)
+{
+	BUG();
+	return 0;
+}
+#endif /* __HAVE_ARCH_PMD_WRITE */
 #endif
 
 #endif /* !__ASSEMBLY__ */
diff --git a/mm/Makefile b/mm/Makefile
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -5,7 +5,7 @@
 mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   mlock.o mmap.o mprotect.o mremap.o msync.o rmap.o \
-			   vmalloc.o pagewalk.o
+			   vmalloc.o pagewalk.o pgtable-generic.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
 			   maccess.o page_alloc.o page-writeback.o \
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
new file mode 100644
--- /dev/null
+++ b/mm/pgtable-generic.c
@@ -0,0 +1,124 @@
+/*
+ *  mm/pgtable-generic.c
+ *
+ *  Generic pgtable methods declared in asm-generic/pgtable.h
+ *  Most arch won't need these.
+ *
+ *  Copyright (C) 2010  Linus Torvalds
+ */
+
+#include <asm/tlb.h>
+#include <asm-generic/pgtable.h>
+
+#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
+/*
+ * Only sets the access flags (dirty, accessed, and
+ * writable). Furthermore, we know it always gets set to a "more
+ * permissive" setting, which allows most architectures to optimize
+ * this. We return whether the PTE actually changed, which in turn
+ * instructs the caller to do things like update__mmu_cache.  This
+ * used to be done in the caller, but sparc needs minor faults to
+ * force that call on sun4c so we changed this macro slightly
+ */
+int ptep_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pte_t *ptep,
+			  pte_t entry, int dirty)
+{
+	int changed = !pte_same(*ptep, entry);
+	if (changed) {
+		set_pte_at(vma->vm_mm, address, ptep, entry);
+		flush_tlb_page(vma, address);
+	}
+	return changed;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
+int pmdp_set_access_flags(struct vm_area_struct *vma,
+			  unsigned long address, pmd_t *pmdp,
+			  pmd_t entry, int dirty)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	int changed = !pmd_same(*pmdp, entry);
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	if (changed) {
+		set_pmd_at(vma->vm_mm, address, pmdp, entry);
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	}
+	return changed;
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+	BUG();
+	return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+}
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
+int ptep_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pte_t *ptep)
+{
+	int young;
+	young = ptep_test_and_clear_young(vma, address, ptep);
+	if (young)
+		flush_tlb_page(vma, address);
+	return young;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
+int pmdp_clear_flush_young(struct vm_area_struct *vma,
+			   unsigned long address, pmd_t *pmdp)
+{
+	int young;
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+	BUG();
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	young = pmdp_test_and_clear_young(vma, address, pmdp);
+	if (young)
+		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return young;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
+pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
+		       pte_t *ptep)
+{
+	pte_t pte;
+	pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
+	flush_tlb_page(vma, address);
+	return pte;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_CLEAR_FLUSH
+pmd_t pmdp_clear_flush(struct vm_area_struct *vma, unsigned long address,
+		       pmd_t *pmdp)
+{
+	pmd_t pmd;
+#ifndef CONFIG_TRANSPARENT_HUGEPAGE
+	BUG();
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	pmd = pmdp_get_and_clear(vma->vm_mm, address, pmdp);
+	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	return pmd;
+}
+#endif
+
+#ifndef __HAVE_ARCH_PMDP_SPLITTING_FLUSH
+void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
+			   pmd_t *pmdp)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	pmd_t pmd = pmd_mksplitting(*pmdp);
+	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+	/* tlb flush only to serialize against gup-fast */
+	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+	BUG();
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+}
+#endif

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 30 of 66] transparent hugepage core
  2010-11-18 15:12     ` Mel Gorman
@ 2010-12-07 21:24       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-07 21:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 03:12:21PM +0000, Mel Gorman wrote:
> All that seems fine to me. Nits in part that are simply not worth
> calling out. In principal, I Agree With This :)

I didn't understand what is not worth calling out, but I like the end
of your sentence ;).

> > +#define wait_split_huge_page(__anon_vma, __pmd)				\
> > +	do {								\
> > +		pmd_t *____pmd = (__pmd);				\
> > +		spin_unlock_wait(&(__anon_vma)->root->lock);		\
> > +		/*							\
> > +		 * spin_unlock_wait() is just a loop in C and so the	\
> > +		 * CPU can reorder anything around it.			\
> > +		 */							\
> > +		smp_mb();						\
> 
> Just a note as I see nothing wrong with this but that's a good spot. The
> unlock isn't a memory barrier. Out of curiousity, does it really need to be
> a full barrier or would a write barrier have been enough?
> 
> > +		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
> > +		       pmd_trans_huge(*____pmd));			\

spin_unlock_wait reads, and the BUG_ON reads, so even if we ignore
what happens before and after wait_split_huge_page(), at most a read
memory barrier would be implied. It can't be a write memory barrier,
as the reads in spin_unlock_wait would pass it too.

I think it had better be a full memory barrier, to be sure the writes
after wait_split_huge_page() returns don't happen before
spin_unlock_wait. It's hard to see how that could happen in practice
though.
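
(To make the ordering concern concrete, here is a minimal sketch of
the two sides -- the names only illustrate the huge_memory.c call
sites, it's not the literal code:)

	/* split side: the split runs under the anon_vma lock */
	spin_lock(&anon_vma->root->lock);
	set_pmd_at(mm, address, pmdp, pmd_mksplitting(*pmdp));
	/* ... split the huge page ... */
	spin_unlock(&anon_vma->root->lock);

	/* wait side: wait_split_huge_page() */
	spin_unlock_wait(&anon_vma->root->lock); /* plain loads in C */
	/*
	 * Full barrier: neither the loads in the BUG_ON below nor any
	 * later store to the page tables may be reordered before the
	 * spin_unlock_wait() loop above.
	 */
	smp_mb();
	BUG_ON(pmd_trans_splitting(*pmdp) || pmd_trans_huge(*pmdp));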

> > +	} while (0)
> > +#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > +#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > +#if HPAGE_PMD_ORDER > MAX_ORDER
> > +#error "hugepages can't be allocated by the buddy allocator"
> > +#endif
> > +
> > +extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
> > +static inline int PageTransHuge(struct page *page)
> > +{
> > +	VM_BUG_ON(PageTail(page));
> > +	return PageHead(page);
> > +}
> 
> gfp.h seems an odd place for these. Should the flags go in page-flags.h
> and maybe put vma_address() in internal.h?
> 
> Not a biggie.

Cleaned up thanks.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,13 +97,6 @@ extern void __split_huge_page_pmd(struct
 #if HPAGE_PMD_ORDER > MAX_ORDER
 #error "hugepages can't be allocated by the buddy allocator"
 #endif
-
-extern unsigned long vma_address(struct page *page, struct vm_area_struct *vma);
-static inline int PageTransHuge(struct page *page)
-{
-	VM_BUG_ON(PageTail(page));
-	return PageHead(page);
-}
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUG(); 0; })
@@ -120,7 +113,6 @@ static inline int split_huge_page(struct
 	do { } while (0)
 #define wait_split_huge_page(__anon_vma, __pmd)	\
 	do { } while (0)
-#define PageTransHuge(page) 0
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -409,6 +409,19 @@ static inline void ClearPageCompound(str
 
 #endif /* !PAGEFLAGS_EXTENDED */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int PageTransHuge(struct page *page)
+{
+	VM_BUG_ON(PageTail(page));
+	return PageHead(page);
+}
+#else
+static inline int PageTransHuge(struct page *page)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MMU
 #define __PG_MLOCKED		(1 << PG_mlocked)
 #else
diff --git a/mm/internal.h b/mm/internal.h
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -134,6 +134,10 @@ static inline void mlock_migrate_page(st
 	}
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern unsigned long vma_address(struct page *page,
+				 struct vm_area_struct *vma);
+#endif
 #else /* !CONFIG_MMU */
 static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
 {

> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -20,11 +20,18 @@ static inline int page_is_file_cache(str
> >  }
> >  
> >  static inline void
> > +__add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l,
> > +		       struct list_head *head)
> > +{
> > +	list_add(&page->lru, head);
> > +	__inc_zone_state(zone, NR_LRU_BASE + l);
> > +	mem_cgroup_add_lru_list(page, l);
> > +}
> > +
> > +static inline void
> >  add_page_to_lru_list(struct zone *zone, struct page *page, enum lru_list l)
> >  {
> > -	list_add(&page->lru, &zone->lru[l].list);
> > -	__inc_zone_state(zone, NR_LRU_BASE + l);
> > -	mem_cgroup_add_lru_list(page, l);
> > +	__add_page_to_lru_list(zone, page, l, &zone->lru[l].list);
> >  }
> >  
> 
> Do these really need to be in a public header or can they move to
> mm/swap.c?

The above quoted change is a noop as far as the old code is concerned,
and moving it to swap.c would alter the old code. I think list_add and
__mod_zone_page_state are pretty small and fast, so it's probably
worth keeping them inline.

> > +static void prepare_pmd_huge_pte(pgtable_t pgtable,
> > +				 struct mm_struct *mm)
> > +{
> > +	VM_BUG_ON(spin_can_lock(&mm->page_table_lock));
> > +
> 
> assert_spin_locked() ?

Changed.
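
(i.e. the check simply becomes:)

	assert_spin_locked(&mm->page_table_lock);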

> > +int handle_pte_fault(struct mm_struct *mm,
> > +		     struct vm_area_struct *vma, unsigned long address,
> > +		     pte_t *pte, pmd_t *pmd, unsigned int flags)
> >  {
> >  	pte_t entry;
> >  	spinlock_t *ptl;
> > @@ -3222,9 +3257,40 @@ int handle_mm_fault(struct mm_struct *mm
> >  	pmd = pmd_alloc(mm, pud, address);
> >  	if (!pmd)
> >  		return VM_FAULT_OOM;
> > -	pte = pte_alloc_map(mm, vma, pmd, address);
> > -	if (!pte)
> > +	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> > +		if (!vma->vm_ops)
> > +			return do_huge_pmd_anonymous_page(mm, vma, address,
> > +							  pmd, flags);
> > +	} else {
> > +		pmd_t orig_pmd = *pmd;
> > +		barrier();
> 
> What is this barrier for?

This is to guarantee that gcc doesn't re-read *pmd after the barrier
and instead always reads it from the orig_pmd variable on the local
kernel stack. gcc doesn't know that *pmd can still change from under
us until after we take some lock, and the code relies on orig_pmd not
changing after barrier().
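
(A minimal sketch of the pattern; handle_huge_pmd is only an
illustrative name, it's not the literal memory.c code:)

	pmd_t orig_pmd = *pmd;	/* read *pmd exactly once */
	barrier();		/* keep gcc from re-reading *pmd below */
	if (pmd_trans_huge(orig_pmd)) {
		/*
		 * Every check from here on must look at orig_pmd and
		 * never at *pmd: *pmd can still change under us until
		 * the relevant lock is taken.
		 */
		return handle_huge_pmd(mm, vma, address, pmd, orig_pmd);
	}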

> Other than a few minor questions, these seems very similar to what you
> had before. There is a lot going on in this patch but I did not find
> anything wrong.
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>

Great thanks!
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 33 of 66] madvise(MADV_HUGEPAGE)
  2010-11-18 15:19     ` Mel Gorman
@ 2010-12-09 17:14       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 17:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 03:19:35PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:28:08PM +0100, Andrea Arcangeli wrote:
> > @@ -121,6 +122,11 @@ static inline int split_huge_page(struct
> >  #define wait_split_huge_page(__anon_vma, __pmd)	\
> >  	do { } while (0)
> >  #define PageTransHuge(page) 0
> > +static inline int hugepage_madvise(unsigned long *vm_flags)
> > +{
> > +	BUG_ON(0);
> 
> What's BUG_ON(0) in aid of?

When CONFIG_TRANSPARENT_HUGEPAGE is disabled, nothing must call that
function (madvise must return -EINVAL like older kernels instead). But
I guess you meant I should convert the BUG_ON(0) to a BUG() instead? (done)
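
(So the !CONFIG_TRANSPARENT_HUGEPAGE stub ends up roughly as below; it
can never run, because madvise() has already returned -EINVAL before
reaching it:)

	static inline int hugepage_madvise(unsigned long *vm_flags)
	{
		BUG();
		return 0;
	}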

> I should have said it at patch 4 but don't forget that Michael Kerrisk
> should be made aware of MADV_HUGEPAGE so it makes it to a manual page
> at some point.

Ok, I'll forward patch 4.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 44 of 66] skip transhuge pages in ksm for now
  2010-11-18 16:06     ` Mel Gorman
@ 2010-12-09 18:13       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 18:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:06:13PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:28:19PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > Skip transhuge pages in ksm for now.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> 
> This is an idle concern that I haven't looked into but is there any conflict
> between khugepaged scanning the KSM scanning?
> 
> Specifically, I *think* the impact of this patch is that KSM will not
> accidentally split a huge page. Is that right? If so, it could do with
> being included in the changelog.

KSM wasn't aware of hugepages and in turn it'd never split them
anyway. We want KSM to split hugepages only when it finds two equal
subpages. That will happen later.

Right now there is no collision between ksmd and khugepaged: regular
pages, hugepages and ksm pages will co-exist fine in the same vma. The
only problem is that the system now has to start swapping before KSM
has a chance to find equal pages; we'll fix that in the future so KSM
can scan inside hugepages too and split them and merge the subpages as
needed before the memory pressure starts.

> On the other hand, can khugepaged be prevented from promoting a hugepage
> because of KSM?

Sure, khugepaged won't promote if there's any ksm page in the
range. That's not going to change. When KSM is started, the priority
remains saving memory. If people use enabled=madvise and
MADV_HUGEPAGE+MADV_MERGEABLE there is actually zero memory loss
because of THP, and there is a speed improvement for all pages that
aren't equal. So it's an ideal setup even for embedded. A regular
cloud setup would be enabled=always + MADV_MERGEABLE (with
enabled=always, MADV_HUGEPAGE becomes a noop).

On a related note, I'm also going to introduce a MADV_NO_HUGEPAGE; is
that a good name for it? Cloud management wants to be able to disable
THP on a per-VM basis (when the VMs are totally idle and low priority;
this currently also helps to maximize the power of KSM, which would
otherwise only be activated after the initial swapping, but the KSM
part will be fixed). It could also be achieved with enabled=madvise
and MADV_HUGEPAGE, but we don't want to change the system-wide default
just to disable THP on a per-VM basis: it's much nicer if the default
behavior of the host remains the same in case it's not a pure
hypervisor usage and there are other loads running in parallel to the
virt load. In theory a prctl(PR_NO_HUGEPAGE) could also do it and
would be usable from a wrapper (madvise can't be wrapped), but I think
MADV_NO_HUGEPAGE is cleaner and won't require brand new per-process
info.
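
(For reference, a minimal userspace sketch of the opt-in side, using
the MADV_HUGEPAGE added earlier in this series; MADV_NO_HUGEPAGE is
only the proposal above and doesn't exist yet:)

	#include <stdio.h>
	#include <sys/mman.h>

	/* Hint that a page aligned range benefits from THP; this is
	 * what enabled=madvise keys off.  The proposed MADV_NO_HUGEPAGE
	 * would be used the same way to opt a low priority VM out of
	 * THP even with enabled=always. */
	static void hint_thp(void *buf, size_t len)
	{
		if (madvise(buf, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");
	}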

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 45 of 66] remove PG_buddy
  2010-11-18 16:08     ` Mel Gorman
@ 2010-12-09 18:15       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 18:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:08:01PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:28:20PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > PG_buddy can be converted to _mapcount == -2. So the PG_compound_lock can be
> > added to page->flags without overflowing (because of the sparse section bits
> > increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. This also has to move
> > the memory hotplug code from _mapcount to lru.next to avoid any risk of
> > clashes. We can't use lru.next for PG_buddy removal, but memory hotplug can use
> > lru.next even more easily than the mapcount instead.
> > 
> 
> Does this make much of a difference? I confess I didn't read the patch closely
> because I didn't get the motivation.

The motivation is described in the first line. If I didn't remove
PG_buddy, introducing PG_compound_lock would overflow page->flags on
a 32bit build with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y. The bitflag
that was easiest to nuke was PG_buddy.
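
(The replacement encoding is tiny; roughly, instead of a page flag it
becomes:)

	static inline int PageBuddy(struct page *page)
	{
		return atomic_read(&page->_mapcount) == -2;
	}

	static inline void __SetPageBuddy(struct page *page)
	{
		VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
		atomic_set(&page->_mapcount, -2);
	}

	static inline void __ClearPageBuddy(struct page *page)
	{
		VM_BUG_ON(!PageBuddy(page));
		atomic_set(&page->_mapcount, -1);
	}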

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 51 of 66] set recommended min free kbytes
  2010-11-18 16:16     ` Mel Gorman
@ 2010-12-09 18:47       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 18:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:16:24PM +0000, Mel Gorman wrote:
> > +	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
> > +	recommended_min = HPAGE_PMD_NR * nr_zones * 2;
> > +
> 
> The really important value is pageblock_nr_pages here. It'll just happen
> to work on x86 and x86-64 but anti-fragmentation is really about
> pageblocks, not PMDs.
> 
> > +	/*
> > +	 * Make sure that on average at least two pageblocks are almost free
> > +	 * of another type, one for a migratetype to fall back to and a
> > +	 * second to avoid subsequent fallbacks of other types There are 3
> > +	 * MIGRATE_TYPES we care about.
> > +	 */
> > +	recommended_min += HPAGE_PMD_NR * nr_zones * 3 * 3;
> > +
> 
> Same on the use of pageblock_nr_pages. Also, you can replace 3 with
> MIGRATE_PCPTYPES.
> 
> > +	/* don't ever allow to reserve more than 5% of the lowmem */
> > +	recommended_min = min(recommended_min,
> > +			      (unsigned long) nr_free_buffer_pages() / 20);
> > +	recommended_min <<= (PAGE_SHIFT-10);
> > +
> > +	if (recommended_min > min_free_kbytes) {
> > +		min_free_kbytes = recommended_min;
> > +		setup_per_zone_wmarks();
> > +	}
> 
> 
> The timing this is called is important. Would you mind doing a quick
> debugging check by adding a printk to setup_zone_migrate_reserve() to ensure
> MIGRATE_RESERVE is getting set on sensible pageblocks? (see where the comment
> Suitable for reserving if this block is movable is) If MIGRATE_RESERVE blocks
> are not being created in a sensible fashion, atomic high-order allocations
> will suffer in mysterious ways.
> 
> SEtting the higher min free kbytes from userspace happens to work because
> the system is initialised and MIGRATE_MOVABLE exists but that might not be
> the case when automatically set like this patch.

When min_free_kbytes doesn't need to be increased (like on a huge
system with lots of ram) setup_per_zone_wmarks wasn't being called in
the late_initcall where this code runs, and the original
setup_per_zone_wmarks apparently wasn't calling
setup_zone_migrate_reserve (how can that be?).

Anyway, now I patched it like this and it seems to work properly with
the unconditional setup_per_zone_wmarks in
late_initcall(set_recommended_min_free_kbytes).

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -103,7 +103,7 @@ static int set_recommended_min_free_kbyt
 		nr_zones++;
 
 	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
-	recommended_min = HPAGE_PMD_NR * nr_zones * 2;
+	recommended_min = pageblock_nr_pages * nr_zones * 2;
 
 	/*
 	 * Make sure that on average at least two pageblocks are almost free
@@ -111,17 +111,17 @@ static int set_recommended_min_free_kbyt
 	 * second to avoid subsequent fallbacks of other types There are 3
 	 * MIGRATE_TYPES we care about.
 	 */
-	recommended_min += HPAGE_PMD_NR * nr_zones * 3 * 3;
+	recommended_min += pageblock_nr_pages * nr_zones *
+			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
 
 	/* don't ever allow to reserve more than 5% of the lowmem */
 	recommended_min = min(recommended_min,
 			      (unsigned long) nr_free_buffer_pages() / 20);
 	recommended_min <<= (PAGE_SHIFT-10);
 
-	if (recommended_min > min_free_kbytes) {
+	if (recommended_min > min_free_kbytes)
 		min_free_kbytes = recommended_min;
-		setup_per_zone_wmarks();
-	}
+	setup_per_zone_wmarks();
 	return 0;
 }
 late_initcall(set_recommended_min_free_kbytes);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,6 +3188,7 @@ static void setup_zone_migrate_reserve(s
 	 */
 	reserve = min(2, reserve);
 
+	printk("reserve start %d\n", reserve);
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		if (!pfn_valid(pfn))
 			continue;
@@ -3226,6 +3227,7 @@ static void setup_zone_migrate_reserve(s
 			move_freepages_block(zone, page, MIGRATE_MOVABLE);
 		}
 	}
+	printk("reserve end %d\n", reserve);
 }
 
 /*
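
(For scale, the numbers the formula produces on a typical x86-64 box,
with purely illustrative values:)

	/* 4k pages, 2M pageblocks, 3 populated zones, MIGRATE_PCPTYPES == 3 */
	unsigned long pageblock_nr_pages = 512, nr_zones = 3;
	unsigned long recommended_min;

	recommended_min  = pageblock_nr_pages * nr_zones * 2;	   /*  3072 pages */
	recommended_min += pageblock_nr_pages * nr_zones * 3 * 3; /* 13824 pages */
	recommended_min <<= 2;	/* PAGE_SHIFT - 10: 67584 kbytes, ~66M */
	/* ...and that is still clamped to 5% of lowmem afterwards. */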



hugeadm with CONFIG_TRANSPARENT_HUGEPAGE=n leads to:

reserve start 1
reserve end 0
reserve start 2
reserve end 0
reserve start 2
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 2
reserve end 0
reserve start 0
reserve end 0

With CONFIG_TRANSPARENT_HUGEPAGE=y, when I boot I see:

reserve start 1
reserve end 0
reserve start 2
reserve end 0
reserve start 2
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 2
reserve end 0
reserve start 0

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 51 of 66] set recommended min free kbytes
@ 2010-12-09 18:47       ` Andrea Arcangeli
  0 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 18:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:16:24PM +0000, Mel Gorman wrote:
> > +	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
> > +	recommended_min = HPAGE_PMD_NR * nr_zones * 2;
> > +
> 
> The really important value is pageblock_nr_pages here. It'll just happen
> to work on x86 and x86-64 but anti-fragmentation is really about
> pageblocks, not PMDs.
> 
> > +	/*
> > +	 * Make sure that on average at least two pageblocks are almost free
> > +	 * of another type, one for a migratetype to fall back to and a
> > +	 * second to avoid subsequent fallbacks of other types There are 3
> > +	 * MIGRATE_TYPES we care about.
> > +	 */
> > +	recommended_min += HPAGE_PMD_NR * nr_zones * 3 * 3;
> > +
> 
> Same on the use of pageblock_nr_pages. Also, you can replace 3 with
> MIGRATE_PCPTYPES.
> 
> > +	/* don't ever allow to reserve more than 5% of the lowmem */
> > +	recommended_min = min(recommended_min,
> > +			      (unsigned long) nr_free_buffer_pages() / 20);
> > +	recommended_min <<= (PAGE_SHIFT-10);
> > +
> > +	if (recommended_min > min_free_kbytes) {
> > +		min_free_kbytes = recommended_min;
> > +		setup_per_zone_wmarks();
> > +	}
> 
> 
> The timing this is called is important. Would you mind doing a quick
> debugging check by adding a printk to setup_zone_migrate_reserve() to ensure
> MIGRATE_RESERVE is getting set on sensible pageblocks? (see where the comment
> Suitable for reserving if this block is movable is) If MIGRATE_RESERVE blocks
> are not being created in a sensible fashion, atomic high-order allocations
> will suffer in mysterious ways.
> 
> SEtting the higher min free kbytes from userspace happens to work because
> the system is initialised and MIGRATE_MOVABLE exists but that might not be
> the case when automatically set like this patch.

When min_free_kbytes doesn't need to be increased (like huge system
with lots of ram) setup_per_zone_wmarks wasn't called in the
late_initcall where this code runs, and the original
setup_per_zone_wmarks apparently wasn't calling
setup_zone_migrate_reserve (how can it be?).

Anyway now I patched it like this and it seems to work properly with
the unconditional setup_per_zone_wmarks in
late_initcall(set_recommended_min_free_kbytes).

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -103,7 +103,7 @@ static int set_recommended_min_free_kbyt
 		nr_zones++;
 
 	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
-	recommended_min = HPAGE_PMD_NR * nr_zones * 2;
+	recommended_min = pageblock_nr_pages * nr_zones * 2;
 
 	/*
 	 * Make sure that on average at least two pageblocks are almost free
@@ -111,17 +111,17 @@ static int set_recommended_min_free_kbyt
 	 * second to avoid subsequent fallbacks of other types There are 3
 	 * MIGRATE_TYPES we care about.
 	 */
-	recommended_min += HPAGE_PMD_NR * nr_zones * 3 * 3;
+	recommended_min += pageblock_nr_pages * nr_zones *
+			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
 
 	/* don't ever allow to reserve more than 5% of the lowmem */
 	recommended_min = min(recommended_min,
 			      (unsigned long) nr_free_buffer_pages() / 20);
 	recommended_min <<= (PAGE_SHIFT-10);
 
-	if (recommended_min > min_free_kbytes) {
+	if (recommended_min > min_free_kbytes)
 		min_free_kbytes = recommended_min;
-		setup_per_zone_wmarks();
-	}
+	setup_per_zone_wmarks();
 	return 0;
 }
 late_initcall(set_recommended_min_free_kbytes);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3188,6 +3188,7 @@ static void setup_zone_migrate_reserve(s
 	 */
 	reserve = min(2, reserve);
 
+	printk("reserve start %d\n", reserve);
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		if (!pfn_valid(pfn))
 			continue;
@@ -3226,6 +3227,7 @@ static void setup_zone_migrate_reserve(s
 			move_freepages_block(zone, page, MIGRATE_MOVABLE);
 		}
 	}
+	printk("reserve end %d\n", reserve);
 }
 
 /*



hugeadm with CONFIG_TRANSPARENT_HUGEPAGE=n leads to:

reserve start 1
reserve end 0
reserve start 2
reserve end 0
reserve start 2
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 2
reserve end 0
reserve start 0
reserve end 0

With CONFIG_TRANSPARENT_HUGEPAGE=y at boot I see:

reserve start 1
reserve end 0
reserve start 2
reserve end 0
reserve start 2
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 0
reserve end 0
reserve start 2
reserve end 0
reserve start 0


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 52 of 66] enable direct defrag
  2010-11-18 16:17     ` Mel Gorman
@ 2010-12-09 18:57       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 18:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:17:41PM +0000, Mel Gorman wrote:
> On Wed, Nov 03, 2010 at 04:28:27PM +0100, Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > With memory compaction in, and lumpy-reclaim removed, it seems safe enough to
> 
> While I'm hoping that my series on using compaction will be used instead
> of outright deletion of lumpy reclaim, this patch would still make
> sense.

Ok :), s/removed/disabled/

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-11-18 16:22         ` Mel Gorman
@ 2010-12-09 19:04           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 19:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:22:45PM +0000, Mel Gorman wrote:
> Just to confirm - by hang, you mean grinds to a slow pace as opposed to
> coming to a complete stop and having to restart?

Hmm, it's as if you're gigabytes into swap: apps hang for a while, the
system is not really usable, and it swaps for most new memory
allocations despite there being plenty of memory free, but it's not a
deadlock of course.

BTW, alternatively I could:

 unsigned long transparent_hugepage_flags __read_mostly =
        (1<<TRANSPARENT_HUGEPAGE_FLAG)|
+#ifdef CONFIG_COMPACTION
+       (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
+#endif
        (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);

That would add GFP_ATOMIC to THP allocations if compaction wasn't
selected, but I think having compaction enabled diminishes the risk of
misconfigured kernels leading to unexpected measurements and behavior,
so I feel much safer keeping the select COMPACTION in this patch.
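
For clarity, a rough sketch of what that means for the allocation mask (the
helper name and exact flags here are illustrative, not the code in the
series): the DEFRAG flag decides whether a THP allocation may sleep and
compact, or behaves like an atomic allocation.

static inline gfp_t thp_alloc_gfpmask(bool defrag)
{
	/* GFP_HIGHUSER_MOVABLE already includes __GFP_WAIT */
	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_COMP;

	if (!defrag)
		gfp &= ~__GFP_WAIT;	/* no sleeping, no compaction/reclaim */
	return gfp;
}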

> Acked-by: Mel Gorman <mel@csn.ul.ie>

Added.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0
  2010-11-18 16:31     ` Mel Gorman
@ 2010-12-09 19:10       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-09 19:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Nov 18, 2010 at 04:31:24PM +0000, Mel Gorman wrote:
> I don't think this is related to THP although I see what you're doing.
> It should be handled on its own. I'd also wonder if some of the tg3
> failures are due to MIGRATE_RESERVE not being set properly when
> min_free_kbytes is automatically resized.

The failures also happened on older kernels, except they wouldn't be
printed in the kernel logs, but you're right that we may do better with
the migrate reserve enabled on huge systems too (it seems hugeadm
--set_recommended-min_free_kbytes was doing a little more than its
kernel counterpart on large systems with tons of RAM, now fixed).

My status as far as this patch is concerned:

=========
Subject: use compaction for all allocation orders

From: Andrea Arcangeli <aarcange@redhat.com>

It makes no sense not to enable compaction for small-order pages: we don't
want to end up with bad order-2 allocations but good and graceful order-9
allocations.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -476,7 +495,7 @@ unsigned long try_to_compact_pages(struc
 	 * made because an assumption is made that the page allocator can satisfy
 	 * the "cheaper" orders without taking special steps
 	 */
-	if (order <= PAGE_ALLOC_COSTLY_ORDER || !may_enter_fs || !may_perform_io)
+	if (!order || !may_enter_fs || !may_perform_io)
 		return rc;
 
 	count_vm_event(COMPACTSTALL);




===========
Subject: use compaction in kswapd for GFP_ATOMIC order > 0

From: Andrea Arcangeli <aarcange@redhat.com>

This takes advantage of memory compaction to properly generate pages of order >
0 when regular page reclaim fails, the priority level becomes more severe and we
still don't reach the proper watermarks.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -11,6 +11,9 @@
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	3
 
+#define COMPACT_MODE_DIRECT_RECLAIM	0
+#define COMPACT_MODE_KSWAPD		1
+
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
@@ -20,6 +23,9 @@ extern int sysctl_extfrag_handler(struct
 			void __user *buffer, size_t *length, loff_t *ppos);
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long compact_zone_order(struct zone *zone,
+					int order, gfp_t gfp_mask,
+					int compact_mode);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
 
@@ -59,6 +65,13 @@ static inline unsigned long try_to_compa
 	return COMPACT_CONTINUE;
 }
 
+static inline unsigned long compact_zone_order(struct zone *zone,
+					       int order, gfp_t gfp_mask,
+					       int compact_mode)
+{
+	return COMPACT_CONTINUE;
+}
+
 static inline void defer_compaction(struct zone *zone)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -38,6 +38,8 @@ struct compact_control {
 	unsigned int order;		/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
+
+	int compact_mode;
 };
 
 static unsigned long release_freepages(struct list_head *freelist)
@@ -357,10 +359,10 @@ static void update_nr_listpages(struct c
 }
 
 static int compact_finished(struct zone *zone,
-						struct compact_control *cc)
+			    struct compact_control *cc)
 {
 	unsigned int order;
-	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+	unsigned long watermark;
 
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
@@ -370,12 +372,27 @@ static int compact_finished(struct zone 
 		return COMPACT_COMPLETE;
 
 	/* Compaction run is not finished if the watermark is not met */
+	if (cc->compact_mode != COMPACT_MODE_KSWAPD)
+		watermark = low_wmark_pages(zone);
+	else
+		watermark = high_wmark_pages(zone);
+	watermark += (1 << cc->order);
+
 	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
 		return COMPACT_CONTINUE;
 
 	if (cc->order == -1)
 		return COMPACT_CONTINUE;
 
+	/*
+	 * Generating only one page of the right order is not enough
+	 * for kswapd, we must continue until we're above the high
+	 * watermark as a pool for high order GFP_ATOMIC allocations
+	 * too.
+	 */
+	if (cc->compact_mode == COMPACT_MODE_KSWAPD)
+		return COMPACT_CONTINUE;
+
 	/* Direct compactor: Is a suitable page free? */
 	for (order = cc->order; order < MAX_ORDER; order++) {
 		/* Job done if page is free of the right migratetype */
@@ -433,8 +450,9 @@ static int compact_zone(struct zone *zon
 	return ret;
 }
 
-static unsigned long compact_zone_order(struct zone *zone,
-						int order, gfp_t gfp_mask)
+unsigned long compact_zone_order(struct zone *zone,
+				 int order, gfp_t gfp_mask,
+				 int compact_mode)
 {
 	struct compact_control cc = {
 		.nr_freepages = 0,
@@ -442,6 +460,7 @@ static unsigned long compact_zone_order(
 		.order = order,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
+		.compact_mode = compact_mode,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -517,7 +536,8 @@ unsigned long try_to_compact_pages(struc
 			break;
 		}
 
-		status = compact_zone_order(zone, order, gfp_mask);
+		status = compact_zone_order(zone, order, gfp_mask,
+					    COMPACT_MODE_DIRECT_RECLAIM);
 		rc = max(status, rc);
 
 		if (zone_watermark_ok(zone, order, watermark, 0, 0))
@@ -547,6 +567,7 @@ static int compact_node(int nid)
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
 			.order = -1,
+			.compact_mode = COMPACT_MODE_DIRECT_RECLAIM,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -40,6 +40,7 @@
 #include <linux/memcontrol.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
+#include <linux/compaction.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -2148,6 +2149,7 @@ loop_again:
 		 * cause too much scanning of the lower zones.
 		 */
 		for (i = 0; i <= end_zone; i++) {
+			int compaction;
 			struct zone *zone = pgdat->node_zones + i;
 			int nr_slab;
 
@@ -2177,9 +2179,26 @@ loop_again:
 						lru_pages);
 			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
 			total_scanned += sc.nr_scanned;
+
+			compaction = 0;
+			if (order &&
+			    zone_watermark_ok(zone, 0,
+					       high_wmark_pages(zone),
+					      end_zone, 0) &&
+			    !zone_watermark_ok(zone, order,
+					       high_wmark_pages(zone),
+					       end_zone, 0)) {
+				compact_zone_order(zone,
+						   order,
+						   sc.gfp_mask,
+						   COMPACT_MODE_KSWAPD);
+				compaction = 1;
+			}
+
 			if (zone->all_unreclaimable)
 				continue;
-			if (nr_slab == 0 && !zone_reclaimable(zone))
+			if (!compaction && nr_slab == 0 &&
+			    !zone_reclaimable(zone))
 				zone->all_unreclaimable = 1;
 			/*
 			 * If we've done a decent amount of scanning and

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 44 of 66] skip transhuge pages in ksm for now
  2010-12-09 18:13       ` Andrea Arcangeli
@ 2010-12-10 12:17         ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-12-10 12:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Dave Hansen, Benjamin Herrenschmidt, Ingo Molnar,
	Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Dec 09, 2010 at 07:13:54PM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 18, 2010 at 04:06:13PM +0000, Mel Gorman wrote:
> > On Wed, Nov 03, 2010 at 04:28:19PM +0100, Andrea Arcangeli wrote:
> > > From: Andrea Arcangeli <aarcange@redhat.com>
> > > 
> > > Skip transhuge pages in ksm for now.
> > > 
> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > > Reviewed-by: Rik van Riel <riel@redhat.com>
> > 
> > Acked-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > This is an idle concern that I haven't looked into but is there any conflict
> > between khugepaged scanning the KSM scanning?
> > 
> > Specifically, I *think* the impact of this patch is that KSM will not
> > accidentally split a huge page. Is that right? If so, it could do with
> > being included in the changelog.
> 
> KSM wasn't aware of hugepages and in turn it'd never split them
> anyway. We want KSM to split hugepages only when it finds two equal
> subpages. That will happen later.
> 

Ok.

> Right now there is no collision between ksmd and khugepaged; regular pages,
> hugepages and ksm pages will co-exist fine in the same vma. The only
> problem is that the system now has to start swapping before KSM has a
> chance to find equal pages; we'll fix that in the future so KSM can
> scan inside hugepages too and split them and merge the subpages as
> needed before the memory pressure starts.
> 

Ok. So it's not a perfect mesh but it's not broken either.

> > On the other hand, can khugepaged be prevented from promoting a hugepage
> > because of KSM?
> 
> Sure, khugepaged won't promote if there's any ksm page in the
> range. That's not going to change. When KSM is started, the priority
> remains saving memory. If people use enabled=madvise and
> MADV_HUGEPAGE+MADV_MERGEABLE there is actually zero memory loss
> because of THP and there is a speed improvement for all pages that
> aren't equal. So it's an ideal setup even for embedded. A regular cloud
> setup would be enabled=always + MADV_MERGEABLE (with enabled=always,
> MADV_HUGEPAGE becomes a noop).
> 

That's a reasonable compromise. Thanks for clarifying.

> On a related note, I'm also going to introduce a MADV_NO_HUGEPAGE; is
> that a good name for it? Cloud management wants to be able to disable
> THP on a per-VM basis (when the VMs are totally idle and low priority;
> this currently also helps to maximize the power of KSM, which would
> otherwise be activated only after initial swapping, but the KSM part
> will be fixed). It could also be achieved with enabled=madvise and
> MADV_HUGEPAGE, but we don't want to change the system-wide default in
> order to disable THP on a per-VM basis: it's much nicer if the default
> behavior of the host remains the same in case it's not a pure
> hypervisor usage but there are other loads running in parallel to the
> virt load. In theory a prctl(PR_NO_HUGEPAGE) could also do it and it'd
> be possible to use from a wrapper (madvise can't be wrapped), but I
> think MADV_NO_HUGEPAGE is cleaner and it won't require brand new
> per-process info.
> 

I see no problem with the proposal. The name seems as good as any other
name. I guess the only other sensible alternative might be
MADV_BASEPAGE.
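
As a rough sketch of how a management tool might use the proposed advice
(the MADV_NO_HUGEPAGE name and the value below are placeholders, nothing
is merged yet):

#include <sys/mman.h>

#ifndef MADV_NO_HUGEPAGE
#define MADV_NO_HUGEPAGE 15	/* placeholder value, final number undecided */
#endif

/* opt a guest's RAM out of THP without touching the host-wide default */
static int disable_thp(void *guest_ram, size_t len)
{
	return madvise(guest_ram, len, MADV_NO_HUGEPAGE);
}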

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-12-09 19:04           ` Andrea Arcangeli
@ 2010-12-14  9:45             ` Mel Gorman
  -1 siblings, 0 replies; 331+ messages in thread
From: Mel Gorman @ 2010-12-14  9:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Thu, Dec 09, 2010 at 08:04:07PM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 18, 2010 at 04:22:45PM +0000, Mel Gorman wrote:
> > Just to confirm - by hang, you mean grinds to a slow pace as opposed to
> > coming to a complete stop and having to restart?
> 
> Hmm it's like if you're gigabytes in swap and apps hangs for a while
> and system is not really usable and it swaps for most new memory
> allocations despite there's plenty of memory free, but it's not a
> deadlock of course.
> 

Ok, but it's likely to be kswapd being very aggressive because it's
woken up frequently and tries to balance all zones. As long as it's not
deadlocking entirely, there isn't a more fundamental bug hiding in there
somewhere.

> BTW, alternatively I could:
> 
>  unsigned long transparent_hugepage_flags __read_mostly =
>         (1<<TRANSPARENT_HUGEPAGE_FLAG)|
> +#ifdef CONFIG_COMPACTION
> +       (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
> +#endif
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
> 
> That would adds GFP_ATOMIC to THP allocation if compaction wasn't
> selected,

With GFP_NO_KSWAPD it would stop thrashing, but I suspect the success rate
would be extremely low as nothing will be defragmenting memory.

> but I think having compaction enabled diminish the risk of
> misconfigured kernels leading to unexpected measurements and behavior,
> so I feel much safer to keep the select COMPACTION in this patch.
> 

Agreed.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled
  2010-12-14  9:45             ` Mel Gorman
@ 2010-12-14 16:06               ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-14 16:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

Hi Mel,

On Tue, Dec 14, 2010 at 09:45:56AM +0000, Mel Gorman wrote:
> On Thu, Dec 09, 2010 at 08:04:07PM +0100, Andrea Arcangeli wrote:
> > On Thu, Nov 18, 2010 at 04:22:45PM +0000, Mel Gorman wrote:
> > > Just to confirm - by hang, you mean grinds to a slow pace as opposed to
> > > coming to a complete stop and having to restart?
> > 
> > Hmm it's like if you're gigabytes in swap and apps hangs for a while
> > and system is not really usable and it swaps for most new memory
> > allocations despite there's plenty of memory free, but it's not a
> > deadlock of course.
> > 
> 
> Ok, but it's likely to be kswapd being very aggressive because it's
> woken up frequently and tries to balance all zones. Once it's not
> deadlocking entirely, there isn't a more fundamental bug hiding in there
> somewhere.

kswapd isn't activated by transhuge allocations because there's
khugepaged for that, and it's throttled to try a 2M alloc only once per
minute if there's fragmentation.

So the reason for the thrashing is direct lumpy reclaim.
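
For reference, a sketch of the throttling being referred to, assuming a
tunable along the lines of khugepaged's alloc_sleep_millisecs (names are
illustrative): after a failed hugepage allocation khugepaged simply sleeps
instead of pushing reclaim.

static unsigned int khugepaged_alloc_sleep_millisecs = 60000; /* ~once a minute */
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);

static void khugepaged_alloc_sleep(void)
{
	wait_event_freezable_timeout(khugepaged_wait, false,
		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
}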

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 36 of 66] memcg compound
  2010-11-19  1:10       ` KAMEZAWA Hiroyuki
@ 2010-12-14 17:38         ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-14 17:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

Hello Kame,

On Fri, Nov 19, 2010 at 10:10:41AM +0900, KAMEZAWA Hiroyuki wrote:
> If there are requirements of big page > 4GB, unsigned long should be used.

There aren't; even 1G is unthinkable at the moment (besides, it seems
on some implementations the CPU lacks a real 1G TLB and just prefetches
the same 2M TLB, so the TLB miss reads one less cacheline, but in
practice 1G doesn't seem to provide any measurable runtime benefit over
2M THP). Things will always be exponential: the benefit provided going
from 4k to 2M will always be an order of magnitude bigger than the
benefit from 2M to 1G, no matter how natively the hardware handles the
1G pages.

> > >  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > >  	struct mem_cgroup *mem = NULL;
> > >  	int ret;
> > > -	int csize = CHARGE_SIZE;
> > > +	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> > >  
> 
> unsigned long here.

This is to shut off a warning, because CHARGE_SIZE is calculated as a
multiple of PAGE_SIZE, which is unsigned long. csize was already int,
and unsigned long is not required. As you point out, it's not worth
batching even pages as small as 2G, so there's no need for unsigned long.
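
(The warning comes from the kernel's type-checking max(); roughly, from
include/linux/kernel.h, the (void) pointer comparison below is what emits
"comparison of distinct pointer types" when the two arguments differ in
type, hence the unsigned long cast.)

#define max(x, y) ({				\
	typeof(x) _max1 = (x);			\
	typeof(y) _max2 = (y);			\
	(void) (&_max1 == &_max2);		\
	_max1 > _max2 ? _max1 : _max2; })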

> > > @@ -2491,14 +2503,14 @@ __do_uncharge(struct mem_cgroup *mem, co
> > >  	if (batch->memcg != mem)
> > >  		goto direct_uncharge;
> > >  	/* remember freed charge and uncharge it later */
> > > -	batch->bytes += PAGE_SIZE;
> > > +	batch->bytes += page_size;
> 
> Hmm, isn't it simpler to avoid batched-uncharge when page_size > PAGE_SIZE ?

As you wish, so I'm changing it like this.

Archs where the pmd is implemented purely in software might actually
be able to use page sizes smaller than 2M, which may make sense to
batch, but for now, if you think this is simpler, I'll go for it. We
need simple.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2503,6 +2503,9 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
 		goto direct_uncharge;
 
+	if (page_size != PAGE_SIZE)
+		goto direct_uncharge;
+
 	/*
 	 * In typical case, batch->memcg == mem. This means we can
 	 * merge a series of uncharges to an uncharge of res_counter.
@@ -2511,9 +2514,9 @@ __do_uncharge(struct mem_cgroup *mem, co
 	if (batch->memcg != mem)
 		goto direct_uncharge;
 	/* remember freed charge and uncharge it later */
-	batch->bytes += page_size;
+	batch->bytes += PAGE_SIZE;
 	if (uncharge_memsw)
-		batch->memsw_bytes += page_size;
+		batch->memsw_bytes += PAGE_SIZE;
 	return;
 direct_uncharge:
 	res_counter_uncharge(&mem->res, page_size);


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 38 of 66] memcontrol: try charging huge pages from stock
  2010-11-19  1:14     ` KAMEZAWA Hiroyuki
@ 2010-12-14 17:38       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-14 17:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Fri, Nov 19, 2010 at 10:14:27AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 03 Nov 2010 16:28:13 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > The stock unit is just bytes, there is no reason to only take normal
> > pages from it.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> 
> Nonsense. The stock size is CHARGE_SIZE=32*PAGE_SIZE at maximum.
> If we make this larger than the hugepage size, that's at least 2M per cpu.
> This means memcg's resource "usage" accounting will have 128MB of inaccuracy.
> 
> Nack.

Removed, thanks!

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 39 of 66] memcg huge memory
  2010-11-19  1:19     ` KAMEZAWA Hiroyuki
@ 2010-12-14 17:38       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-14 17:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Fri, Nov 19, 2010 at 10:19:38AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 03 Nov 2010 16:28:14 +0100
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> > @@ -402,9 +408,15 @@ static int do_huge_pmd_wp_page_fallback(
> >  	for (i = 0; i < HPAGE_PMD_NR; i++) {
> >  		pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> >  					  vma, address);
> > -		if (unlikely(!pages[i])) {
> > -			while (--i >= 0)
> > +		if (unlikely(!pages[i] ||
> > +			     mem_cgroup_newpage_charge(pages[i], mm,
> > +						       GFP_KERNEL))) {
> > +			if (pages[i])
> >  				put_page(pages[i]);
> > +			while (--i >= 0) {
> > +				mem_cgroup_uncharge_page(pages[i]);
> > +				put_page(pages[i]);
> > +			}
> 
> Maybe you can use batched-uncharge here.
> ==
> mem_cgroup_uncharge_start()
> {
> 	do loop;
> }
> mem_cgroup_uncharge_end();
> ==
> Then, many atomic ops can be reduced.

Cute!

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -413,10 +413,12 @@ static int do_huge_pmd_wp_page_fallback(
 						       GFP_KERNEL))) {
 			if (pages[i])
 				put_page(pages[i]);
+			mem_cgroup_uncharge_start();
 			while (--i >= 0) {
 				mem_cgroup_uncharge_page(pages[i]);
 				put_page(pages[i]);
 			}
+			mem_cgroup_uncharge_end();
 			kfree(pages);
 			ret |= VM_FAULT_OOM;
 			goto out;


> 
> 
> >  			kfree(pages);
> >  			ret |= VM_FAULT_OOM;
> >  			goto out;
> > @@ -455,8 +467,10 @@ out:
> >  
> >  out_free_pages:
> >  	spin_unlock(&mm->page_table_lock);
> > -	for (i = 0; i < HPAGE_PMD_NR; i++)
> > +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +		mem_cgroup_uncharge_page(pages[i]);
> >  		put_page(pages[i]);
> > +	}
> 
> here, too.

This is actually a very unlikely path, handling a thread race
condition, but I'll add it anyway.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -469,10 +469,12 @@ out:
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
+	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
 		put_page(pages[i]);
 	}
+	mem_cgroup_uncharge_end();
 	kfree(pages);
 	goto out;
 }


> Hmm... it seems there is no code for move_account() of hugepages in the series.
> I think it needs some complicated work to walk the page table.

Agreed.

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 36 of 66] memcg compound
  2010-12-14 17:38         ` Andrea Arcangeli
@ 2010-12-15  0:12           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 331+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-12-15  0:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

On Tue, 14 Dec 2010 18:38:17 +0100
Andrea Arcangeli <aarcange@redhat.com> wrote:
 
> > > > @@ -2491,14 +2503,14 @@ __do_uncharge(struct mem_cgroup *mem, co
> > > >  	if (batch->memcg != mem)
> > > >  		goto direct_uncharge;
> > > >  	/* remember freed charge and uncharge it later */
> > > > -	batch->bytes += PAGE_SIZE;
> > > > +	batch->bytes += page_size;
> > 
> > Hmm, isn't it simpler to avoid batched-uncharge when page_size > PAGE_SIZE ?
> 
> As you wish, so I'm changing it like this.
> 
> archs where the pmd is implemented purely in software might actually
> be able to use page sizes smaller than 2M that may make sense to
> batch, but for now if you think this is simpler I'll go for it. We
> need simple.
> 

Thank you. Hmm... it seems not very simple :( I'm sorry.
Please do as you want.

-Kame

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2503,6 +2503,9 @@ __do_uncharge(struct mem_cgroup *mem, co
>  	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
>  		goto direct_uncharge;
>  
> +	if (page_size != PAGE_SIZE)
> +		goto direct_uncharge;
> +
>  	/*
>  	 * In typical case, batch->memcg == mem. This means we can
>  	 * merge a series of uncharges to an uncharge of res_counter.
> @@ -2511,9 +2514,9 @@ __do_uncharge(struct mem_cgroup *mem, co
>  	if (batch->memcg != mem)
>  		goto direct_uncharge;
>  	/* remember freed charge and uncharge it later */
> -	batch->bytes += page_size;
> +	batch->bytes += PAGE_SIZE;
>  	if (uncharge_memsw)
> -		batch->memsw_bytes += page_size;
> +		batch->memsw_bytes += PAGE_SIZE;
>  	return;
>  direct_uncharge:
>  	res_counter_uncharge(&mem->res, page_size);
> 
> 
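
As a side note, here is a rough, self-contained model in plain C of the
decision being discussed (not the kernel source; uncharge_batch, the
uncharge() helper and the bare counter below are invented stand-ins for
the per-task memcg batch, __do_uncharge() and the res_counter): only
PAGE_SIZE uncharges are folded into the batch, while a 2M compound page
bypasses it and hits the counter directly, so the batch bookkeeping stays
in base-page units.

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct uncharge_batch {
	bool          do_batch;	/* set between _start() and _end() */
	unsigned long bytes;	/* deferred uncharge amount */
};

static unsigned long counter;	/* stand-in for the res_counter */

static void uncharge(struct uncharge_batch *batch, unsigned long page_size)
{
	if (!batch->do_batch || page_size != PAGE_SIZE) {
		counter -= page_size;	/* direct_uncharge path */
		return;
	}
	batch->bytes += page_size;	/* deferred, flushed at _end() */
}

int main(void)
{
	struct uncharge_batch b = { .do_batch = true };

	counter = 512 * PAGE_SIZE + 4 * PAGE_SIZE;	/* one THP + 4 pages */
	uncharge(&b, 512 * PAGE_SIZE);	/* 2M compound page: goes direct */
	for (int i = 0; i < 4; i++)
		uncharge(&b, PAGE_SIZE);	/* small pages: batched */
	counter -= b.bytes;	/* what the _end() flush would do */
	printf("counter left: %lu\n", counter);	/* prints 0 */
	return 0;
}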


^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 36 of 66] memcg compound
  2010-12-15  0:12           ` KAMEZAWA Hiroyuki
@ 2010-12-15  5:29             ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2010-12-15  5:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, Christoph Lameter, Chris Wright,
	bpicco, KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

Hello,

On Wed, Dec 15, 2010 at 09:12:09AM +0900, KAMEZAWA Hiroyuki wrote:
> Thank you. Hmm... it seems not very simple :( I'm sorry.
> Please do as you want.

I did the change below; let me know if there's any problem with it.

What's left is mem_cgroup_move_parent...

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2503,6 +2503,9 @@ __do_uncharge(struct mem_cgroup *mem, co
> >  	if (!batch->do_batch || test_thread_flag(TIF_MEMDIE))
> >  		goto direct_uncharge;
> >  
> > +	if (page_size != PAGE_SIZE)
> > +		goto direct_uncharge;
> > +
> >  	/*
> >  	 * In typical case, batch->memcg == mem. This means we can
> >  	 * merge a series of uncharges to an uncharge of res_counter.
> > @@ -2511,9 +2514,9 @@ __do_uncharge(struct mem_cgroup *mem, co
> >  	if (batch->memcg != mem)
> >  		goto direct_uncharge;
> >  	/* remember freed charge and uncharge it later */
> > -	batch->bytes += page_size;
> > +	batch->bytes += PAGE_SIZE;
> >  	if (uncharge_memsw)
> > -		batch->memsw_bytes += page_size;
> > +		batch->memsw_bytes += PAGE_SIZE;
> >  	return;
> >  direct_uncharge:
> >  	res_counter_uncharge(&mem->res, page_size);
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 13 of 66] export maybe_mkwrite
  2010-11-03 15:27   ` Andrea Arcangeli
@ 2011-01-17 14:14     ` Michal Simek
  -1 siblings, 0 replies; 331+ messages in thread
From: Michal Simek @ 2011-01-17 14:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 3360 bytes --]

Andrea Arcangeli wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> huge_memory.c needs it too when it falls back to copying hugepages into regular
> fragmented pages if hugepage allocation fails during COW.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>

It wasn't a good idea to do it. mm/memory.c is used only on systems
with an MMU; systems without an MMU are broken.

Not sure what the right fix is, but I think one ifdef makes sense (git
patch in attachment).

Regards,
Michal


diff --git a/include/linux/mm.h b/include/linux/mm.h
index 956a355..f6385fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -470,6 +470,7 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].lru.prev = (void *)order;
 }
 
+#ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
  * servicing faults for write access.  In the normal case, do always want
@@ -482,6 +483,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 		pte = pte_mkwrite(pte);
 	return pte;
 }
+#endif
 
 /*
  * Multiple processes may "see" the same page. E.g. for untouched


> ---
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -416,6 +416,19 @@ static inline void set_compound_order(st
>  }
>  
>  /*
> + * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
> + * servicing faults for write access.  In the normal case, do always want
> + * pte_mkwrite.  But get_user_pages can cause write faults for mappings
> + * that do not have writing enabled, when used by access_process_vm.
> + */
> +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	if (likely(vma->vm_flags & VM_WRITE))
> +		pte = pte_mkwrite(pte);
> +	return pte;
> +}
> +
> +/*
>   * Multiple processes may "see" the same page. E.g. for untouched
>   * mappings of /dev/null, all processes see the same page full of
>   * zeroes, and text pages of executables and shared libraries have
> diff --git a/mm/memory.c b/mm/memory.c
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2048,19 +2048,6 @@ static inline int pte_unmap_same(struct 
>  	return same;
>  }
>  
> -/*
> - * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
> - * servicing faults for write access.  In the normal case, do always want
> - * pte_mkwrite.  But get_user_pages can cause write faults for mappings
> - * that do not have writing enabled, when used by access_process_vm.
> - */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> -{
> -	if (likely(vma->vm_flags & VM_WRITE))
> -		pte = pte_mkwrite(pte);
> -	return pte;
> -}
> -
>  static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
>  {
>  	/*
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


-- 
Michal Simek, Ing. (M.Eng)
PetaLogix - Linux Solutions for a Reconfigurable World
w: www.petalogix.com p: +61-7-30090663,+42-0-721842854 f: +61-7-30090663

[-- Attachment #2: 0001-mm-System-without-MMU-do-not-need-pte_mkwrite.patch --]
[-- Type: text/x-patch, Size: 1132 bytes --]

>From 9d09534daec636eafed279352c6b34d3b47c3a09 Mon Sep 17 00:00:00 2001
From: Michal Simek <monstr@monstr.eu>
Date: Mon, 17 Jan 2011 15:09:36 +0100
Subject: [PATCH] mm: System without MMU do not need pte_mkwrite

The patch "thp: export maybe_mkwrite"
(sha1 14fd403f2146f740942d78af4e0ee59396ad8eab)
breaks systems without MMU.

Signed-off-by: Michal Simek <monstr@monstr.eu>
---
 include/linux/mm.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 956a355..f6385fc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -470,6 +470,7 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].lru.prev = (void *)order;
 }
 
+#ifdef CONFIG_MMU
 /*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
  * servicing faults for write access.  In the normal case, do always want
@@ -482,6 +483,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 		pte = pte_mkwrite(pte);
 	return pte;
 }
+#endif
 
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
-- 
1.5.5.6


^ permalink raw reply related	[flat|nested] 331+ messages in thread

* Re: [PATCH 13 of 66] export maybe_mkwrite
  2011-01-17 14:14     ` Michal Simek
@ 2011-01-17 14:33       ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2011-01-17 14:33 UTC (permalink / raw)
  To: Michal Simek
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

Hi Michal,

On Mon, Jan 17, 2011 at 03:14:07PM +0100, Michal Simek wrote:
> Andrea Arcangeli wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > huge_memory.c needs it too when it falls back to copying hugepages into regular
> > fragmented pages if hugepage allocation fails during COW.
> > 
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Acked-by: Mel Gorman <mel@csn.ul.ie>
> 
> It wasn't a good idea to do it. mm/memory.c is used only on systems
> with an MMU; systems without an MMU are broken.
> 
> Not sure what the right fix is, but I think one ifdef makes sense (git
> patch in attachment).

Can you show the build failure with CONFIG_MMU=n so I can understand it
better? Other places in mm.h depend on pte_t/vm_area_struct/VM_WRITE
being defined; if a system has no MMU, nobody should be calling this in
the first place. Not saying your patch is wrong, but I'm trying to
understand exactly how it got broken, and the gcc error would show it
immediately.

This is only called by memory.o and huge_memory.o, and both are built
only if MMU=y.
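
As a side note for anyone reading along, the semantics are easy to see in
a tiny userspace model (not kernel code; the pte is just a flags word and
VM_WRITE/PTE_DIRTY/PTE_WRITE are invented stand-ins): the COW fault path
marks the new pte dirty and only makes it writable when the vma itself is
writable, which is why get_user_pages()-driven write faults on read-only
mappings never end up with a writable pte.

#include <stdio.h>

#define VM_WRITE  0x1UL
#define PTE_DIRTY 0x1UL
#define PTE_WRITE 0x2UL

struct vma { unsigned long vm_flags; };

static unsigned long pte_mkdirty(unsigned long pte) { return pte | PTE_DIRTY; }
static unsigned long pte_mkwrite(unsigned long pte) { return pte | PTE_WRITE; }

/* mirrors the shape of the helper under discussion */
static unsigned long maybe_mkwrite(unsigned long pte, struct vma *vma)
{
	if (vma->vm_flags & VM_WRITE)
		pte = pte_mkwrite(pte);
	return pte;
}

int main(void)
{
	struct vma rw = { .vm_flags = VM_WRITE };
	struct vma ro = { .vm_flags = 0 };

	/* the fault-path idiom: entry = maybe_mkwrite(pte_mkdirty(entry), vma) */
	printf("writable vma  -> pte %#lx\n", maybe_mkwrite(pte_mkdirty(0UL), &rw));
	printf("read-only vma -> pte %#lx\n", maybe_mkwrite(pte_mkdirty(0UL), &ro));
	return 0;
}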

Thanks!
Andrea

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 13 of 66] export maybe_mkwrite
  2011-01-17 14:33       ` Andrea Arcangeli
@ 2011-01-18 14:29         ` Michal Simek
  -1 siblings, 0 replies; 331+ messages in thread
From: Michal Simek @ 2011-01-18 14:29 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

Andrea Arcangeli wrote:
> Hi Michal,
> 
> On Mon, Jan 17, 2011 at 03:14:07PM +0100, Michal Simek wrote:
>> Andrea Arcangeli wrote:
>>> From: Andrea Arcangeli <aarcange@redhat.com>
>>>
>>> huge_memory.c needs it too when it falls back to copying hugepages into regular
>>> fragmented pages if hugepage allocation fails during COW.
>>>
>>> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>>> Acked-by: Rik van Riel <riel@redhat.com>
>>> Acked-by: Mel Gorman <mel@csn.ul.ie>
>> It wasn't a good idea to do it. mm/memory.c is used only on systems
>> with an MMU; systems without an MMU are broken.
>>
>> Not sure what the right fix is, but I think one ifdef makes sense (git
>> patch in attachment).
> 
> Can you show the build failure with CONFIG_MMU=n so I can understand it
> better? Other places in mm.h depend on pte_t/vm_area_struct/VM_WRITE
> being defined; if a system has no MMU, nobody should be calling this in
> the first place. Not saying your patch is wrong, but I'm trying to
> understand exactly how it got broken, and the gcc error would show it
> immediately.
> 
> This is only called by memory.o and huge_memory.o, and both are built
> only if MMU=y.

Of course: Look for example at this page:
http://www.monstr.eu/wiki/doku.php?id=log:2011-01-18_11_51_49#linux_next

Thanks,
Michal

-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 13 of 66] export maybe_mkwrite
  2011-01-18 14:29         ` Michal Simek
@ 2011-01-18 20:32           ` Andrea Arcangeli
  -1 siblings, 0 replies; 331+ messages in thread
From: Andrea Arcangeli @ 2011-01-18 20:32 UTC (permalink / raw)
  To: Michal Simek
  Cc: linux-mm, Linus Torvalds, Andrew Morton, linux-kernel,
	Marcelo Tosatti, Adam Litke, Avi Kivity, Hugh Dickins,
	Rik van Riel, Mel Gorman, Dave Hansen, Benjamin Herrenschmidt,
	Ingo Molnar, Mike Travis, KAMEZAWA Hiroyuki, Christoph Lameter,
	Chris Wright, bpicco, KOSAKI Motohiro, Balbir Singh,
	Michael S. Tsirkin, Peter Zijlstra, Johannes Weiner,
	Daisuke Nishimura, Chris Mason, Borislav Petkov

On Tue, Jan 18, 2011 at 03:29:42PM +0100, Michal Simek wrote:
> Of course: Look for example at this page:
> http://www.monstr.eu/wiki/doku.php?id=log:2011-01-18_11_51_49#linux_next

OK, now I see: the problem is the lack of pte_mkwrite with MMU=n.

So either we apply your patch, or we move maybe_mkwrite to the top
of huge_mm.h (before #ifdef CONFIG_TRANSPARENT_HUGEPAGE); it's up to
you...

^ permalink raw reply	[flat|nested] 331+ messages in thread

* Re: [PATCH 13 of 66] export maybe_mkwrite
  2011-01-18 20:32           ` Andrea Arcangeli
  (?)
@ 2011-01-20  7:03           ` Michal Simek
  -1 siblings, 0 replies; 331+ messages in thread
From: Michal Simek @ 2011-01-20  7:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Michal Simek, linux-mm, Linus Torvalds, Andrew Morton,
	linux-kernel, Marcelo Tosatti, Adam Litke, Avi Kivity,
	Hugh Dickins, Rik van Riel, Mel Gorman, Dave Hansen,
	Benjamin Herrenschmidt, Ingo Molnar, Mike Travis,
	KAMEZAWA Hiroyuki, Christoph Lameter, Chris Wright, bpicco,
	KOSAKI Motohiro, Balbir Singh, Michael S. Tsirkin,
	Peter Zijlstra, Johannes Weiner, Daisuke Nishimura, Chris Mason,
	Borislav Petkov

Andrea Arcangeli wrote:
> On Tue, Jan 18, 2011 at 03:29:42PM +0100, Michal Simek wrote:
>> Of course: Look for example at this page:
>> http://www.monstr.eu/wiki/doku.php?id=log:2011-01-18_11_51_49#linux_next
> 
> OK, now I see: the problem is the lack of pte_mkwrite with MMU=n.
> 
> So either we apply your patch, or we move maybe_mkwrite to the top
> of huge_mm.h (before #ifdef CONFIG_TRANSPARENT_HUGEPAGE); it's up to
> you...

If you mean me: IIRC there are some !MMU systems, which is why I prefer
to use my original version.

Who will take care of this fix? I'd prefer to get it into mainline ASAP
to fix all noMMU platforms.

Thanks,
Michal


-- 
Michal Simek, Ing. (M.Eng)
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

^ permalink raw reply	[flat|nested] 331+ messages in thread

end of thread, other threads:[~2011-01-20  7:03 UTC | newest]

Thread overview: 331+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-03 15:27 [PATCH 00 of 66] Transparent Hugepage Support #32 Andrea Arcangeli
2010-11-03 15:27 ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 01 of 66] disable lumpy when compaction is enabled Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-09  3:18   ` KOSAKI Motohiro
2010-11-09  3:18     ` KOSAKI Motohiro
2010-11-09 21:30     ` Andrea Arcangeli
2010-11-09 21:30       ` Andrea Arcangeli
2010-11-09 21:38       ` Mel Gorman
2010-11-09 21:38         ` Mel Gorman
2010-11-09 22:22         ` Andrea Arcangeli
2010-11-09 22:22           ` Andrea Arcangeli
2010-11-10 14:27           ` Mel Gorman
2010-11-10 14:27             ` Mel Gorman
2010-11-10 16:03             ` Andrea Arcangeli
2010-11-10 16:03               ` Andrea Arcangeli
2010-11-18  8:30   ` Mel Gorman
2010-11-18  8:30     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 02 of 66] mm, migration: Fix race between shift_arg_pages and rmap_walk by guaranteeing rmap_walk finds PTEs created within the temporary stack Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-09  3:01   ` KOSAKI Motohiro
2010-11-09  3:01     ` KOSAKI Motohiro
2010-11-09 21:36     ` Andrea Arcangeli
2010-11-09 21:36       ` Andrea Arcangeli
2010-11-18 11:13   ` Mel Gorman
2010-11-18 11:13     ` Mel Gorman
2010-11-18 17:13     ` Mel Gorman
2010-11-18 17:13       ` Mel Gorman
2010-11-19 17:38     ` Andrea Arcangeli
2010-11-19 17:38       ` Andrea Arcangeli
2010-11-19 17:54       ` Linus Torvalds
2010-11-19 17:54         ` Linus Torvalds
2010-11-19 19:26         ` Andrea Arcangeli
2010-11-19 19:26           ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 03 of 66] transparent hugepage support documentation Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 11:41   ` Mel Gorman
2010-11-18 11:41     ` Mel Gorman
2010-11-25 14:35     ` Andrea Arcangeli
2010-11-25 14:35       ` Andrea Arcangeli
2010-11-26 11:40       ` Mel Gorman
2010-11-26 11:40         ` Mel Gorman
2010-11-03 15:27 ` [PATCH 04 of 66] define MADV_HUGEPAGE Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 11:44   ` Mel Gorman
2010-11-18 11:44     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 05 of 66] compound_lock Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 11:49   ` Mel Gorman
2010-11-18 11:49     ` Mel Gorman
2010-11-18 17:28     ` Linus Torvalds
2010-11-18 17:28       ` Linus Torvalds
2010-11-25 16:21       ` Andrea Arcangeli
2010-11-25 16:21         ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 06 of 66] alter compound get_page/put_page Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 12:37   ` Mel Gorman
2010-11-18 12:37     ` Mel Gorman
2010-11-25 16:49     ` Andrea Arcangeli
2010-11-25 16:49       ` Andrea Arcangeli
2010-11-26 11:46       ` Mel Gorman
2010-11-26 11:46         ` Mel Gorman
2010-11-03 15:27 ` [PATCH 07 of 66] update futex compound knowledge Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 08 of 66] fix bad_page to show the real reason the page is bad Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-09  3:03   ` KOSAKI Motohiro
2010-11-09  3:03     ` KOSAKI Motohiro
2010-11-18 12:40   ` Mel Gorman
2010-11-18 12:40     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 09 of 66] clear compound mapping Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 10 of 66] add native_set_pmd_at Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 11 of 66] add pmd paravirt ops Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-26 18:01   ` Andrea Arcangeli
2010-11-26 18:01     ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 12 of 66] no paravirt version of pmd ops Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 13 of 66] export maybe_mkwrite Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2011-01-17 14:14   ` Michal Simek
2011-01-17 14:14     ` Michal Simek
2011-01-17 14:33     ` Andrea Arcangeli
2011-01-17 14:33       ` Andrea Arcangeli
2011-01-18 14:29       ` Michal Simek
2011-01-18 14:29         ` Michal Simek
2011-01-18 20:32         ` Andrea Arcangeli
2011-01-18 20:32           ` Andrea Arcangeli
2011-01-20  7:03           ` Michal Simek
2010-11-03 15:27 ` [PATCH 14 of 66] comment reminder in destroy_compound_page Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 15 of 66] config_transparent_hugepage Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 16 of 66] special pmd_trans_* functions Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 12:51   ` Mel Gorman
2010-11-18 12:51     ` Mel Gorman
2010-11-25 17:10     ` Andrea Arcangeli
2010-11-25 17:10       ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 17 of 66] add pmd mangling generic functions Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 12:52   ` Mel Gorman
2010-11-18 12:52     ` Mel Gorman
2010-11-18 17:32     ` Linus Torvalds
2010-11-18 17:32       ` Linus Torvalds
2010-11-25 17:35       ` Andrea Arcangeli
2010-11-25 17:35         ` Andrea Arcangeli
2010-11-26 22:24         ` Linus Torvalds
2010-11-26 22:24           ` Linus Torvalds
2010-12-02 17:50           ` Andrea Arcangeli
2010-12-02 17:50             ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 18 of 66] add pmd mangling functions to x86 Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 13:04   ` Mel Gorman
2010-11-18 13:04     ` Mel Gorman
2010-11-26 17:57     ` Andrea Arcangeli
2010-11-26 17:57       ` Andrea Arcangeli
2010-11-29 10:23       ` Mel Gorman
2010-11-29 10:23         ` Mel Gorman
2010-11-29 16:59         ` Andrea Arcangeli
2010-11-29 16:59           ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 19 of 66] bail out gup_fast on splitting pmd Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 20 of 66] pte alloc trans splitting Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 21 of 66] add pmd mmu_notifier helpers Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:27 ` [PATCH 22 of 66] clear page compound Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 13:11   ` Mel Gorman
2010-11-18 13:11     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 23 of 66] add pmd_huge_pte to mm_struct Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-18 13:13   ` Mel Gorman
2010-11-18 13:13     ` Mel Gorman
2010-11-03 15:27 ` [PATCH 24 of 66] split_huge_page_mm/vma Andrea Arcangeli
2010-11-03 15:27   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 25 of 66] split_huge_page paging Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 26 of 66] clear_copy_huge_page Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 27 of 66] kvm mmu transparent hugepage support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 28 of 66] _GFP_NO_KSWAPD Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 13:18   ` Mel Gorman
2010-11-18 13:18     ` Mel Gorman
2010-11-29 19:03     ` Andrea Arcangeli
2010-11-29 19:03       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 29 of 66] don't alloc harder for gfp nomemalloc even if nowait Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  3:05   ` KOSAKI Motohiro
2010-11-09  3:05     ` KOSAKI Motohiro
2010-11-18 13:19   ` Mel Gorman
2010-11-18 13:19     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 30 of 66] transparent hugepage core Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:12   ` Mel Gorman
2010-11-18 15:12     ` Mel Gorman
2010-12-07 21:24     ` Andrea Arcangeli
2010-12-07 21:24       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 31 of 66] split_huge_page anon_vma ordering dependency Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:13   ` Mel Gorman
2010-11-18 15:13     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 32 of 66] verify pmd_trans_huge isn't leaking Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:15   ` Mel Gorman
2010-11-18 15:15     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 33 of 66] madvise(MADV_HUGEPAGE) Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:19   ` Mel Gorman
2010-11-18 15:19     ` Mel Gorman
2010-12-09 17:14     ` Andrea Arcangeli
2010-12-09 17:14       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 34 of 66] add PageTransCompound Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:20   ` Mel Gorman
2010-11-18 15:20     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 35 of 66] pmd_trans_huge migrate bugcheck Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 36 of 66] memcg compound Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 15:26   ` Mel Gorman
2010-11-18 15:26     ` Mel Gorman
2010-11-19  1:10     ` KAMEZAWA Hiroyuki
2010-11-19  1:10       ` KAMEZAWA Hiroyuki
2010-12-14 17:38       ` Andrea Arcangeli
2010-12-14 17:38         ` Andrea Arcangeli
2010-12-15  0:12         ` KAMEZAWA Hiroyuki
2010-12-15  0:12           ` KAMEZAWA Hiroyuki
2010-12-15  5:29           ` Andrea Arcangeli
2010-12-15  5:29             ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 37 of 66] transhuge-memcg: commit tail pages at charge Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 38 of 66] memcontrol: try charging huge pages from stock Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-19  1:14   ` KAMEZAWA Hiroyuki
2010-11-19  1:14     ` KAMEZAWA Hiroyuki
2010-12-14 17:38     ` Andrea Arcangeli
2010-12-14 17:38       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 39 of 66] memcg huge memory Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-19  1:19   ` KAMEZAWA Hiroyuki
2010-11-19  1:19     ` KAMEZAWA Hiroyuki
2010-12-14 17:38     ` Andrea Arcangeli
2010-12-14 17:38       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 40 of 66] transparent hugepage vmstat Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 41 of 66] khugepaged Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 42 of 66] khugepaged vma merge Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 43 of 66] don't leave orhpaned swap cache after ksm merging Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  3:08   ` KOSAKI Motohiro
2010-11-09  3:08     ` KOSAKI Motohiro
2010-11-09 21:40     ` Andrea Arcangeli
2010-11-09 21:40       ` Andrea Arcangeli
2010-11-10  7:49       ` Hugh Dickins
2010-11-10  7:49         ` Hugh Dickins
2010-11-10 16:08         ` Andrea Arcangeli
2010-11-10 16:08           ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 44 of 66] skip transhuge pages in ksm for now Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:06   ` Mel Gorman
2010-11-18 16:06     ` Mel Gorman
2010-12-09 18:13     ` Andrea Arcangeli
2010-12-09 18:13       ` Andrea Arcangeli
2010-12-10 12:17       ` Mel Gorman
2010-12-10 12:17         ` Mel Gorman
2010-11-03 15:28 ` [PATCH 45 of 66] remove PG_buddy Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:08   ` Mel Gorman
2010-11-18 16:08     ` Mel Gorman
2010-12-09 18:15     ` Andrea Arcangeli
2010-12-09 18:15       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 46 of 66] add x86 32bit support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 47 of 66] mincore transparent hugepage support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 48 of 66] add pmd_modify Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 49 of 66] mprotect: pass vma down to page table walkers Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 50 of 66] mprotect: transparent huge page support Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 51 of 66] set recommended min free kbytes Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:16   ` Mel Gorman
2010-11-18 16:16     ` Mel Gorman
2010-12-09 18:47     ` Andrea Arcangeli
2010-12-09 18:47       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 52 of 66] enable direct defrag Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:17   ` Mel Gorman
2010-11-18 16:17     ` Mel Gorman
2010-12-09 18:57     ` Andrea Arcangeli
2010-12-09 18:57       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 53 of 66] add numa awareness to hugepage allocations Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-29  5:38   ` Daisuke Nishimura
2010-11-29  5:38     ` Daisuke Nishimura
2010-11-29 16:11     ` Andrea Arcangeli
2010-11-29 16:11       ` Andrea Arcangeli
2010-11-30  0:38       ` Daisuke Nishimura
2010-11-30  0:38         ` Daisuke Nishimura
2010-11-30 19:01         ` Andrea Arcangeli
2010-11-30 19:01           ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 54 of 66] transparent hugepage config choice Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 55 of 66] select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE enabled Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  6:20   ` KOSAKI Motohiro
2010-11-09  6:20     ` KOSAKI Motohiro
2010-11-09 21:11     ` Andrea Arcangeli
2010-11-09 21:11       ` Andrea Arcangeli
2010-11-14  5:07       ` KOSAKI Motohiro
2010-11-14  5:07         ` KOSAKI Motohiro
2010-11-15 15:13         ` Andrea Arcangeli
2010-11-15 15:13           ` Andrea Arcangeli
2010-11-18 16:22       ` Mel Gorman
2010-11-18 16:22         ` Mel Gorman
2010-12-09 19:04         ` Andrea Arcangeli
2010-12-09 19:04           ` Andrea Arcangeli
2010-12-14  9:45           ` Mel Gorman
2010-12-14  9:45             ` Mel Gorman
2010-12-14 16:06             ` Andrea Arcangeli
2010-12-14 16:06               ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 56 of 66] transhuge isolate_migratepages() Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:25   ` Mel Gorman
2010-11-18 16:25     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 57 of 66] avoid breaking huge pmd invariants in case of vma_adjust failures Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 58 of 66] don't allow transparent hugepage support without PSE Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 59 of 66] mmu_notifier_test_young Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 60 of 66] freeze khugepaged and ksmd Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 61 of 66] use compaction for GFP_ATOMIC order > 0 Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09 10:27   ` KOSAKI Motohiro
2010-11-09 10:27     ` KOSAKI Motohiro
2010-11-09 21:49     ` Andrea Arcangeli
2010-11-09 21:49       ` Andrea Arcangeli
2010-11-18 16:31   ` Mel Gorman
2010-11-18 16:31     ` Mel Gorman
2010-12-09 19:10     ` Andrea Arcangeli
2010-12-09 19:10       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 62 of 66] disable transparent hugepages by default on small systems Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:34   ` Mel Gorman
2010-11-18 16:34     ` Mel Gorman
2010-11-03 15:28 ` [PATCH 63 of 66] fix anon memory statistics with transparent hugepages Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 64 of 66] scale nr_rotated to balance memory pressure Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-09  6:16   ` KOSAKI Motohiro
2010-11-09  6:16     ` KOSAKI Motohiro
2010-11-18 19:15     ` Andrea Arcangeli
2010-11-18 19:15       ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 65 of 66] transparent hugepage sysfs meminfo Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-03 15:28 ` [PATCH 66 of 66] add debug checks for mapcount related invariants Andrea Arcangeli
2010-11-03 15:28   ` Andrea Arcangeli
2010-11-18 16:39 ` [PATCH 00 of 66] Transparent Hugepage Support #32 Mel Gorman
2010-11-18 16:39   ` Mel Gorman
