* [PATCH v12 0/8] MADV_FREE support
@ 2014-07-09  6:22 ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim

This patch set enables the MADV_FREE hint for the madvise syscall,
which other OSes have supported for years. [PATCH 1] includes the
details.

[1] supports MADV_FREE for !THP pages, so if the VM encounters a
THP page in syscall context, it splits the THP page.
[2-7] prepare for calling the madvise syscall without THP splitting.
[8] enables THP page support for MADV_FREE.
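
For reference, here is a minimal userspace sketch of the intended
usage (MADV_FREE is the value patch 1 adds to mman-common.h; the
buffer size here is made up for illustration):

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 5	/* from patch 1's mman-common.h */
#endif

int main(void)
{
	size_t len = 64 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 1, len);		/* dirty the pages */
	madvise(buf, len, MADV_FREE);	/* contents no longer needed */
	/*
	 * Under memory pressure the kernel may now discard these pages
	 * instead of swapping them out; a later store reuses them with
	 * no fault/allocation/zeroing cost if reclaim never ran.
	 */
	memset(buf, 2, len);		/* writing again cancels the hint */
	return 0;
}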


* From v11
 * Fix arm build - Steve
 * Separate patch for arm and arm64 - Steve
 * Remove unnecessary check - Kirill
 * Skip non-vm_normal page - Kirill
 * Add Acked-by - Zhang
 * Sparc64 build fix
 * Pagetable walker THP handling fix

* From v10
 * Add Acked-by from arch maintainers (x86, s390)
 * Pagewalker-based pagetable walking - Kirill
 * Fix try_to_unmap_one broken with hwpoison - Kirill
 * Use VM_BUG_ON_PAGE in madvise_free_pmd - Kirill
 * Fix pgtable-3level.h for arm - Steve

* From v9
 * Add Acked-by - Rik
 * Add THP page support - Kirill

* From v8
 * Rebased-on v3.16-rc2-mmotm-2014-06-25-16-44

* From v7
 * Rebased-on next-20140613

* From v6
 * Remove page from swapcache at syscall time
 * Move utility functions from memory.c to madvise.c - Johannes
 * Rename utility functions - Johannes
 * Remove unnecessary checks from vmscan.c - Johannes
 * Rebased-on v3.15-rc5-mmotm-2014-05-16-16-56
 * Drop Reviewed-by tags because there were some changes since then.

* From v5
 * Fix PPC problem which doesn't flush TLB - Rik
 * Remove unnecessary lazyfree_range stub function - Rik
 * Rebased on v3.15-rc5

* From v4
 * Add Reviewed-by: Zhang Yanfei
 * Rebase on v3.15-rc1-mmotm-2014-04-15-16-14

* From v3
 * Add "how to work part" in description - Zhang
 * Add page_discardable utility function - Zhang
 * Clean up

* From v2
 * Remove forceful dirty marking of swap-read pages - Johannes
 * Remove deactivation logic of lazyfreed page
 * Rebased on 3.14
 * Remove RFC tag

* From v1
 * Use custom page table walker for madvise_free - Johannes
 * Remove PG_lazypage flag - Johannes
 * Do madvise_dontneed instead of madvise_free on swapless systems

Minchan Kim (8):
  [1] mm: support madvise(MADV_FREE)
  [2] x86: add pmd_[dirty|mkclean] for THP
  [3] sparc: add pmd_[dirty|mkclean] for THP
  [4] powerpc: add pmd_[dirty|mkclean] for THP
  [5] s390: add pmd_[dirty|mkclean] for THP
  [6] arm: add pmd_[dirty|mkclean] for THP
  [7] arm64: add pmd_[dirty|mkclean] for THP
  [8] mm: Don't split THP page when syscall is called

 arch/arm/include/asm/pgtable-3level.h    |   3 +
 arch/arm64/include/asm/pgtable.h         |   2 +
 arch/powerpc/include/asm/pgtable-ppc64.h |   2 +
 arch/s390/include/asm/pgtable.h          |  12 +++
 arch/sparc/include/asm/pgtable_64.h      |  16 ++++
 arch/x86/include/asm/pgtable.h           |  10 ++
 include/linux/huge_mm.h                  |   4 +
 include/linux/rmap.h                     |   9 +-
 include/linux/vm_event_item.h            |   1 +
 include/uapi/asm-generic/mman-common.h   |   1 +
 mm/huge_memory.c                         |  35 +++++++
 mm/madvise.c                             | 155 +++++++++++++++++++++++++++++++
 mm/rmap.c                                |  46 ++++++++-
 mm/vmscan.c                              |  64 +++++++++----
 mm/vmstat.c                              |   1 +
 15 files changed, 341 insertions(+), 20 deletions(-)

-- 
2.0.0


* [PATCH v12 1/8] mm: support madvise(MADV_FREE)
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim

Linux doesn't have the ability to free pages lazily, while other
OSes have long supported it as madvise(MADV_FREE).

The gain is clear: under memory pressure, the kernel can discard
freed pages rather than swap them out or go OOM.

Without memory pressure, freed pages are reused by userspace with
no additional overhead (e.g., page fault + allocation + zeroing).

It works as follows.

When the madvise syscall is called, the VM clears the dirty bit of
the ptes in the range. If memory pressure happens later, the VM
checks the dirty bit in the page table; if it is still clean, the
page is a "lazyfree" page, so the VM can discard it instead of
swapping it out. If there was a store to the page before the VM
picked it for reclaim, the dirty bit is set, so the VM swaps the
page out instead of discarding it.
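
Condensed to pseudo-C (discard_page() and swap_out() are placeholder
names for this sketch; the real logic is in the diff below):

	/* madvise(MADV_FREE) side, for each pte in the range: */
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	ptent = pte_mkold(pte_mkclean(ptent));	/* clear young and dirty */
	set_pte_at(mm, addr, pte, ptent);

	/* reclaim side, when shrink_page_list() examines the page: */
	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
			!PageDirty(page))
		discard_page(page);	/* still clean: drop it for free */
	else
		swap_out(page);		/* written since the hint: keep data */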

The first heavy users would be general-purpose allocators (e.g.,
jemalloc, tcmalloc, and hopefully glibc one day); jemalloc and
tcmalloc already support the feature on other OSes (e.g., FreeBSD).

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Stepping:              7
CPU MHz:               2801.000
BogoMIPS:              5581.64
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3

ebizzy benchmark (./ebizzy -S 10 -n 512)

 vanilla-jemalloc		MADV_free-jemalloc

1 thread
records:  10              records:  10
avg:      7682.10         avg:      15306.10
std:      62.35(0.81%)    std:      347.99(2.27%)
max:      7770.00         max:      15622.00
min:      7598.00         min:      14772.00

2 thread
records:  10              records:  10
avg:      12747.50        avg:      24171.00
std:      792.06(6.21%)   std:      895.18(3.70%)
max:      13337.00        max:      26023.00
min:      10535.00        min:      23152.00

4 thread
records:  10              records:  10
avg:      16474.60        avg:      33717.90
std:      1496.45(9.08%)  std:      2008.97(5.96%)
max:      17877.00        max:      35958.00
min:      12224.00        min:      29565.00

8 thread
records:  10              records:  10
avg:      16778.50        avg:      33308.10
std:      825.53(4.92%)   std:      1668.30(5.01%)
max:      17543.00        max:      36010.00
min:      14576.00        min:      29577.00

16 thread
records:  10              records:  10
avg:      20614.40        avg:      35516.30
std:      602.95(2.92%)   std:      1283.65(3.61%)
max:      21753.00        max:      37178.00
min:      19605.00        min:      33217.00

32 thread
records:  10              records:  10
avg:      22771.70        avg:      36018.50
std:      598.94(2.63%)   std:      1046.76(2.91%)
max:      24035.00        max:      37266.00
min:      22108.00        min:      34149.00

In summary, MADV_FREE is about two times faster than MADV_DONTNEED.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jason Evans <je@fb.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h                   |   9 ++-
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 136 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |  42 +++++++++-
 mm/vmscan.c                            |  40 ++++++++--
 mm/vmstat.c                            |   1 +
 7 files changed, 218 insertions(+), 12 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index be574506e6a9..0ba377b97a38 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -75,6 +75,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_FREE = 8,			/* free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
@@ -181,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
  * Called from mm/vmscan.c to handle paging out
  */
 int page_referenced(struct page *, int is_locked,
-			struct mem_cgroup *memcg, unsigned long *vm_flags);
+			struct mem_cgroup *memcg, unsigned long *vm_flags,
+			int *is_dirty);
 
 #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
 
@@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
 
 static inline int page_referenced(struct page *page, int is_locked,
 				  struct mem_cgroup *memcg,
-				  unsigned long *vm_flags)
+				  unsigned long *vm_flags,
+				  int *is_pte_dirty)
 {
 	*vm_flags = 0;
+	if (is_pte_dirty)
+		*is_pte_dirty = 0;
 	return 0;
 }
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index ced92345c963..e2d3fb1e9814 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30da4ab..55b42e5e32a3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,14 @@
 #include <linux/blkdev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
+
+struct madvise_free_private {
+	struct vm_area_struct *vma;
+	struct mmu_gather *tlb;
+};
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -251,6 +260,124 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+
+{
+	struct madvise_free_private *fp = walk->private;
+	struct mmu_gather *tlb = fp->tlb;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = fp->vma;
+	spinlock_t *ptl;
+	pte_t *pte, ptent;
+	struct page *page;
+
+	split_huge_page_pmd(vma, addr, pmd);
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (PageSwapCache(page)) {
+			if (trylock_page(page)) {
+				if (try_to_free_swap(page))
+					ClearPageDirty(page);
+				unlock_page(page);
+			} else
+				continue;
+		}
+
+		/*
+		 * Some architectures (e.g., PPC) don't update the TLB
+		 * with set_pte_at and tlb_remove_tlb_entry, so for
+		 * portability, remap the pte as old|clean after
+		 * clearing it.
+		 */
+		ptent = ptep_get_and_clear_full(mm, addr, pte,
+						tlb->fullmm);
+		ptent = pte_mkold(ptent);
+		ptent = pte_mkclean(ptent);
+		set_pte_at(mm, addr, pte, ptent);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+	}
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+	return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct madvise_free_private fp = {
+		.vma = vma,
+		.tlb = tlb,
+	};
+
+	struct mm_walk free_walk = {
+		.pmd_entry = madvise_free_pte_range,
+		.mm = vma->vm_mm,
+		.private = &fp,
+	};
+
+	BUG_ON(addr >= end);
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &free_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	unsigned long start, end;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	/* MADV_FREE works for only anon vma at the moment */
+	if (vma->vm_file)
+		return -EINVAL;
+
+	start = max(vma->vm_start, start_addr);
+	if (start >= vma->vm_end)
+		return -EINVAL;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start, end);
+	update_hiwater_rss(mm);
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+
+	return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	return madvise_free_single_vma(vma, start, end);
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -381,6 +508,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on a swapless or swap-full system.
+		 */
+		if (get_nr_swap_pages() > 0)
+			return madvise_free(vma, prev, start, end);
+		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
@@ -400,6 +535,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index 7928ddd91b6e..7ac6e059aba7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -663,6 +663,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 }
 
 struct page_referenced_arg {
+	int dirtied;
 	int mapcount;
 	int referenced;
 	unsigned long vm_flags;
@@ -677,6 +678,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
 	int referenced = 0;
+	int dirty = 0;
 	struct page_referenced_arg *pra = arg;
 
 	if (unlikely(PageTransHuge(page))) {
@@ -700,6 +702,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		/* go ahead even if the pmd is pmd_trans_splitting() */
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
+
+		/*
+		 * In this implementation, MADV_FREE doesn't support THP free
+		 */
+		dirty++;
 		spin_unlock(ptl);
 	} else {
 		pte_t *pte;
@@ -729,6 +736,10 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
 				referenced++;
 		}
+
+		if (pte_dirty(*pte))
+			dirty++;
+
 		pte_unmap_unlock(pte, ptl);
 	}
 
@@ -737,6 +748,9 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		pra->vm_flags |= vma->vm_flags;
 	}
 
+	if (dirty)
+		pra->dirtied++;
+
 	pra->mapcount--;
 	if (!pra->mapcount)
 		return SWAP_SUCCESS; /* To break the loop */
@@ -761,6 +775,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
  * @is_locked: caller holds lock on the page
  * @memcg: target memory cgroup
  * @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_pte_dirty: ptes which have the dirty bit set - used for lazyfree pages
  *
  * Quick test_and_clear_referenced for all mappings to a page,
  * returns the number of ptes which referenced the page.
@@ -768,7 +783,8 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
 int page_referenced(struct page *page,
 		    int is_locked,
 		    struct mem_cgroup *memcg,
-		    unsigned long *vm_flags)
+		    unsigned long *vm_flags,
+		    int *is_pte_dirty)
 {
 	int ret;
 	int we_locked = 0;
@@ -783,6 +799,9 @@ int page_referenced(struct page *page,
 	};
 
 	*vm_flags = 0;
+	if (is_pte_dirty)
+		*is_pte_dirty = 0;
+
 	if (!page_mapped(page))
 		return 0;
 
@@ -810,6 +829,9 @@ int page_referenced(struct page *page,
 	if (we_locked)
 		unlock_page(page);
 
+	if (is_pte_dirty)
+		*is_pte_dirty = pra.dirtied;
+
 	return pra.referenced;
 }
 
@@ -1128,6 +1150,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
 	enum ttu_flags flags = (enum ttu_flags)arg;
+	int dirty = 0;
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1157,7 +1180,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
-	if (pte_dirty(pteval))
+	dirty = pte_dirty(pteval);
+	if (dirty)
 		set_page_dirty(page);
 
 	/* Update high watermark before we lower rss */
@@ -1186,6 +1210,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		swp_entry_t entry = { .val = page_private(page) };
 		pte_t swp_pte;
 
+		if (flags & TTU_FREE) {
+			VM_BUG_ON_PAGE(PageSwapCache(page), page);
+			if (!dirty && !PageDirty(page)) {
+				/* It's a freeable page by MADV_FREE */
+				dec_mm_counter(mm, MM_ANONPAGES);
+				goto discard;
+			} else {
+				set_pte_at(mm, address, pte, pteval);
+				ret = SWAP_FAIL;
+				goto out_unmap;
+			}
+		}
+
 		if (PageSwapCache(page)) {
 			/*
 			 * Store the swap location in the pte.
@@ -1227,6 +1264,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	} else
 		dec_mm_counter(mm, MM_FILEPAGES);
 
+discard:
 	page_remove_rmap(page);
 	page_cache_release(page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d24fd63b209..914815a43362 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -707,13 +707,17 @@ enum page_references {
 };
 
 static enum page_references page_check_references(struct page *page,
-						  struct scan_control *sc)
+						  struct scan_control *sc,
+						  bool *freeable)
 {
 	int referenced_ptes, referenced_page;
 	unsigned long vm_flags;
+	int pte_dirty;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
 	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
-					  &vm_flags);
+					  &vm_flags, &pte_dirty);
 	referenced_page = TestClearPageReferenced(page);
 
 	/*
@@ -754,6 +758,10 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
+			!PageDirty(page))
+		*freeable = true;
+
 	/* Reclaim if clean, defer dirty pages to writeback */
 	if (referenced_page && !PageSwapBacked(page))
 		return PAGEREF_RECLAIM_CLEAN;
@@ -823,6 +831,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool freeable = false;
 
 		cond_resched();
 
@@ -945,7 +954,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (!force_reclaim)
-			references = page_check_references(page, sc);
+			references = page_check_references(page, sc,
+							&freeable);
 
 		switch (references) {
 		case PAGEREF_ACTIVATE:
@@ -961,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 */
-		if (PageAnon(page) && !PageSwapCache(page)) {
+		if (PageAnon(page) && !PageSwapCache(page) && !freeable) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
@@ -976,8 +986,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page, ttu_flags)) {
+		if (page_mapped(page) && (mapping || freeable)) {
+			switch (try_to_unmap(page,
+				freeable ? TTU_FREE : ttu_flags)) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -985,7 +996,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			case SWAP_MLOCK:
 				goto cull_mlocked;
 			case SWAP_SUCCESS:
-				; /* try to free the page below */
+				/* try to free the page below */
+				if (!freeable)
+					break;
+				/*
+				 * A freeable anon page has no mapping
+				 * since we skipped the swapcache, so
+				 * free it here, not in __remove_mapping.
+				 */
+				VM_BUG_ON_PAGE(PageSwapCache(page), page);
+				if (!page_freeze_refs(page, 1))
+					goto keep_locked;
+				__clear_page_locked(page);
+				count_vm_event(PGLAZYFREED);
+				goto free_it;
 			}
 		}
 
@@ -1727,7 +1751,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 		}
 
 		if (page_referenced(page, 0, sc->target_mem_cgroup,
-				    &vm_flags)) {
+				    &vm_flags, NULL)) {
 			nr_rotated += hpage_nr_pages(page);
 			/*
 			 * Identify referenced, file-backed active pages and
diff --git a/mm/vmstat.c b/mm/vmstat.c
index eef6321c8470..da18337c6c66 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -794,6 +794,7 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pglazyfreed",
 
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
-- 
2.0.0


* [PATCH v12 2/8] x86: add pmd_[dirty|mkclean] for THP
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

MADV_FREE needs pmd_dirty and pmd_mkclean so it can detect whether
a THP page's contents have been overwritten since madvise(MADV_FREE)
was called on it.

This patch adds pmd_dirty and pmd_mkclean for THP MADV_FREE support.
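
For illustration only, a sketch of how the eventual THP consumer
(patch 8, not shown in this message) can mirror patch 1's pte loop
at pmd granularity once every architecture provides these helpers;
this is not the actual patch 8 code:

	/* inside a pmd walk, with the pmd lock held: */
	if (pmd_trans_huge(*pmd)) {
		pmd_t orig = pmdp_get_and_clear(mm, addr, pmd);

		orig = pmd_mkold(pmd_mkclean(orig));
		set_pmd_at(mm, addr, pmd, orig);
		/* reclaim later tests pmd_dirty() to pick discard vs. swap */
	}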

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/include/asm/pgtable.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0ec056012618..329865799653 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -104,6 +104,11 @@ static inline int pmd_young(pmd_t pmd)
 	return pmd_flags(pmd) & _PAGE_ACCESSED;
 }
 
+static inline int pmd_dirty(pmd_t pmd)
+{
+	return pmd_flags(pmd) & _PAGE_DIRTY;
+}
+
 static inline int pte_write(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_RW;
@@ -267,6 +272,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_RW);
-- 
2.0.0


* [PATCH v12 3/8] sparc: add pmd_[dirty|mkclean] for THP
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	David S. Miller, sparclinux

MADV_FREE needs pmd_dirty and pmd_mkclean so it can detect whether
a THP page's contents have been overwritten since madvise(MADV_FREE)
was called on it.

This patch adds pmd_dirty and pmd_mkclean for THP MADV_FREE support.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/sparc/include/asm/pgtable_64.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 3770bf5c6e1b..b80a309d7e00 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -666,6 +666,13 @@ static inline unsigned long pmd_young(pmd_t pmd)
 	return pte_young(pte);
 }
 
+static inline int pmd_dirty(pmd_t pmd)
+{
+	pte_t pte = __pte(pmd_val(pmd));
+
+	return pte_dirty(pte);
+}
+
 static inline unsigned long pmd_write(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));
@@ -723,6 +730,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	pte_t pte = __pte(pmd_val(pmd));
+
+	pte = pte_mkclean(pte);
+
+	return __pmd(pte_val(pte));
+}
+
 static inline pmd_t pmd_mkyoung(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));
-- 
2.0.0


* [PATCH v12 4/8] powerpc: add pmd_[dirty|mkclean] for THP
@ 2014-07-09  6:22   ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	Benjamin Herrenschmidt, Paul Mackerras, Aneesh Kumar K.V,
	linuxppc-dev

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents have been overwritten since the MADV_FREE syscall was
called on a THP page.

This patch adds pmd_dirty and pmd_mkclean to support MADV_FREE
on THP pages.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index eb9261024f51..c9a4bbe8e179 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 
 #define pmd_pfn(pmd)		pte_pfn(pmd_pte(pmd))
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread


* [PATCH v12 5/8] s390: add pmd_[dirty|mkclean] for THP
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	Martin Schwidefsky, Heiko Carstens, Dominik Dingel,
	Christian Borntraeger, linux-s390

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents have been overwritten since the MADV_FREE syscall was
called on a THP page. On s390, however, only the referenced bit
is available for pmds, because no free bit is left in the pmd
entry for a software dirty bit. As suggested by Martin, this
patch therefore adds a dumb pmd_dirty that always returns true.

A proper solution may be found in the future:
http://marc.info/?l=linux-api&m=140440328820808&w=2

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: linux-s390@vger.kernel.org
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/s390/include/asm/pgtable.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index fcba5e03839f..9862fcb0592b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1586,6 +1586,18 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pmd;
 }
 
+static inline int pmd_dirty(pmd_t pmd)
+{
+	/* No dirty bit in the segment table entry */
+	return 1;
+}
+
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	/* No dirty bit in the segment table entry */
+	return pmd;
+}
+
 #define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address, pmd_t *pmdp)
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread
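
One consequence of the dumb s390 stub is worth spelling out: with
pmd_dirty() hard-wired to 1, the pmd_freeable() helper introduced in
patch 8 can never report a THP pmd as freeable on s390.

/* From patch 8 (mm/huge_memory.c); with the s390 stub above,
 * pmd_dirty() is always 1, so this always returns 0 there. */
int pmd_freeable(pmd_t pmd)
{
	return !pmd_dirty(pmd);
}

So s390 THP pages are never treated as lazily freeable and keep being
swapped out as before: the stub is conservative rather than incorrect.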


* [PATCH v12 6/8] arm: add pmd_[dirty|mkclean] for THP
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	Catalin Marinas, Will Deacon, Steve Capper, Russell King,
	linux-arm-kernel

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents have been overwritten since the MADV_FREE syscall was
called on a THP page.

This patch adds pmd_dirty and pmd_mkclean to support MADV_FREE
on THP pages.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/arm/include/asm/pgtable-3level.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 85c60adc8b60..830f84f2d277 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -220,6 +220,8 @@ static inline pmd_t *pmd_offset(pud_t *pud, unsigned long addr)
 #define pmd_trans_splitting(pmd) (pmd_val(pmd) & PMD_SECT_SPLITTING)
 #endif
 
+#define pmd_dirty(pmd)		(pmd_val(pmd) & PMD_SECT_DIRTY)
+
 #define PMD_BIT_FUNC(fn,op) \
 static inline pmd_t pmd_##fn(pmd_t pmd) { pmd_val(pmd) op; return pmd; }
 
@@ -228,6 +230,7 @@ PMD_BIT_FUNC(mkold,	&= ~PMD_SECT_AF);
 PMD_BIT_FUNC(mksplitting, |= PMD_SECT_SPLITTING);
 PMD_BIT_FUNC(mkwrite,   &= ~PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkdirty,   |= PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean,   &= ~PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
 
 #define pmd_mkhuge(pmd)		(__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread
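
For readers unfamiliar with the PMD_BIT_FUNC generator used above, the
mkclean line expands to an ordinary bit-clearing helper, roughly:

/* Expansion of PMD_BIT_FUNC(mkclean, &= ~PMD_SECT_DIRTY): */
static inline pmd_t pmd_mkclean(pmd_t pmd)
{
	pmd_val(pmd) &= ~PMD_SECT_DIRTY;	/* clear the dirty bit */
	return pmd;
}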


* [PATCH v12 7/8] arm64: add pmd_[dirty|mkclean] for THP
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	Catalin Marinas, Will Deacon, Steve Capper, Russell King,
	linux-arm-kernel

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents have been overwritten since the MADV_FREE syscall was
called on a THP page.

This patch adds pmd_dirty and pmd_mkclean to support MADV_FREE
on THP pages.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/arm64/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 579702086488..f3ec01cef04f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -240,10 +240,12 @@ static inline pmd_t pte_pmd(pte_t pte)
 #endif
 
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mksplitting(pmd)	pte_pmd(pte_mkspecial(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mknotpresent(pmd)	(__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread
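
Unlike the arm variant, arm64 (like powerpc) reuses the existing pte
helpers by round-tripping through the pmd/pte conversion macros.
Written out step by step, the pmd_mkclean macro above is equivalent to:

static inline pmd_t pmd_mkclean(pmd_t pmd)
{
	pte_t pte = pmd_pte(pmd);	/* view the pmd as a pte      */

	pte = pte_mkclean(pte);		/* reuse the pte-level helper */
	return pte_pmd(pte);		/* and convert back to a pmd  */
}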


* [PATCH v12 8/8] mm: Don't split THP page when syscall is called
  2014-07-09  6:22 ` Minchan Kim
@ 2014-07-09  6:22   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-09  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov, Minchan Kim,
	Andrea Arcangeli

We don't need to split a THP page when the MADV_FREE syscall is
called; the split can be deferred until the VM actually frees the
page, so an unnecessary THP split is avoided.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/huge_mm.h |  4 ++++
 mm/huge_memory.c        | 35 +++++++++++++++++++++++++++++++++++
 mm/madvise.c            | 21 ++++++++++++++++++++-
 mm/rmap.c               |  8 ++++++--
 mm/vmscan.c             | 28 ++++++++++++++++++----------
 5 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 63579cb8d3dc..25a961256d9f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 					  unsigned long addr,
 					  pmd_t *pmd,
 					  unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, unsigned long addr);
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, unsigned long addr);
@@ -56,6 +59,7 @@ extern pmd_t *page_check_address_pmd(struct page *page,
 				     unsigned long address,
 				     enum page_check_address_pmd_flag flag,
 				     spinlock_t **ptl);
+extern int pmd_freeable(pmd_t pmd);
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
 #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5d562a9fe931..919c780015b8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1384,6 +1384,36 @@ out:
 	return 0;
 }
 
+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		 pmd_t *pmd, unsigned long addr)
+
+{
+	spinlock_t *ptl;
+	struct mm_struct *mm = tlb->mm;
+	int ret = 1;
+
+	if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+		struct page *page;
+		pmd_t orig_pmd;
+
+		orig_pmd = pmdp_get_and_clear(mm, addr, pmd);
+
+		/* No hugepage in swapcache */
+		page = pmd_page(orig_pmd);
+		VM_BUG_ON_PAGE(PageSwapCache(page), page);
+
+		orig_pmd = pmd_mkold(orig_pmd);
+		orig_pmd = pmd_mkclean(orig_pmd);
+
+		set_pmd_at(mm, addr, pmd, orig_pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+		spin_unlock(ptl);
+		ret = 0;
+	}
+
+	return ret;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
@@ -1620,6 +1650,11 @@ unlock:
 	return NULL;
 }
 
+int pmd_freeable(pmd_t pmd)
+{
+	return !pmd_dirty(pmd);
+}
+
 static int __split_huge_page_splitting(struct page *page,
 				       struct vm_area_struct *vma,
 				       unsigned long address)
diff --git a/mm/madvise.c b/mm/madvise.c
index 55b42e5e32a3..0927eba6d858 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *pte, ptent;
 	struct page *page;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd)) {
+		if (next - addr != HPAGE_PMD_SIZE) {
+#ifdef CONFIG_DEBUG_VM
+			if (!rwsem_is_locked(&mm->mmap_sem)) {
+				pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
+					__func__, addr, end,
+					vma->vm_start,
+					vma->vm_end);
+				BUG();
+			}
+#endif
+			split_huge_page_pmd(vma, addr, pmd);
+		} else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+			goto next;
+		/* fall through */
+	}
 
-	split_huge_page_pmd(vma, addr, pmd);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
@@ -312,6 +330,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
+next:
 	cond_resched();
 	return 0;
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 7ac6e059aba7..5a0cca9bc263 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -704,9 +704,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			referenced++;
 
 		/*
-		 * In this implmentation, MADV_FREE doesn't support THP free
+		 * Use pmd_freeable instead of raw pmd_dirty because on some
+		 * architectures pmd_dirty is not defined unless
+		 * CONFIG_TRANSPARENT_HUGEPAGE is enabled
 		 */
-		dirty++;
+		if (!pmd_freeable(*pmd))
+			dirty++;
+
 		spin_unlock(ptl);
 	} else {
 		pte_t *pte;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 914815a43362..995b28e3aee7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -971,17 +971,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 */
-		if (PageAnon(page) && !PageSwapCache(page) && !freeable) {
-			if (!(sc->gfp_mask & __GFP_IO))
-				goto keep_locked;
-			if (!add_to_swap(page, page_list))
-				goto activate_locked;
-			may_enter_fs = 1;
-
-			/* Adding to swap updated mapping */
-			mapping = page_mapping(page);
+		if (PageAnon(page) && !PageSwapCache(page)) {
+			if (!freeable) {
+				if (!(sc->gfp_mask & __GFP_IO))
+					goto keep_locked;
+				if (!add_to_swap(page, page_list))
+					goto activate_locked;
+				may_enter_fs = 1;
+				/* Adding to swap updated mapping */
+				mapping = page_mapping(page);
+			} else {
+				if (likely(!PageTransHuge(page)))
+					goto unmap;
+				/* try_to_unmap isn't aware of THP page */
+				if (unlikely(split_huge_page_to_list(page,
+								page_list)))
+					goto keep_locked;
+			}
 		}
-
+unmap:
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread
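
From userspace, the contract is the same whether or not this THP fast
path triggers. A minimal usage sketch, assuming <sys/mman.h> exposes
the MADV_FREE constant this series adds (cache_release is a
hypothetical helper, not part of the series):

#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocator-style usage: hand a freed buffer back to the kernel
 * lazily. Under memory pressure the kernel may discard the clean
 * pages instead of swapping them out; if the program stores to the
 * range again first, the dirty bit makes the pages survive.
 */
static void cache_release(void *buf, size_t len)
{
	/* An error (e.g. EINVAL on kernels without MADV_FREE) just
	 * means the memory stays resident, so it can be ignored. */
	madvise(buf, len, MADV_FREE);
}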


* Re: [PATCH v12 7/8] arm64: add pmd_[dirty|mkclean] for THP
  2014-07-09  6:22   ` Minchan Kim
@ 2014-07-09 11:14     ` Catalin Marinas
  -1 siblings, 0 replies; 36+ messages in thread
From: Catalin Marinas @ 2014-07-09 11:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Linux API, Hugh Dickins, Johannes Weiner, Rik van Riel,
	KOSAKI Motohiro, Mel Gorman, Jason Evans, Zhang Yanfei,
	Kirill A. Shutemov, Will Deacon, Steve Capper, Russell King,
	linux-arm-kernel

On Wed, Jul 09, 2014 at 07:22:28AM +0100, Minchan Kim wrote:
> MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
> contents have been overwritten since the MADV_FREE syscall was
> called on a THP page.
> 
> This patch adds pmd_dirty and pmd_mkclean to support MADV_FREE
> on THP pages.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Steve Capper <steve.capper@linaro.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH v12 0/8] MADV_FREE support
@ 2014-07-17  5:02   ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-17  5:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, Linux API, Hugh Dickins,
	Johannes Weiner, Rik van Riel, KOSAKI Motohiro, Mel Gorman,
	Jason Evans, Zhang Yanfei, Kirill A. Shutemov

Kirill, do you have any comments?

On Wed, Jul 09, 2014 at 03:22:21PM +0900, Minchan Kim wrote:
> This patch enable MADV_FREE hint for madvise syscall, which have
> been supported by other OSes. [PATCH 1] includes the details.
> 
> [1] support MADVISE_FREE for !THP page so if VM encounter
> THP page in syscall context, it splits THP page.
> [2-7] is to preparing to call madvise syscall without THP plitting
> [8] enable THP page support for MADV_FREE.
> 
> 
> * From v11
>  * Fix arm build - Steve
>  * Separate patch for arm and arm64 - Steve
>  * Remove unnecessary check - Kirill
>  * Skip non-vm_normal page - Kirill
>  * Add Acked-by - Zhang
>  * Sparc64 build fix
>  * Pagetable walker THP handling fix
> 
> * From v10
>  * Add Acked-by from arch stuff(x86, s390)
>  * Pagewalker based pagetable working - Kirill
>  * Fix try_to_unmap_one broken with hwpoison - Kirill
>  * Use VM_BUG_ON_PAGE in madvise_free_pmd - Kirill
>  * Fix pgtable-3level.h for arm - Steve
> 
> * From v9
>  * Add Acked-by - Rik
>  * Add THP page support - Kirill
> 
> * From v8
>  * Rebased-on v3.16-rc2-mmotm-2014-06-25-16-44
> 
> * From v7
>  * Rebased-on next-20140613
> 
> * From v6
>  * Remove page from swapcache in syscal time
>  * Move utility functions from memory.c to madvise.c - Johannes
>  * Rename untilify functtions - Johannes
>  * Remove unnecessary checks from vmscan.c - Johannes
>  * Rebased-on v3.15-rc5-mmotm-2014-05-16-16-56
>  * Drop Reviewe-by because there was some changes since then.
> 
> * From v5
>  * Fix PPC problem which don't flush TLB - Rik
>  * Remove unnecessary lazyfree_range stub function - Rik
>  * Rebased on v3.15-rc5
> 
> * From v4
>  * Add Reviewed-by: Zhang Yanfei
>  * Rebase on v3.15-rc1-mmotm-2014-04-15-16-14
> 
> * From v3
>  * Add "how to work part" in description - Zhang
>  * Add page_discardable utility function - Zhang
>  * Clean up
> 
> * From v2
>  * Remove forceful dirty marking of swap-readed page - Johannes
>  * Remove deactivation logic of lazyfreed page
>  * Rebased on 3.14
>  * Remove RFC tag
> 
> * From v1
>  * Use custom page table walker for madvise_free - Johannes
>  * Remove PG_lazypage flag - Johannes
>  * Do madvise_dontneed instead of madvise_freein swapless system
> 
> Minchan Kim (8):
>   [1] mm: support madvise(MADV_FREE)
>   [2] x86: add pmd_[dirty|mkclean] for THP
>   [3] sparc: add pmd_[dirty|mkclean] for THP
>   [4] powerpc: add pmd_[dirty|mkclean] for THP
>   [5] s390: add pmd_[dirty|mkclean] for THP
>   [6] arm: add pmd_[dirty|mkclean] for THP
>   [7] arm64: add pmd_[dirty|mkclean] for THP
>   [8] mm: Don't split THP page when syscall is called
> 
>  arch/arm/include/asm/pgtable-3level.h    |   3 +
>  arch/arm64/include/asm/pgtable.h         |   2 +
>  arch/powerpc/include/asm/pgtable-ppc64.h |   2 +
>  arch/s390/include/asm/pgtable.h          |  12 +++
>  arch/sparc/include/asm/pgtable_64.h      |  16 ++++
>  arch/x86/include/asm/pgtable.h           |  10 ++
>  include/linux/huge_mm.h                  |   4 +
>  include/linux/rmap.h                     |   9 +-
>  include/linux/vm_event_item.h            |   1 +
>  include/uapi/asm-generic/mman-common.h   |   1 +
>  mm/huge_memory.c                         |  35 +++++++
>  mm/madvise.c                             | 155 +++++++++++++++++++++++++++++++
>  mm/rmap.c                                |  46 ++++++++-
>  mm/vmscan.c                              |  64 +++++++++----
>  mm/vmstat.c                              |   1 +
>  15 files changed, 341 insertions(+), 20 deletions(-)
> 
> -- 
> 2.0.0
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH v12 1/8] mm: support madvise(MADV_FREE)
  2014-07-09  6:22   ` Minchan Kim
@ 2014-07-17 11:28     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-07-17 11:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Linux API, Hugh Dickins, Johannes Weiner, Rik van Riel,
	KOSAKI Motohiro, Mel Gorman, Jason Evans, Zhang Yanfei

On Wed, Jul 09, 2014 at 03:22:22PM +0900, Minchan Kim wrote:
> Linux doesn't have the ability to free pages lazily, while other
> OSes have already supported this via madvise(MADV_FREE).
> 
> The gain is clear: the kernel can discard freed pages rather than
> swap them out or OOM if memory pressure happens.
> 
> Without memory pressure, freed pages can be reused by userspace
> without additional overhead (e.g., page fault + allocation
> + zeroing).
> 
> How to work is following as.
> 
> When madvise syscall is called, VM clears dirty bit of ptes of
> the range. If memory pressure happens, VM checks dirty bit of
> page table and if it found still "clean", it means it's a
> "lazyfree pages" so VM could discard the page instead of swapping out.
> Once there was store operation for the page before VM peek a page
> to reclaim, dirty bit is set so VM can swap out the page instead of
> discarding.
> 
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)
> 
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                4
> On-line CPU(s) list:   0-3
> Thread(s) per core:    2
> Core(s) per socket:    2
> Socket(s):             1
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 42
> Stepping:              7
> CPU MHz:               2801.000
> BogoMIPS:              5581.64
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              4096K
> NUMA node0 CPU(s):     0-3
> 
> ebizzy benchmark(./ebizzy -S 10 -n 512)
> 
>  vanilla-jemalloc		MADV_free-jemalloc
> 
> 1 thread
> records:  10              records:  10
> avg:      7682.10         avg:      15306.10
> std:      62.35(0.81%)    std:      347.99(2.27%)
> max:      7770.00         max:      15622.00
> min:      7598.00         min:      14772.00
> 
> 2 thread
> records:  10              records:  10
> avg:      12747.50        avg:      24171.00
> std:      792.06(6.21%)   std:      895.18(3.70%)
> max:      13337.00        max:      26023.00
> min:      10535.00        min:      23152.00
> 
> 4 thread
> records:  10              records:  10
> avg:      16474.60        avg:      33717.90
> std:      1496.45(9.08%)  std:      2008.97(5.96%)
> max:      17877.00        max:      35958.00
> min:      12224.00        min:      29565.00
> 
> 8 thread
> records:  10              records:  10
> avg:      16778.50        avg:      33308.10
> std:      825.53(4.92%)   std:      1668.30(5.01%)
> max:      17543.00        max:      36010.00
> min:      14576.00        min:      29577.00
> 
> 16 thread
> records:  10              records:  10
> avg:      20614.40        avg:      35516.30
> std:      602.95(2.92%)   std:      1283.65(3.61%)
> max:      21753.00        max:      37178.00
> min:      19605.00        min:      33217.00
> 
> 32 thread
> records:  10              records:  10
> avg:      22771.70        avg:      36018.50
> std:      598.94(2.63%)   std:      1046.76(2.91%)
> max:      24035.00        max:      37266.00
> min:      22108.00        min:      34149.00
> 
> In summary, MADV_FREE is about 2 times faster than MADV_DONTNEED.
> 
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Linux API <linux-api@vger.kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jason Evans <je@fb.com>
> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Acked-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/linux/rmap.h                   |   9 ++-
>  include/linux/vm_event_item.h          |   1 +
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/madvise.c                           | 136 +++++++++++++++++++++++++++++++++
>  mm/rmap.c                              |  42 +++++++++-
>  mm/vmscan.c                            |  40 ++++++++--
>  mm/vmstat.c                            |   1 +
>  7 files changed, 218 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index be574506e6a9..0ba377b97a38 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -75,6 +75,7 @@ enum ttu_flags {
>  	TTU_UNMAP = 1,			/* unmap mode */
>  	TTU_MIGRATION = 2,		/* migration mode */
>  	TTU_MUNLOCK = 4,		/* munlock mode */
> +	TTU_FREE = 8,			/* free mode */
>  
>  	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
>  	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> @@ -181,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
>   * Called from mm/vmscan.c to handle paging out
>   */
>  int page_referenced(struct page *, int is_locked,
> -			struct mem_cgroup *memcg, unsigned long *vm_flags);
> +			struct mem_cgroup *memcg, unsigned long *vm_flags,
> +			int *is_dirty);
>  
>  #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
>  
> @@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
>  
>  static inline int page_referenced(struct page *page, int is_locked,
>  				  struct mem_cgroup *memcg,
> -				  unsigned long *vm_flags)
> +				  unsigned long *vm_flags,
> +				  int *is_pte_dirty)
>  {
>  	*vm_flags = 0;
> +	if (is_pte_dirty)
> +		*is_pte_dirty = 0;
>  	return 0;
>  }
>  
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index ced92345c963..e2d3fb1e9814 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
>  		PGFREE, PGACTIVATE, PGDEACTIVATE,
>  		PGFAULT, PGMAJFAULT,
> +		PGLAZYFREED,
>  		FOR_ALL_ZONES(PGREFILL),
>  		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
>  		FOR_ALL_ZONES(PGSTEAL_DIRECT),
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index ddc3b36f1046..7a94102b7a02 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -34,6 +34,7 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> +#define MADV_FREE	5		/* free pages only if memory pressure */
>  
>  /* common parameters: try to keep these consistent across architectures */
>  #define MADV_REMOVE	9		/* remove these pages & resources */
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0938b30da4ab..55b42e5e32a3 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -19,6 +19,14 @@
>  #include <linux/blkdev.h>
>  #include <linux/swap.h>
>  #include <linux/swapops.h>
> +#include <linux/mmu_notifier.h>
> +
> +#include <asm/tlb.h>
> +
> +struct madvise_free_private {
> +	struct vm_area_struct *vma;
> +	struct mmu_gather *tlb;
> +};
>  
>  /*
>   * Any behaviour which results in changes to the vma->vm_flags needs to
> @@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_REMOVE:
>  	case MADV_WILLNEED:
>  	case MADV_DONTNEED:
> +	case MADV_FREE:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -251,6 +260,124 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  	return 0;
>  }
>  
> +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +
> +{
> +	struct madvise_free_private *fp = walk->private;
> +	struct mmu_gather *tlb = fp->tlb;
> +	struct mm_struct *mm = tlb->mm;
> +	struct vm_area_struct *vma = fp->vma;
> +	spinlock_t *ptl;
> +	pte_t *pte, ptent;
> +	struct page *page;
> +
> +	split_huge_page_pmd(vma, addr, pmd);
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (PageSwapCache(page)) {
> +			if (trylock_page(page)) {
> +				if (try_to_free_swap(page))
> +					ClearPageDirty(page);
> +				unlock_page(page);

'continue' for !try_to_free_swap(page) case?

> +			} else
> +				continue;
> +		}
> +
> +		/*
> +		 * Some of architecture(ex, PPC) don't update TLB
> +		 * with set_pte_at and tlb_remove_tlb_entry so for
> +		 * the portability, remap the pte with old|clean
> +		 * after pte clearing.
> +		 */
> +		ptent = ptep_get_and_clear_full(mm, addr, pte,
> +						tlb->fullmm);
> +		ptent = pte_mkold(ptent);
> +		ptent = pte_mkclean(ptent);
> +		set_pte_at(mm, addr, pte, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);
> +	}
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +	return 0;
> +}
> +
> +static void madvise_free_page_range(struct mmu_gather *tlb,
> +			     struct vm_area_struct *vma,
> +			     unsigned long addr, unsigned long end)
> +{
> +	struct madvise_free_private fp = {
> +		.vma = vma,
> +		.tlb = tlb,
> +	};
> +
> +	struct mm_walk free_walk = {
> +		.pmd_entry = madvise_free_pte_range,
> +		.mm = vma->vm_mm,
> +		.private = &fp,
> +	};
> +
> +	BUG_ON(addr >= end);
> +	tlb_start_vma(tlb, vma);
> +	walk_page_range(addr, end, &free_walk);
> +	tlb_end_vma(tlb, vma);
> +}
> +
> +static int madvise_free_single_vma(struct vm_area_struct *vma,
> +			unsigned long start_addr, unsigned long end_addr)
> +{
> +	unsigned long start, end;
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_gather tlb;
> +
> +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))

THP uses VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE to filter out
'special' VMAs. Looks reasonable here too.

Otherwise:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

* Re: [PATCH v12 8/8] mm: Don't split THP page when syscall is called
  2014-07-09  6:22   ` Minchan Kim
@ 2014-07-17 11:30     ` Kirill A. Shutemov
  0 siblings, 0 replies; 36+ messages in thread
From: Kirill A. Shutemov @ 2014-07-17 11:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Linux API, Hugh Dickins, Johannes Weiner, Rik van Riel,
	KOSAKI Motohiro, Mel Gorman, Jason Evans, Zhang Yanfei,
	Andrea Arcangeli

On Wed, Jul 09, 2014 at 03:22:29PM +0900, Minchan Kim wrote:
> We don't need to split a THP page when the MADV_FREE syscall is
> called. The split can be done when the VM actually frees the page,
> so we avoid an unnecessary THP split.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

I would like to free THP without splitting at all, but this is good
enough for now.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
-- 
 Kirill A. Shutemov

* Re: [PATCH v12 1/8] mm: support madvise(MADV_FREE)
  2014-07-17 11:28     ` Kirill A. Shutemov
@ 2014-07-18  2:22       ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2014-07-18  2:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	Linux API, Hugh Dickins, Johannes Weiner, Rik van Riel,
	KOSAKI Motohiro, Mel Gorman, Jason Evans, Zhang Yanfei

On Thu, Jul 17, 2014 at 02:28:32PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 09, 2014 at 03:22:22PM +0900, Minchan Kim wrote:
> > Linux doesn't have the ability to free pages lazily, while other OSes
> > have long supported this via madvise(MADV_FREE).
> >
> > The gain is clear: the kernel can discard freed pages rather than
> > swapping them out or going OOM when memory pressure happens.
> >
> > Without memory pressure, freed pages can be reused by userspace
> > without additional overhead (e.g. page fault + allocation + zeroing).
> >
> > It works as follows.
> >
> > When the madvise syscall is called, the VM clears the dirty bit of
> > the ptes in the range. When memory pressure happens, the VM checks
> > the dirty bit in the page table; if the pte is still "clean", the
> > page is a "lazyfree" page, so the VM can discard it instead of
> > swapping it out. If a store hits the page before the VM picks it
> > for reclaim, the dirty bit is set again, so the VM swaps the page
> > out instead of discarding it.
> >
> > The first heavy users would be general-purpose allocators (e.g.
> > jemalloc, tcmalloc, and hopefully glibc); jemalloc and tcmalloc
> > already support the feature on other OSes (e.g. FreeBSD).
> > 
> > barrios@blaptop:~/benchmark/ebizzy$ lscpu
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                4
> > On-line CPU(s) list:   0-3
> > Thread(s) per core:    2
> > Core(s) per socket:    2
> > Socket(s):             1
> > NUMA node(s):          1
> > Vendor ID:             GenuineIntel
> > CPU family:            6
> > Model:                 42
> > Stepping:              7
> > CPU MHz:               2801.000
> > BogoMIPS:              5581.64
> > Virtualization:        VT-x
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              4096K
> > NUMA node0 CPU(s):     0-3
> > 
> > ebizzy benchmark(./ebizzy -S 10 -n 512)
> > 
> >  vanilla-jemalloc		MADV_free-jemalloc
> > 
> > 1 thread
> > records:  10              records:  10
> > avg:      7682.10         avg:      15306.10
> > std:      62.35(0.81%)    std:      347.99(2.27%)
> > max:      7770.00         max:      15622.00
> > min:      7598.00         min:      14772.00
> > 
> > 2 thread
> > records:  10              records:  10
> > avg:      12747.50        avg:      24171.00
> > std:      792.06(6.21%)   std:      895.18(3.70%)
> > max:      13337.00        max:      26023.00
> > min:      10535.00        min:      23152.00
> > 
> > 4 thread
> > records:  10              records:  10
> > avg:      16474.60        avg:      33717.90
> > std:      1496.45(9.08%)  std:      2008.97(5.96%)
> > max:      17877.00        max:      35958.00
> > min:      12224.00        min:      29565.00
> > 
> > 8 thread
> > records:  10              records:  10
> > avg:      16778.50        avg:      33308.10
> > std:      825.53(4.92%)   std:      1668.30(5.01%)
> > max:      17543.00        max:      36010.00
> > min:      14576.00        min:      29577.00
> > 
> > 16 thread
> > records:  10              records:  10
> > avg:      20614.40        avg:      35516.30
> > std:      602.95(2.92%)   std:      1283.65(3.61%)
> > max:      21753.00        max:      37178.00
> > min:      19605.00        min:      33217.00
> > 
> > 32 thread
> > records:  10              records:  10
> > avg:      22771.70        avg:      36018.50
> > std:      598.94(2.63%)   std:      1046.76(2.91%)
> > max:      24035.00        max:      37266.00
> > min:      22108.00        min:      34149.00
> > 
> > In summary, MADV_FREE is about 2 times faster than MADV_DONTNEED.
> > 
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Linux API <linux-api@vger.kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Jason Evans <je@fb.com>
> > Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  include/linux/rmap.h                   |   9 ++-
> >  include/linux/vm_event_item.h          |   1 +
> >  include/uapi/asm-generic/mman-common.h |   1 +
> >  mm/madvise.c                           | 136 +++++++++++++++++++++++++++++++++
> >  mm/rmap.c                              |  42 +++++++++-
> >  mm/vmscan.c                            |  40 ++++++++--
> >  mm/vmstat.c                            |   1 +
> >  7 files changed, 218 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index be574506e6a9..0ba377b97a38 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -75,6 +75,7 @@ enum ttu_flags {
> >  	TTU_UNMAP = 1,			/* unmap mode */
> >  	TTU_MIGRATION = 2,		/* migration mode */
> >  	TTU_MUNLOCK = 4,		/* munlock mode */
> > +	TTU_FREE = 8,			/* free mode */
> >  
> >  	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
> >  	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
> > @@ -181,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
> >   * Called from mm/vmscan.c to handle paging out
> >   */
> >  int page_referenced(struct page *, int is_locked,
> > -			struct mem_cgroup *memcg, unsigned long *vm_flags);
> > +			struct mem_cgroup *memcg, unsigned long *vm_flags,
> > +			int *is_dirty);
> >  
> >  #define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
> >  
> > @@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
> >  
> >  static inline int page_referenced(struct page *page, int is_locked,
> >  				  struct mem_cgroup *memcg,
> > -				  unsigned long *vm_flags)
> > +				  unsigned long *vm_flags,
> > +				  int *is_pte_dirty)
> >  {
> >  	*vm_flags = 0;
> > +	if (is_pte_dirty)
> > +		*is_pte_dirty = 0;
> >  	return 0;
> >  }
> >  
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index ced92345c963..e2d3fb1e9814 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		FOR_ALL_ZONES(PGALLOC),
> >  		PGFREE, PGACTIVATE, PGDEACTIVATE,
> >  		PGFAULT, PGMAJFAULT,
> > +		PGLAZYFREED,
> >  		FOR_ALL_ZONES(PGREFILL),
> >  		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
> >  		FOR_ALL_ZONES(PGSTEAL_DIRECT),
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index ddc3b36f1046..7a94102b7a02 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -34,6 +34,7 @@
> >  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
> >  #define MADV_WILLNEED	3		/* will need these pages */
> >  #define MADV_DONTNEED	4		/* don't need these pages */
> > +#define MADV_FREE	5		/* free pages only if memory pressure */
> >  
> >  /* common parameters: try to keep these consistent across architectures */
> >  #define MADV_REMOVE	9		/* remove these pages & resources */
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0938b30da4ab..55b42e5e32a3 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -19,6 +19,14 @@
> >  #include <linux/blkdev.h>
> >  #include <linux/swap.h>
> >  #include <linux/swapops.h>
> > +#include <linux/mmu_notifier.h>
> > +
> > +#include <asm/tlb.h>
> > +
> > +struct madvise_free_private {
> > +	struct vm_area_struct *vma;
> > +	struct mmu_gather *tlb;
> > +};
> >  
> >  /*
> >   * Any behaviour which results in changes to the vma->vm_flags needs to
> > @@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_REMOVE:
> >  	case MADV_WILLNEED:
> >  	case MADV_DONTNEED:
> > +	case MADV_FREE:
> >  		return 0;
> >  	default:
> >  		/* be safe, default to 1. list exceptions explicitly */
> > @@ -251,6 +260,124 @@ static long madvise_willneed(struct vm_area_struct *vma,
> >  	return 0;
> >  }
> >  
> > +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +
> > +{
> > +	struct madvise_free_private *fp = walk->private;
> > +	struct mmu_gather *tlb = fp->tlb;
> > +	struct mm_struct *mm = tlb->mm;
> > +	struct vm_area_struct *vma = fp->vma;
> > +	spinlock_t *ptl;
> > +	pte_t *pte, ptent;
> > +	struct page *page;
> > +
> > +	split_huge_page_pmd(vma, addr, pmd);
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +	arch_enter_lazy_mmu_mode();
> > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (PageSwapCache(page)) {
> > +			if (trylock_page(page)) {
> > +				if (try_to_free_swap(page))
> > +					ClearPageDirty(page);
> > +				unlock_page(page);
> 
> 'continue' for !try_to_free_swap(page) case?

Good catch.

> 
> > +			} else
> > +				continue;
> > +		}
> > +
> > +		/*
> > +		 * Some of architecture(ex, PPC) don't update TLB
> > +		 * with set_pte_at and tlb_remove_tlb_entry so for
> > +		 * the portability, remap the pte with old|clean
> > +		 * after pte clearing.
> > +		 */
> > +		ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +						tlb->fullmm);
> > +		ptent = pte_mkold(ptent);
> > +		ptent = pte_mkclean(ptent);
> > +		set_pte_at(mm, addr, pte, ptent);
> > +		tlb_remove_tlb_entry(tlb, pte, addr);
> > +	}
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(pte - 1, ptl);
> > +	cond_resched();
> > +	return 0;
> > +}
> > +
> > +static void madvise_free_page_range(struct mmu_gather *tlb,
> > +			     struct vm_area_struct *vma,
> > +			     unsigned long addr, unsigned long end)
> > +{
> > +	struct madvise_free_private fp = {
> > +		.vma = vma,
> > +		.tlb = tlb,
> > +	};
> > +
> > +	struct mm_walk free_walk = {
> > +		.pmd_entry = madvise_free_pte_range,
> > +		.mm = vma->vm_mm,
> > +		.private = &fp,
> > +	};
> > +
> > +	BUG_ON(addr >= end);
> > +	tlb_start_vma(tlb, vma);
> > +	walk_page_range(addr, end, &free_walk);
> > +	tlb_end_vma(tlb, vma);
> > +}
> > +
> > +static int madvise_free_single_vma(struct vm_area_struct *vma,
> > +			unsigned long start_addr, unsigned long end_addr)
> > +{
> > +	unsigned long start, end;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_gather tlb;
> > +
> > +	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> 
> THP uses VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE to filter out
> 'special' VMAs. Looks reasonable here too.
> 
> Otherwise:
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Thanks, Kirill!

> 
> -- 
>  Kirill A. Shutemov
> 

-- 
Kind regards,
Minchan Kim

Thread overview:
2014-07-09  6:22 [PATCH v12 0/8] MADV_FREE support Minchan Kim
2014-07-09  6:22 ` [PATCH v12 1/8] mm: support madvise(MADV_FREE) Minchan Kim
2014-07-17 11:28   ` Kirill A. Shutemov
2014-07-18  2:22     ` Minchan Kim
2014-07-09  6:22 ` [PATCH v12 2/8] x86: add pmd_[dirty|mkclean] for THP Minchan Kim
2014-07-09  6:22 ` [PATCH v12 3/8] sparc: " Minchan Kim
2014-07-09  6:22 ` [PATCH v12 4/8] powerpc: " Minchan Kim
2014-07-09  6:22 ` [PATCH v12 5/8] s390: " Minchan Kim
2014-07-09  6:22 ` [PATCH v12 6/8] arm: " Minchan Kim
2014-07-09  6:22 ` [PATCH v12 7/8] arm64: " Minchan Kim
2014-07-09 11:14   ` Catalin Marinas
2014-07-09  6:22 ` [PATCH v12 8/8] mm: Don't split THP page when syscall is called Minchan Kim
2014-07-17 11:30   ` Kirill A. Shutemov
2014-07-17  5:02 ` [PATCH v12 0/8] MADV_FREE support Minchan Kim