* [PATCH v5 00/12] MADV_FREE support
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

For v4, Andrew wanted to settle the basic MADV_FREE first and introduce
the new features (i.e., lazyfree LRU, swapless support, and lazyfreeness)
later, so this version doesn't include them.

I have tested it on mmotm-2015-11-25-17-08 with an additional
patch [1] from Kirill that prevents a BUG_ON; he hasn't sent it to
linux-mm as a formal patch yet. With it, I haven't found any
problems so far.

Note that this version is based on the THP refcount redesign, so
I needed some modifications to MADV_FREE: split_huge_pmd no longer
splits a THP page, and pmd_trans_huge(pmd) is not enough to
guarantee that the page is not a THP page.
Also, for MADV_FREE lazy-split, a THP split should respect the
pmd's dirtiness rather than marking the ptes of all subpages dirty
unconditionally. Please review the last patch in this patchset.

	mm: don't split THP page when syscall is called

[1] https://lkml.org/lkml/2015/11/17/134

git: git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
branch: mm/madv_free-v4.4-rc2-mmotm-2015-11-25-17-08-v5r2

At this stage, I don't think we need to write a man page.
That can be done once the policy and implementation are solid.

 * Changes from v4
   * drop lazyfree LRU
   * drop swapless support
   * drop lazyfreeness
   * rebase on recent mmotm with THP refcount redesign

 * Changes from v3
   * some bug fixes
   * code refactoring
   * lazyfree reclaim logic changes
   * patch reordering

 * Changes from v2
   * vm_lazyfreeness tuning knob
   * add new LRU list - Johannes, Shaohua
   * support swapless - Johannes

 * Changes from v1
   * Don't do unnecessary TLB flush - Shaohua
   * Added Acked-by - Hugh, Michal
   * Merge deactivate_page and deactivate_file_page
   * Add pmd_dirty/pmd_mkclean patches for several arches
   * Add lazy THP split patch
   * Drop zhangyanfei@cn.fujitsu.com - Delivery Failure

Chen Gang (1):
  arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
    architectures

Minchan Kim (11):
  mm: support madvise(MADV_FREE)
  mm: define MADV_FREE for some arches
  mm: free swp_entry in madvise_free
  mm: move lazily freed pages to inactive list
  mm: mark stable page dirty in KSM
  x86: add pmd_[dirty|mkclean] for THP
  sparc: add pmd_[dirty|mkclean] for THP
  powerpc: add pmd_[dirty|mkclean] for THP
  arm: add pmd_mkclean for THP
  arm64: add pmd_mkclean for THP
  mm: don't split THP page when syscall is called

 arch/alpha/include/uapi/asm/mman.h       |   2 +
 arch/arm/include/asm/pgtable-3level.h    |   1 +
 arch/arm64/include/asm/pgtable.h         |   1 +
 arch/mips/include/uapi/asm/mman.h        |   2 +
 arch/parisc/include/uapi/asm/mman.h      |   2 +
 arch/powerpc/include/asm/pgtable-ppc64.h |   2 +
 arch/sparc/include/asm/pgtable_64.h      |   9 ++
 arch/x86/include/asm/pgtable.h           |   5 +
 arch/xtensa/include/uapi/asm/mman.h      |   2 +
 include/linux/huge_mm.h                  |   3 +
 include/linux/rmap.h                     |   1 +
 include/linux/swap.h                     |   1 +
 include/linux/vm_event_item.h            |   1 +
 include/uapi/asm-generic/mman-common.h   |   1 +
 mm/huge_memory.c                         |  87 +++++++++++++-
 mm/ksm.c                                 |   6 +
 mm/madvise.c                             | 199 +++++++++++++++++++++++++++++++
 mm/rmap.c                                |   8 ++
 mm/swap.c                                |  44 +++++++
 mm/swap_state.c                          |   5 +-
 mm/vmscan.c                              |  10 +-
 mm/vmstat.c                              |   1 +
 22 files changed, 383 insertions(+), 10 deletions(-)

-- 
1.9.1



* [PATCH v5 01/12] mm: support madvise(MADV_FREE)
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim,
	Michal Hocko

Linux lacks the ability to free pages lazily, while other OSes have
supported this for a long time via madvise(MADV_FREE).

The gain is clear: under memory pressure, the kernel can discard freed
pages rather than swapping them out or triggering the OOM killer.

Without memory pressure, freed pages can be reused by userspace without
any additional overhead (e.g., page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit in the
ptes of the range. When memory pressure happens, the VM checks the dirty
bit in the page table; if it is still "clean", the page is a "lazyfree"
page, so the VM can discard it instead of swapping it out. If there was a
store to the page before the VM picked it for reclaim, the dirty bit is
set, so the VM swaps the page out instead of discarding it.
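
For illustration only (not part of the patch), here is a minimal userspace
sketch of how an allocator could use the flag. The fallback define of 8 is
an assumption matching the value this series settles on in the last arch
patch; on a kernel with this series, the macro comes from <sys/mman.h>.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_FREE
    #define MADV_FREE 8     /* assumed value; see the mman.h patches below */
    #endif

    int main(void)
    {
            size_t len = 64 * 4096;
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;
            memset(buf, 0xaa, len);         /* dirty the pages */

            /*
             * Mark the range disposable: under memory pressure the kernel
             * may discard the clean pages instead of swapping them out.
             * A later store sets the dirty bit again and the data is kept;
             * after a discard, a fresh zero page is faulted in on reuse.
             */
            if (madvise(buf, len, MADV_FREE) != 0)
                    perror("madvise(MADV_FREE)");

            munmap(buf, len);
            return 0;
    }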

One thing to note is that MADV_FREE fundamentally relies on the dirty bit
in the page table entry to decide whether the VM is allowed to discard the
page. IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, if, for example, a swap-in happens via a read fault, the page
table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
discard the page.

To avoid this problem, MADV_FREE does additional checks on PageDirty and
PageSwapCache. This works because a swapped-in page lives in the swap
cache, and once it is evicted from the swap cache it carries the PG_dirty
flag. Together, these page-flag checks effectively prevent wrong discards
by MADV_FREE.

The problem with the above logic, however, is that a swapped-in page keeps
PG_dirty after it is removed from the swap cache, so the VM can no longer
consider the page freeable even if madvise_free is called on it later.

See the example below for details.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

To solve this problem, this patch clears PG_dirty only if the page is owned
exclusively by the current process when madvise is called, because PG_dirty
represents the ptes' dirtiness across several processes, so we may clear it
only when we own the page exclusively.

The first heavy users will be general-purpose allocators (e.g., jemalloc,
tcmalloc, and hopefully glibc), and jemalloc/tcmalloc already support the
feature on other OSes (e.g., FreeBSD).

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 2
Stepping:              3
CPU MHz:               3200.185
BogoMIPS:              6400.53
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

 vanilla-jemalloc		MADV_free-jemalloc

1 thread
records: 10			    records: 10
avg:	2961.90			    avg:   12069.70
std:	  71.96(2.43%)		    std:     186.68(1.55%)
max:	3070.00			    max:   12385.00
min:	2796.00			    min:   11746.00

2 thread
records: 10			    records: 10
avg:	5020.00			    avg:   17827.00
std:	 264.87(5.28%)		    std:     358.52(2.01%)
max:	5244.00			    max:   18760.00
min:	4251.00			    min:   17382.00

4 thread
records: 10			    records: 10
avg:	8988.80			    avg:   27930.80
std:	1175.33(13.08%)		    std:    3317.33(11.88%)
max:	9508.00			    max:   30879.00
min:	5477.00			    min:   21024.00

8 thread
records: 10			    records: 10
avg:   13036.50			    avg:   33739.40
std:	 170.67(1.31%)		    std:    5146.22(15.25%)
max:   13371.00			    max:   40572.00
min:   12785.00			    min:   24088.00

16 thread
records: 10			    records: 10
avg:   11092.40			    avg:   31424.20
std:	 710.60(6.41%)		    std:    3763.89(11.98%)
max:   12446.00			    max:   36635.00
min:	9949.00			    min:   25669.00

32 thread
records: 10			    records: 10
avg:   11067.00			    avg:   34495.80
std:	 971.06(8.77%)		    std:    2721.36(7.89%)
max:   12010.00			    max:   38598.00
min:	9002.00			    min:   30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h                   |   1 +
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 168 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |   8 ++
 mm/swap_state.c                        |   5 +-
 mm/vmscan.c                            |  10 +-
 mm/vmstat.c                            |   1 +
 8 files changed, 190 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 77d1ba57d495..04d2aec64e57 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_LZFREE = 8,			/* lazy free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e1f8c993e73b..67c1dbd19c6d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index a74dd84bbb6d..0e821e3c3d45 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -39,6 +39,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..e2fe2e26f449 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
 #include <linux/backing-dev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,161 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+	pte_t *orig_pte, *pte, ptent;
+	struct page *page;
+
+	split_huge_pmd(vma, pmd, addr);
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		/*
+		 * If pmd isn't transhuge but the page is THP and
+		 * is owned by only this process, split it and
+		 * deactivate all pages.
+		 */
+		if (PageTransCompound(page)) {
+			if (page_mapcount(page) != 1)
+				goto out;
+			get_page(page);
+			if (!trylock_page(page)) {
+				put_page(page);
+				goto out;
+			}
+			pte_unmap_unlock(orig_pte, ptl);
+			if (split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				goto out;
+			}
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte--;
+			addr -= PAGE_SIZE;
+			continue;
+		}
+
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+		if (PageSwapCache(page) || PageDirty(page)) {
+			if (!trylock_page(page))
+				continue;
+			/*
+			 * If page is shared with others, we couldn't clear
+			 * PG_dirty of the page.
+			 */
+			if (page_mapcount(page) != 1) {
+				unlock_page(page);
+				continue;
+			}
+
+			if (PageSwapCache(page) && !try_to_free_swap(page)) {
+				unlock_page(page);
+				continue;
+			}
+
+			ClearPageDirty(page);
+			unlock_page(page);
+		}
+
+		if (pte_young(ptent) || pte_dirty(ptent)) {
+			/*
+			 * Some of architecture(ex, PPC) don't update TLB
+			 * with set_pte_at and tlb_remove_tlb_entry so for
+			 * the portability, remap the pte with old|clean
+			 * after pte clearing.
+			 */
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+
+			ptent = pte_mkold(ptent);
+			ptent = pte_mkclean(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+	}
+out:
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+	return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk free_walk = {
+		.pmd_entry = madvise_free_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &free_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	unsigned long start, end;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	/* MADV_FREE works for only anon vma at the moment */
+	if (!vma_is_anonymous(vma))
+		return -EINVAL;
+
+	start = max(vma->vm_start, start_addr);
+	if (start >= vma->vm_end)
+		return -EINVAL;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start, end);
+	update_hiwater_rss(mm);
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+
+	return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	return madvise_free_single_vma(vma, start, end);
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -379,6 +538,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on swapless system or full swap.
+		 */
+		if (get_nr_swap_pages() > 0)
+			return madvise_free(vma, prev, start, end);
+		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
@@ -398,6 +565,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index 6f371261dd12..321b633ee559 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1508,6 +1508,13 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See handle_pte_fault() ...
 		 */
 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+
+		if (!PageDirty(page) && (flags & TTU_LZFREE)) {
+			/* It's a freeable page by MADV_FREE */
+			dec_mm_counter(mm, MM_ANONPAGES);
+			goto discard;
+		}
+
 		if (swap_duplicate(entry) < 0) {
 			set_pte_at(mm, address, pte, pteval);
 			ret = SWAP_FAIL;
@@ -1528,6 +1535,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	} else
 		dec_mm_counter(mm, mm_counter_file(page));
 
+discard:
 	page_remove_rmap(page, PageHuge(page));
 	page_cache_release(page);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d783872d746c..676ff2991380 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	 * deadlock in the swap out path.
 	 */
 	/*
-	 * Add it to the swap cache and mark it dirty
+	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
-	if (!err) {	/* Success */
-		SetPageDirty(page);
+	if (!err) {
 		return 1;
 	} else {	/* -ENOMEM radix-tree allocation failure */
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..c2f69445190c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -908,6 +908,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool lazyfree = false;
 
 		cond_resched();
 
@@ -1051,6 +1052,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
+			lazyfree = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1062,8 +1064,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page,
-					ttu_flags|TTU_BATCH_FLUSH)) {
+			switch (try_to_unmap(page, lazyfree ?
+				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
+				(ttu_flags | TTU_BATCH_FLUSH))) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1188,6 +1191,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__ClearPageLocked(page);
 free_it:
+		if (lazyfree && !PageDirty(page))
+			count_vm_event(PGLAZYFREED);
+
 		nr_reclaimed++;
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d13cd8eebf70..38929dc79c3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -781,6 +781,7 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pglazyfreed",
 
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
-- 
1.9.1



* [PATCH v5 02/12] mm: define MADV_FREE for some arches
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim,
	Richard Henderson, Ivan Kokshaysky, James E.J. Bottomley,
	Helge Deller, Ralf Baechle, Chris Zankel, Max Filippov,
	kbuild test robot

Most architectures use asm-generic, but alpha, mips, parisc, and xtensa
need their own definitions.

This patch defines MADV_FREE for them, so it should fix the build breakage
on those architectures.

Maybe I should split this up and feed the pieces to the arch maintainers,
but it is included here for mmotm convenience.
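
As an aside (not part of the patch): because the numeric value differs
across architectures at this point in the series (5 in asm-generic, mips,
and xtensa; 7 on alpha; 8 on parisc), portable userspace should use the
macro from the installed headers rather than a hard-coded number. A small
hedged sketch:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Lazily free a range only if the installed headers define MADV_FREE. */
    static int lazy_free(void *p, size_t len)
    {
    #ifdef MADV_FREE
            /* The macro hides the per-arch value (5, 7, or 8 here). */
            return madvise(p, len, MADV_FREE);
    #else
            return -1;      /* headers predate MADV_FREE */
    #endif
    }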

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/alpha/include/uapi/asm/mman.h  | 1 +
 arch/mips/include/uapi/asm/mman.h   | 1 +
 arch/parisc/include/uapi/asm/mman.h | 1 +
 arch/xtensa/include/uapi/asm/mman.h | 1 +
 4 files changed, 4 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index f2f949671798..d828beb5e69b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -47,6 +47,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* free pages only if memory pressure */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 97c03f468924..a6f8daff8e3b 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -73,6 +73,7 @@
 #define MADV_SEQUENTIAL 2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index dd4d1876a020..bda94f0d0b94 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -43,6 +43,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 360944e1da52..83c5150b06f9 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -86,6 +86,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
-- 
1.9.1



* [PATCH v5 03/12] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Chen Gang, rth,
	ink, mattst88, Ralf Baechle, jejb, deller, chris, jcmvbkbc,
	Arnd Bergmann, linux-arch, sparclinux, roland, darrick.wong,
	davem, Minchan Kim

From: Chen Gang <gang.chen.5i5j@gmail.com>

For uapi, we should try to give all macros the same value across
architectures; MADV_FREE was recently added to the main branch, so we
need to redefine its value accordingly.

At present, '8' can be shared by all architectures, so redefine it to
'8'.
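
A trivial sanity check, for illustration only, that prints the value
userspace sees once this patch is applied (it should be 8 everywhere):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
    #ifdef MADV_FREE
            /* With this patch applied, every architecture prints 8. */
            printf("MADV_FREE = %d\n", MADV_FREE);
    #else
            puts("MADV_FREE not defined by these headers");
    #endif
            return 0;
    }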

Cc: rth@twiddle.net <rth@twiddle.net>,
Cc: ink@jurassic.park.msu.ru <ink@jurassic.park.msu.ru>
Cc: mattst88@gmail.com <mattst88@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: jejb@parisc-linux.org <jejb@parisc-linux.org>
Cc: deller@gmx.de <deller@gmx.de>
Cc: chris@zankel.net <chris@zankel.net>
Cc: jcmvbkbc@gmail.com <jcmvbkbc@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: roland@kernel.org
Cc: darrick.wong@oracle.com
Cc: davem@davemloft.net
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
---
 arch/alpha/include/uapi/asm/mman.h     | 2 +-
 arch/mips/include/uapi/asm/mman.h      | 2 +-
 arch/parisc/include/uapi/asm/mman.h    | 2 +-
 arch/xtensa/include/uapi/asm/mman.h    | 2 +-
 include/uapi/asm-generic/mman-common.h | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index d828beb5e69b..ab336c06153e 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -50,6 +50,6 @@
-#define MADV_FREE	7		/* free pages only if memory pressure */
 
 /* common/generic parameters */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index a6f8daff8e3b..b0ebe59f73fd 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -76,6 +76,6 @@
-#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index bda94f0d0b94..cf830d465f75 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -46,6 +46,6 @@
-#define MADV_FREE	8		/* free pages only if memory pressure */
 
 /* common/generic parameters */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 83c5150b06f9..d030594ed22b 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -89,6 +89,6 @@
-#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 0e821e3c3d45..58274382a616 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -39,9 +39,9 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
-#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 04/12] mm: free swp_entry in madvise_free
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim,
	Michal Hocko

When I tested the below piece of code with 12 processes (i.e., 512M *
12 = 6G consumed) on my machine (3G RAM + 12 CPUs + 8G swap),
madvise_free was significantly slower (i.e., about 2x) than
madvise_dontneed.

char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
	       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);	/* SZ = 512M */
int loop = 5;

while (loop--) {
        memset(p, 1, SZ);
        madvise(p, SZ, MADV_FREE);	/* or MADV_DONTNEED */
}

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find the hinted pages were already swapped out when the syscall
is called, it's pointless to keep the swapped-out pages in the pte.
Instead, let's free the cold pages, because a swap-in is more expensive
than (page allocation + zeroing).

With this patch, swapin was reduced from 879,585 to 1,878, and the
elapsed times were:

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index e2fe2e26f449..8de3d9a636c9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -270,6 +270,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte, *pte, ptent;
 	struct page *page;
+	int nr_swap = 0;
 
 	split_huge_pmd(vma, pmd, addr);
 	if (pmd_trans_unstable(pmd))
@@ -280,8 +281,24 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (!pte_present(ptent))
+		if (pte_none(ptent))
 			continue;
+		/*
+		 * If the pte has swp_entry, just clear page table to
+		 * prevent swap-in which is more expensive rather than
+		 * (page allocation + zeroing).
+		 */
+		if (!pte_present(ptent)) {
+			swp_entry_t entry;
+
+			entry = pte_to_swp_entry(ptent);
+			if (non_swap_entry(entry))
+				continue;
+			nr_swap--;
+			free_swap_and_cache(entry);
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			continue;
+		}
 
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
@@ -353,6 +370,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		}
 	}
 out:
+	if (nr_swap) {
+		if (current->mm == mm)
+			sync_mm_rss(mm);
+
+		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 05/12] mm: move lazily freed pages to inactive list
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

MADV_FREE is a hint that it's okay to discard pages under memory
pressure, and we rely on the reclaimers (i.e., kswapd and direct
reclaim) to free them, so there is no value in keeping such pages on
the active anonymous LRU. This patch moves them to the head of the
inactive LRU list.

This means that MADV_FREE-ed pages which were living on the inactive
list are reclaimed first, because they are more likely to be cold than
recently activated pages.

An arguable issue with the approach is whether we should put the page
at the head or the tail of the inactive list.  I chose the head because
the kernel cannot be sure a page is really cold or warm for every
MADV_FREE use case, but at least we know it's not *hot*; landing at the
head of the inactive list is a compromise for the various use cases.

This fixes the suboptimal behavior of MADV_FREE where pages living on
the active list sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily.  That basically breaks
the whole purpose of using MADV_FREE: helping the system free memory
which might not be used again.
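
To make the intended lifecycle concrete, here is a minimal user-space
sketch of the usage this tuning targets (my illustration, not part of
the patch; assumes a libc exposing MADV_FREE):

#include <string.h>
#include <sys/mman.h>

/* free path: keep the mapping; reclaim may discard under pressure */
static void arena_free(void *p, size_t len)
{
	madvise(p, len, MADV_FREE);
}

/* reuse path: writing dirties the pages again, cancelling the hint */
static void arena_reuse(void *p, size_t len)
{
	memset(p, 0, len);
}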

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  1 +
 mm/madvise.c         |  2 ++
 mm/swap.c            | 44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 457181844b6e..d08feef3d047 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 8de3d9a636c9..975e24e4c134 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -366,6 +366,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			ptent = pte_mkold(ptent);
 			ptent = pte_mkclean(ptent);
 			set_pte_at(mm, addr, pte, ptent);
+			if (PageActive(page))
+				deactivate_page(page);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 	}
diff --git a/mm/swap.c b/mm/swap.c
index abffc33bb975..674e2c93da4e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,6 +45,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -554,6 +555,24 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		ClearPageReferenced(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_event(PGDEACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -580,6 +599,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	activate_page_drain(cpu);
 }
 
@@ -609,6 +632,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/**
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -638,6 +681,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 06/12] mm: mark stable page dirty in KSM
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

[hughd: adjusted comments]
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/ksm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 30cb0f753e19..5e967536c38e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1015,6 +1015,12 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 			 */
 			set_page_stable_node(page, NULL);
 			mark_page_accessed(page);
+			/*
+			 * Page reclaim just frees a clean page with no dirty
+			 * ptes: make sure that the ksm page would be swapped.
+			 */
+			if (!PageDirty(page))
+				SetPageDirty(page);
 			err = 0;
 		} else if (pages_identical(page, kpage))
 			err = replace_page(vma, page, kpage, orig_pte);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 07/12] x86: add pmd_[dirty|mkclean] for THP
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

MADV_FREE needs pmd_dirty and pmd_mkclean to detect a recent overwrite
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean to support MADV_FREE on THP pages; x86
already provides pmd_dirty.
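
For context, the huge-pmd MADV_FREE path added later in this series
uses these helpers roughly as below (a condensed sketch, not the full
function):

	/* mark the THP lazily freeable: clear young and dirty on the pmd */
	orig_pmd = pmd_mkold(orig_pmd);
	orig_pmd = pmd_mkclean(orig_pmd);
	set_pmd_at(mm, addr, pmd, orig_pmd);
	tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

Reclaim later tests pmd_dirty() to see whether the contents were
overwritten after the hint.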

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/x86/include/asm/pgtable.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a8d1aa3a43b0..9ff592003afd 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -269,6 +269,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
 	return pmd_clear_flags(pmd, _PAGE_ACCESSED);
 }
 
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
 static inline pmd_t pmd_wrprotect(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_RW);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 08/12] sparc: add pmd_[dirty|mkclean] for THP
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

MADV_FREE needs pmd_dirty and pmd_mkclean to detect a recent overwrite
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean to support MADV_FREE on THP pages; sparc
already provides pmd_dirty.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/sparc/include/asm/pgtable_64.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index f5bfcd66aeb5..7a38d6a576c5 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -710,6 +710,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return __pmd(pte_val(pte));
 }
 
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	pte_t pte = __pte(pmd_val(pmd));
+
+	pte = pte_mkclean(pte);
+
+	return __pmd(pte_val(pte));
+}
+
 static inline pmd_t pmd_mkyoung(pmd_t pmd)
 {
 	pte_t pte = __pte(pmd_val(pmd));
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 09/12] powerpc: add pmd_[dirty|mkclean] for THP
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

MADV_FREE needs pmd_dirty and pmd_mkclean to detect a recent overwrite
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean to support MADV_FREE on THP pages; ppc64
already provides pmd_dirty.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/powerpc/include/asm/pgtable-ppc64.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 0db2a3f8e554..21d961bbac0e 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -502,9 +502,10 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_pfn(pmd)		pte_pfn(pmd_pte(pmd))
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
 #define pmd_young(pmd)		pte_young(pmd_pte(pmd))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)	pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 10/12] arm: add pmd_mkclean for THP
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

MADV_FREE needs pmd_dirty and pmd_mkclean to detect a recent overwrite
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean to support MADV_FREE on THP pages.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/arm/include/asm/pgtable-3level.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 59d1457ca551..dc46398bc3a5 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -240,6 +240,7 @@ PMD_BIT_FUNC(wrprotect,	|= L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkold,	&= ~PMD_SECT_AF);
 PMD_BIT_FUNC(mkwrite,   &= ~L_PMD_SECT_RDONLY);
 PMD_BIT_FUNC(mkdirty,   |= L_PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean,   &= ~L_PMD_SECT_DIRTY);
 PMD_BIT_FUNC(mkyoung,   |= PMD_SECT_AF);
 
 #define pmd_mkhuge(pmd)		(__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 11/12] arm64: add pmd_mkclean for THP
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim

MADV_FREE needs pmd_dirty and pmd_mkclean to detect a recent overwrite
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean to support MADV_FREE on THP pages.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/arm64/include/asm/pgtable.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d2a1879b466b..fab3ddb30df7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -340,6 +340,7 @@ static inline pgprot_t mk_sect_prot(pgprot_t prot)
 #define pmd_wrprotect(pmd)	pte_pmd(pte_wrprotect(pmd_pte(pmd)))
 #define pmd_mkold(pmd)		pte_pmd(pte_mkold(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)	pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd)       pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkdirty(pmd)	pte_pmd(pte_mkdirty(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)	pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mknotpresent(pmd)	(__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 12/12] mm: don't split THP page when syscall is called
  2015-11-30  6:39 ` Minchan Kim
@ 2015-11-30  6:39   ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Minchan Kim,
	Andrea Arcangeli

We don't need to split a THP page when the MADV_FREE syscall is called
if [start, len] is aligned with the THP size.  The split can instead be
done when the VM decides to free the page in the reclaim path under
heavy memory pressure.  With that, we avoid unnecessary THP splits.

For this, the patch changes the pte dirtiness marking logic of THP.
Currently, splitting marks every pte of the subpages dirty
unconditionally, which makes MADV_FREE void.  Instead, this patch
propagates pmd dirtiness to all subpages via PG_dirty and restores pte
dirtiness from PG_dirty.  With this, if the pmd is clean (i.e.
MADV_FREEed) when the split happens (e.g. in shrink_page_list), all of
the subpages are clean too, so we can discard them.
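
In sketch form (mirroring the hunks below), the dirtiness now flows
through PG_dirty instead of being forced into every pte:

	/* __split_huge_pmd_locked(): capture pmd dirtiness once */
	dirty = pmd_dirty(*pmd);
	...
	if (dirty)
		SetPageDirty(page + i);		/* mark each subpage */

	/* unfreeze_page_vma(): restore pte dirtiness from PG_dirty */
	if (PageDirty(page))
		entry = pte_mkdirty(entry);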

Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/huge_mm.h |  3 ++
 mm/huge_memory.c        | 87 ++++++++++++++++++++++++++++++++++++++++++++++---
 mm/madvise.c            |  8 ++++-
 3 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 72cd942edb22..0160201993d4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 					  unsigned long addr,
 					  pmd_t *pmd,
 					  unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+			struct vm_area_struct *vma,
+			pmd_t *pmd, unsigned long addr, unsigned long next);
 extern int zap_huge_pmd(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b41793b12a2d..2aa28cbe7263 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1530,6 +1530,77 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		pmd_t *pmd, unsigned long addr, unsigned long next)
+
+{
+	spinlock_t *ptl;
+	pmd_t orig_pmd;
+	struct page *page;
+	struct mm_struct *mm = tlb->mm;
+	int ret = 0;
+
+	if (!pmd_trans_huge_lock(pmd, vma, &ptl))
+		goto out;
+
+	orig_pmd = *pmd;
+	if (is_huge_zero_pmd(orig_pmd)) {
+		ret = 1;
+		goto out;
+	}
+
+	page = pmd_page(orig_pmd);
+	/*
+	 * If other processes are mapping this page, we cannot discard
+	 * the page unless they all call MADV_FREE, so skip it.
+	 */
+	if (page_mapcount(page) != 1)
+		goto out;
+
+	if (!trylock_page(page))
+		goto out;
+
+	/*
+	 * If the user wants to discard part of the THP, split it so
+	 * MADV_FREE will deactivate only those pages.
+	 */
+	if (next - addr != HPAGE_PMD_SIZE) {
+		get_page(page);
+		spin_unlock(ptl);
+		if (split_huge_page(page)) {
+			put_page(page);
+			unlock_page(page);
+			goto out_unlocked;
+		}
+		put_page(page);
+		unlock_page(page);
+		ret = 1;
+		goto out_unlocked;
+	}
+
+	if (PageDirty(page))
+		ClearPageDirty(page);
+	unlock_page(page);
+
+	if (PageActive(page))
+		deactivate_page(page);
+
+	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
+		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+			tlb->fullmm);
+		orig_pmd = pmd_mkold(orig_pmd);
+		orig_pmd = pmd_mkclean(orig_pmd);
+
+		set_pmd_at(mm, addr, pmd, orig_pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+	}
+	ret = 1;
+out:
+	spin_unlock(ptl);
+out_unlocked:
+	return ret;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
@@ -2784,7 +2855,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t _pmd;
-	bool young, write;
+	bool young, write, dirty;
 	int i;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
@@ -2808,6 +2879,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	atomic_add(HPAGE_PMD_NR - 1, &page->_count);
 	write = pmd_write(*pmd);
 	young = pmd_young(*pmd);
+	dirty = pmd_dirty(*pmd);
 
 	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
 	pmd_populate(mm, &_pmd, pgtable);
@@ -2825,12 +2897,14 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			entry = swp_entry_to_pte(swp_entry);
 		} else {
 			entry = mk_pte(page + i, vma->vm_page_prot);
-			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+			entry = maybe_mkwrite(entry, vma);
 			if (!write)
 				entry = pte_wrprotect(entry);
 			if (!young)
 				entry = pte_mkold(entry);
 		}
+		if (dirty)
+			SetPageDirty(page + i);
 		pte = pte_offset_map(&_pmd, haddr);
 		BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
@@ -3028,6 +3102,8 @@ static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
 			continue;
 		flush_cache_page(vma, address, page_to_pfn(page));
 		entry = ptep_clear_flush(vma, address, pte + i);
+		if (pte_dirty(entry))
+			SetPageDirty(page);
 		swp_entry = make_migration_entry(page, pte_write(entry));
 		swp_pte = swp_entry_to_pte(swp_entry);
 		if (pte_soft_dirty(entry))
@@ -3086,7 +3162,8 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 		page_add_anon_rmap(page, vma, address, false);
 
 		entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
-		entry = pte_mkdirty(entry);
+		if (PageDirty(page))
+			entry = pte_mkdirty(entry);
 		if (is_write_migration_entry(swp_entry))
 			entry = maybe_mkwrite(entry, vma);
 
@@ -3147,8 +3224,8 @@ static int __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
 			 (1L << PG_locked) |
-			 (1L << PG_unevictable)));
-	page_tail->flags |= (1L << PG_dirty);
+			 (1L << PG_unevictable) |
+			 (1L << PG_dirty)));
 
 	/*
 	 * After clearing PageTail the gup refcount can be released.
diff --git a/mm/madvise.c b/mm/madvise.c
index 975e24e4c134..563d6c145d75 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte, *pte, ptent;
 	struct page *page;
 	int nr_swap = 0;
+	unsigned long next;
+
+	next = pmd_addr_end(addr, end);
+	if (pmd_trans_huge(*pmd))
+		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
+			goto next;
 
-	split_huge_pmd(vma, pmd, addr);
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
@@ -381,6 +386,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
+next:
 	return 0;
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 01/12] mm: support madvise(MADV_FREE)
  2015-11-30  6:39   ` Minchan Kim
@ 2015-11-30  8:20     ` Mika Penttilä
  -1 siblings, 0 replies; 48+ messages in thread
From: Mika Penttilä @ 2015-11-30  8:20 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski, Michal Hocko

> +		 * If pmd isn't transhuge but the page is THP and
> +		 * is owned by only this process, split it and
> +		 * deactivate all pages.
> +		 */
> +		if (PageTransCompound(page)) {
> +			if (page_mapcount(page) != 1)
> +				goto out;
> +			get_page(page);
> +			if (!trylock_page(page)) {
> +				put_page(page);
> +				goto out;
> +			}
> +			pte_unmap_unlock(orig_pte, ptl);
> +			if (split_huge_page(page)) {
> +				unlock_page(page);
> +				put_page(page);
> +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> +				goto out;
> +			}
> +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +			pte--;
> +			addr -= PAGE_SIZE;
> +			continue;
> +		}

Looks like this leaks a page count if split_huge_page() is successful
(returns zero).
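
(The resend later in this thread fixes this by dropping the reference
and the page lock on the success path as well:

			put_page(page);
			unlock_page(page);
			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

see Minchan's follow-up below.)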

--Mika


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 01/12] mm: support madvise(MADV_FREE)
  2015-11-30  8:20     ` Mika Penttilä
@ 2015-11-30  9:22       ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  9:22 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Jason Evans, Daniel Micay,
	Kirill A. Shutemov, Shaohua Li, Michal Hocko, yalin.wang2010,
	Andy Lutomirski, Michal Hocko

On Mon, Nov 30, 2015 at 10:20:25AM +0200, Mika Penttilä wrote:
> > +		 * If pmd isn't transhuge but the page is THP and
> > +		 * is owned by only this process, split it and
> > +		 * deactivate all pages.
> > +		 */
> > +		if (PageTransCompound(page)) {
> > +			if (page_mapcount(page) != 1)
> > +				goto out;
> > +			get_page(page);
> > +			if (!trylock_page(page)) {
> > +				put_page(page);
> > +				goto out;
> > +			}
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			if (split_huge_page(page)) {
> > +				unlock_page(page);
> > +				put_page(page);
> > +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +				goto out;
> > +			}
> > +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +			pte--;
> > +			addr -= PAGE_SIZE;
> > +			continue;
> > +		}
> 
> Looks like this leaks a page count if split_huge_page() is successful
> (returns zero).

Right, and I even missed an unlock_page.
Thanks for the review!

From d22483fae454b100bcf73d514dd7d903fd84f744 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 30 Oct 2015 16:01:37 +0900
Subject: [PATCH v5 01/12] mm: support madvise(MADV_FREE)

Linux doesn't have the ability to free pages lazily, while other OSes
have long supported it via madvise(MADV_FREE).

The gain is clear: the kernel can discard freed pages rather than
swapping them out or OOMing when memory pressure happens.

Without memory pressure, freed pages can be reused by userspace without
additional overhead (e.g. page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit of the
ptes in the range.  When memory pressure happens, the VM checks the
dirty bit in the page table; if it is still clean, the page is a
"lazyfree" page, so the VM can discard it instead of swapping it out.
If a store to the page happened before the VM picked it for reclaim,
the dirty bit is set, so the VM swaps the page out instead of
discarding it.
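
A minimal userspace sketch of the intended usage (illustrative only,
not part of this patch; error handling omitted, and MADV_FREE is the
value this series defines, valid only for private anonymous mappings):

	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_FREE
	#define MADV_FREE 5		/* value added by this series */
	#endif

	int main(void)
	{
		size_t len = 64 * 4096;
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 0xaa, len);		/* ptes become dirty */
		madvise(p, len, MADV_FREE);	/* ptes cleaned: discardable */
		p[0] = 1;	/* a store re-dirties, so this page survives */
		return 0;
	}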

One thing we should note is that MADV_FREE basically relies on the
dirty bit in the page table entry to decide whether the VM is allowed
to discard the page or not.  IOW, if the page table entry has the dirty
bit set, the VM shouldn't discard the page.

However, if, for example, a swap-in by read fault happens, the page
table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
discard the page.

To avoid that problem, MADV_FREE adds checks on PageDirty and
PageSwapCache.  This works because a swapped-in page lives in the swap
cache, and once it is evicted from the swap cache, the page carries the
PG_dirty flag.  Together, the two page-flag checks effectively prevent
wrong discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page keeps
PG_dirty after it is removed from the swap cache, so the VM can no
longer consider the page freeable, even if madvise_free is called on it
later.

See the example below for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. At this point, the VM cannot discard the page because it
    .. still has *PG_dirty*

To solve the problem, this patch clears PG_dirty when madvise is
called, but only if the page is owned exclusively by the current
process, because PG_dirty represents pte dirtiness across several
processes, so we can clear it only when we own the page exclusively.

The first heavy users will be general-purpose allocators (e.g.
jemalloc, tcmalloc, and hopefully glibc), and jemalloc/tcmalloc already
support the feature on other OSes (e.g. FreeBSD).

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 2
Stepping:              3
CPU MHz:               3200.185
BogoMIPS:              6400.53
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

 vanilla-jemalloc		MADV_free-jemalloc

1 thread
records: 10			    records: 10
avg:	2961.90			    avg:   12069.70
std:	  71.96(2.43%)		    std:     186.68(1.55%)
max:	3070.00			    max:   12385.00
min:	2796.00			    min:   11746.00

2 thread
records: 10			    records: 10
avg:	5020.00			    avg:   17827.00
std:	 264.87(5.28%)		    std:     358.52(2.01%)
max:	5244.00			    max:   18760.00
min:	4251.00			    min:   17382.00

4 thread
records: 10			    records: 10
avg:	8988.80			    avg:   27930.80
std:	1175.33(13.08%)		    std:    3317.33(11.88%)
max:	9508.00			    max:   30879.00
min:	5477.00			    min:   21024.00

8 thread
records: 10			    records: 10
avg:   13036.50			    avg:   33739.40
std:	 170.67(1.31%)		    std:    5146.22(15.25%)
max:   13371.00			    max:   40572.00
min:   12785.00			    min:   24088.00

16 thread
records: 10			    records: 10
avg:   11092.40			    avg:   31424.20
std:	 710.60(6.41%)		    std:    3763.89(11.98%)
max:   12446.00			    max:   36635.00
min:	9949.00			    min:   25669.00

32 thread
records: 10			    records: 10
avg:   11067.00			    avg:   34495.80
std:	 971.06(8.77%)		    std:    2721.36(7.89%)
max:   12010.00			    max:   38598.00
min:	9002.00			    min:   30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h                   |   1 +
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 170 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |   8 ++
 mm/swap_state.c                        |   5 +-
 mm/vmscan.c                            |  10 +-
 mm/vmstat.c                            |   1 +
 8 files changed, 192 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 77d1ba57d495..04d2aec64e57 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_LZFREE = 8,			/* lazy free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e1f8c993e73b..67c1dbd19c6d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index a74dd84bbb6d..0e821e3c3d45 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -39,6 +39,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..ed137fde4459 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
 #include <linux/backing-dev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,163 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+	pte_t *orig_pte, *pte, ptent;
+	struct page *page;
+
+	split_huge_pmd(vma, pmd, addr);
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		/*
+		 * If pmd isn't transhuge but the page is THP and
+		 * is owned by only this process, split it and
+		 * deactivate all pages.
+		 */
+		if (PageTransCompound(page)) {
+			if (page_mapcount(page) != 1)
+				goto out;
+			get_page(page);
+			if (!trylock_page(page)) {
+				put_page(page);
+				goto out;
+			}
+			pte_unmap_unlock(orig_pte, ptl);
+			if (split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				goto out;
+			}
+			put_page(page);
+			unlock_page(page);
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte--;
+			addr -= PAGE_SIZE;
+			continue;
+		}
+
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+		if (PageSwapCache(page) || PageDirty(page)) {
+			if (!trylock_page(page))
+				continue;
+			/*
+			 * If the page is shared with others, we cannot clear
+			 * its PG_dirty bit.
+			 */
+			if (page_mapcount(page) != 1) {
+				unlock_page(page);
+				continue;
+			}
+
+			if (PageSwapCache(page) && !try_to_free_swap(page)) {
+				unlock_page(page);
+				continue;
+			}
+
+			ClearPageDirty(page);
+			unlock_page(page);
+		}
+
+		if (pte_young(ptent) || pte_dirty(ptent)) {
+			/*
+			 * Some architectures (e.g. PPC) don't update the TLB
+			 * with set_pte_at and tlb_remove_tlb_entry, so for
+			 * portability, remap the pte as old|clean after
+			 * clearing it.
+			 */
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+
+			ptent = pte_mkold(ptent);
+			ptent = pte_mkclean(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+	}
+out:
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+	return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk free_walk = {
+		.pmd_entry = madvise_free_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &free_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	unsigned long start, end;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	/* MADV_FREE works for only anon vma at the moment */
+	if (!vma_is_anonymous(vma))
+		return -EINVAL;
+
+	start = max(vma->vm_start, start_addr);
+	if (start >= vma->vm_end)
+		return -EINVAL;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start, end);
+	update_hiwater_rss(mm);
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+
+	return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	return madvise_free_single_vma(vma, start, end);
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -379,6 +540,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on a swapless system or when swap is full.
+		 */
+		if (get_nr_swap_pages() > 0)
+			return madvise_free(vma, prev, start, end);
+		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
@@ -398,6 +567,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index 6f371261dd12..321b633ee559 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1508,6 +1508,13 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See handle_pte_fault() ...
 		 */
 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+
+		if (!PageDirty(page) && (flags & TTU_LZFREE)) {
+			/* It's a freeable page by MADV_FREE */
+			dec_mm_counter(mm, MM_ANONPAGES);
+			goto discard;
+		}
+
 		if (swap_duplicate(entry) < 0) {
 			set_pte_at(mm, address, pte, pteval);
 			ret = SWAP_FAIL;
@@ -1528,6 +1535,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	} else
 		dec_mm_counter(mm, mm_counter_file(page));
 
+discard:
 	page_remove_rmap(page, PageHuge(page));
 	page_cache_release(page);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d783872d746c..676ff2991380 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	 * deadlock in the swap out path.
 	 */
 	/*
-	 * Add it to the swap cache and mark it dirty
+	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
-	if (!err) {	/* Success */
-		SetPageDirty(page);
+	if (!err) {
 		return 1;
 	} else {	/* -ENOMEM radix-tree allocation failure */
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..c2f69445190c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -908,6 +908,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool lazyfree = false;
 
 		cond_resched();
 
@@ -1051,6 +1052,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
+			lazyfree = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1062,8 +1064,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page,
-					ttu_flags|TTU_BATCH_FLUSH)) {
+			switch (try_to_unmap(page, lazyfree ?
+				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
+				(ttu_flags | TTU_BATCH_FLUSH))) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1188,6 +1191,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__ClearPageLocked(page);
 free_it:
+		if (lazyfree && !PageDirty(page))
+			count_vm_event(PGLAZYFREED);
+
 		nr_reclaimed++;
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d13cd8eebf70..38929dc79c3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -781,6 +781,7 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pglazyfreed",
 
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
-- 
1.9.1
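
Once this lands, lazy-free discards are visible from userspace via the
new vmstat counter, e.g. (illustrative output):

	$ grep pglazyfreed /proc/vmstat
	pglazyfreed 0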


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 01/12] mm: support madvise(MADV_FREE)
@ 2015-11-30  9:22       ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  9:22 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Jason Evans, Daniel Micay,
	Kirill A. Shutemov, Shaohua Li, Michal Hocko, yalin.wang2010,
	Andy Lutomirski, Michal Hocko

On Mon, Nov 30, 2015 at 10:20:25AM +0200, Mika Penttilä wrote:
> > +		 * If pmd isn't transhuge but the page is THP and
> > +		 * is owned by only this process, split it and
> > +		 * deactivate all pages.
> > +		 */
> > +		if (PageTransCompound(page)) {
> > +			if (page_mapcount(page) != 1)
> > +				goto out;
> > +			get_page(page);
> > +			if (!trylock_page(page)) {
> > +				put_page(page);
> > +				goto out;
> > +			}
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			if (split_huge_page(page)) {
> > +				unlock_page(page);
> > +				put_page(page);
> > +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +				goto out;
> > +			}
> > +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +			pte--;
> > +			addr -= PAGE_SIZE;
> > +			continue;
> > +		}
> 
> looks like this leaks page count if split_huge_page() is succesfull
> (returns zero).

Even, I missed unlock_page.
Thanks for the review!

>From d22483fae454b100bcf73d514dd7d903fd84f744 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 30 Oct 2015 16:01:37 +0900
Subject: [PATCH v5 01/12] mm: support madvise(MADV_FREE)

Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than swapping
out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace without
another additional overhead(ex, page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When madvise syscall is called, VM clears dirty bit of ptes of the range.
If memory pressure happens, VM checks dirty bit of page table and if it
found still "clean", it means it's a "lazyfree pages" so VM could discard
the page instead of swapping out.  Once there was store operation for the
page before VM peek a page to reclaim, dirty bit is set so VM can swap out
the page instead of discarding.

One thing we should notice is that basically, MADV_FREE relies on dirty bit
in page table entry to decide whether VM allows to discard the page or not.
IOW, if page table entry includes marked dirty bit, VM shouldn't discard
the page.

However, as a example, if swap-in by read fault happens, page table entry
doesn't have dirty bit so MADV_FREE could discard the page wrongly.

For avoiding the problem, MADV_FREE did more checks with PageDirty
and PageSwapCache. It worked out because swapped-in page lives on
swap cache and since it is evicted from the swap cache, the page has
PG_dirty flag. So both page flags check effectively prevent
wrong discarding by MADV_FREE.

However, a problem in above logic is that swapped-in page has
PG_dirty still after they are removed from swap cache so VM cannot
consider the page as freeable any more even if madvise_free is
called in future.

Look at below example for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

To solve the problem, this patch clears PG_dirty if only the page is owned
exclusively by current process when madvise is called because PG_dirty
represents ptes's dirtiness in several processes so we could clear it only
if we own it exclusively.

Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 2
Stepping:              3
CPU MHz:               3200.185
BogoMIPS:              6400.53
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-11
ebizzy benchmark(./ebizzy -S 10 -n 512)

Higher avg is better.

 vanilla-jemalloc		MADV_free-jemalloc

1 thread
records: 10			    records: 10
avg:	2961.90			    avg:   12069.70
std:	  71.96(2.43%)		    std:     186.68(1.55%)
max:	3070.00			    max:   12385.00
min:	2796.00			    min:   11746.00

2 thread
records: 10			    records: 10
avg:	5020.00			    avg:   17827.00
std:	 264.87(5.28%)		    std:     358.52(2.01%)
max:	5244.00			    max:   18760.00
min:	4251.00			    min:   17382.00

4 thread
records: 10			    records: 10
avg:	8988.80			    avg:   27930.80
std:	1175.33(13.08%)		    std:    3317.33(11.88%)
max:	9508.00			    max:   30879.00
min:	5477.00			    min:   21024.00

8 thread
records: 10			    records: 10
avg:   13036.50			    avg:   33739.40
std:	 170.67(1.31%)		    std:    5146.22(15.25%)
max:   13371.00			    max:   40572.00
min:   12785.00			    min:   24088.00

16 thread
records: 10			    records: 10
avg:   11092.40			    avg:   31424.20
std:	 710.60(6.41%)		    std:    3763.89(11.98%)
max:   12446.00			    max:   36635.00
min:	9949.00			    min:   25669.00

32 thread
records: 10			    records: 10
avg:   11067.00			    avg:   34495.80
std:	 971.06(8.77%)		    std:    2721.36(7.89%)
max:   12010.00			    max:   38598.00
min:	9002.00			    min:   30636.00

In summary, MADV_FREE is about much faster than MADV_DONTNEED.

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h                   |   1 +
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 170 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |   8 ++
 mm/swap_state.c                        |   5 +-
 mm/vmscan.c                            |  10 +-
 mm/vmstat.c                            |   1 +
 8 files changed, 192 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 77d1ba57d495..04d2aec64e57 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_LZFREE = 8,			/* lazy free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e1f8c993e73b..67c1dbd19c6d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index a74dd84bbb6d..0e821e3c3d45 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -39,6 +39,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..ed137fde4459 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
 #include <linux/backing-dev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,163 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+	pte_t *orig_pte, *pte, ptent;
+	struct page *page;
+
+	split_huge_pmd(vma, pmd, addr);
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		/*
+		 * If pmd isn't transhuge but the page is THP and
+		 * is owned by only this process, split it and
+		 * deactivate all pages.
+		 */
+		if (PageTransCompound(page)) {
+			if (page_mapcount(page) != 1)
+				goto out;
+			get_page(page);
+			if (!trylock_page(page)) {
+				put_page(page);
+				goto out;
+			}
+			pte_unmap_unlock(orig_pte, ptl);
+			if (split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				goto out;
+			}
+			put_page(page);
+			unlock_page(page);
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			pte--;
+			addr -= PAGE_SIZE;
+			continue;
+		}
+
+		VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+		if (PageSwapCache(page) || PageDirty(page)) {
+			if (!trylock_page(page))
+				continue;
+			/*
+			 * If page is shared with others, we couldn't clear
+			 * PG_dirty of the page.
+			 */
+			if (page_mapcount(page) != 1) {
+				unlock_page(page);
+				continue;
+			}
+
+			if (PageSwapCache(page) && !try_to_free_swap(page)) {
+				unlock_page(page);
+				continue;
+			}
+
+			ClearPageDirty(page);
+			unlock_page(page);
+		}
+
+		if (pte_young(ptent) || pte_dirty(ptent)) {
+			/*
+			 * Some of architecture(ex, PPC) don't update TLB
+			 * with set_pte_at and tlb_remove_tlb_entry so for
+			 * the portability, remap the pte with old|clean
+			 * after pte clearing.
+			 */
+			ptent = ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+
+			ptent = pte_mkold(ptent);
+			ptent = pte_mkclean(ptent);
+			set_pte_at(mm, addr, pte, ptent);
+			tlb_remove_tlb_entry(tlb, pte, addr);
+		}
+	}
+out:
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_pte, ptl);
+	cond_resched();
+	return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk free_walk = {
+		.pmd_entry = madvise_free_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &free_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	unsigned long start, end;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	/* MADV_FREE works for only anon vma at the moment */
+	if (!vma_is_anonymous(vma))
+		return -EINVAL;
+
+	start = max(vma->vm_start, start_addr);
+	if (start >= vma->vm_end)
+		return -EINVAL;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start, end);
+	update_hiwater_rss(mm);
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+
+	return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	return madvise_free_single_vma(vma, start, end);
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -379,6 +540,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on swapless system or full swap.
+		 */
+		if (get_nr_swap_pages() > 0)
+			return madvise_free(vma, prev, start, end);
+		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
@@ -398,6 +567,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index 6f371261dd12..321b633ee559 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1508,6 +1508,13 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * See handle_pte_fault() ...
 		 */
 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+
+		if (!PageDirty(page) && (flags & TTU_LZFREE)) {
+			/* It's a freeable page by MADV_FREE */
+			dec_mm_counter(mm, MM_ANONPAGES);
+			goto discard;
+		}
+
 		if (swap_duplicate(entry) < 0) {
 			set_pte_at(mm, address, pte, pteval);
 			ret = SWAP_FAIL;
@@ -1528,6 +1535,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	} else
 		dec_mm_counter(mm, mm_counter_file(page));
 
+discard:
 	page_remove_rmap(page, PageHuge(page));
 	page_cache_release(page);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d783872d746c..676ff2991380 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	 * deadlock in the swap out path.
 	 */
 	/*
-	 * Add it to the swap cache and mark it dirty
+	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
-	if (!err) {	/* Success */
-		SetPageDirty(page);
+	if (!err) {
 		return 1;
 	} else {	/* -ENOMEM radix-tree allocation failure */
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..c2f69445190c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -908,6 +908,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool lazyfree = false;
 
 		cond_resched();
 
@@ -1051,6 +1052,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
+			lazyfree = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1062,8 +1064,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page,
-					ttu_flags|TTU_BATCH_FLUSH)) {
+			switch (try_to_unmap(page, lazyfree ?
+				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
+				(ttu_flags | TTU_BATCH_FLUSH))) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1188,6 +1191,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__ClearPageLocked(page);
 free_it:
+		if (lazyfree && !PageDirty(page))
+			count_vm_event(PGLAZYFREED);
+
 		nr_reclaimed++;
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d13cd8eebf70..38929dc79c3d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -781,6 +781,7 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pglazyfreed",
 
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 01/12] mm: support madvise(MADV_FREE)
@ 2015-11-30  9:22       ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2015-11-30  9:22 UTC (permalink / raw)
  To: Mika Penttilä
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, Jason Evans, Daniel Micay,
	Kirill A. Shutemov, Shaohua Li, Michal Hocko, yalin.wang2010,
	Andy Lutomirski, Michal Hocko

On Mon, Nov 30, 2015 at 10:20:25AM +0200, Mika Penttilä wrote:
> > +		 * If pmd isn't transhuge but the page is THP and
> > +		 * is owned by only this process, split it and
> > +		 * deactivate all pages.
> > +		 */
> > +		if (PageTransCompound(page)) {
> > +			if (page_mapcount(page) != 1)
> > +				goto out;
> > +			get_page(page);
> > +			if (!trylock_page(page)) {
> > +				put_page(page);
> > +				goto out;
> > +			}
> > +			pte_unmap_unlock(orig_pte, ptl);
> > +			if (split_huge_page(page)) {
> > +				unlock_page(page);
> > +				put_page(page);
> > +				pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +				goto out;
> > +			}
> > +			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +			pte--;
> > +			addr -= PAGE_SIZE;
> > +			continue;
> > +		}
> 
> looks like this leaks the page count if split_huge_page() is successful
> (returns zero).

Even worse, I missed unlock_page as well.
Thanks for the review!
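
For readers following along, a sketch of the block with both fixes folded
in: the success path releases the page lock and the extra reference taken
above, mirroring the failure path. This illustrates the fix under
discussion, not the final committed hunk:

		if (PageTransCompound(page)) {
			if (page_mapcount(page) != 1)
				goto out;
			get_page(page);
			if (!trylock_page(page)) {
				put_page(page);
				goto out;
			}
			pte_unmap_unlock(orig_pte, ptl);
			if (split_huge_page(page)) {
				/* split failed: undo the lock and ref,
				 * retake the pte lock, and bail out */
				unlock_page(page);
				put_page(page);
				pte_offset_map_lock(mm, pmd, addr, &ptl);
				goto out;
			}
			/* split succeeded: the lock and the reference
			 * taken above must be dropped here as well,
			 * or both leak */
			unlock_page(page);
			put_page(page);
			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
			pte--;
			addr -= PAGE_SIZE;
			continue;
		}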

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/12] MADV_FREE support
  2015-11-30  6:39 ` Minchan Kim
@ 2016-01-28  7:16   ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 48+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-01-28  7:16 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: mtk.manpages, linux-kernel, linux-mm, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski

Hello Minchan,

On 11/30/2015 07:39 AM, Minchan Kim wrote:
> In v4, Andrew wanted to settle in old basic MADV_FREE and introduces
> new stuffs(ie, lazyfree LRU, swapless support and lazyfreeness) later
> so this version doesn't include them.
> 
> I have been tested it on mmotm-2015-11-25-17-08 with additional
> patch[1] from Kirill to prevent BUG_ON which he didn't send to
> linux-mm yet as formal patch. With it, I couldn't find any
> problem so far.
> 
> Note that this version is based on THP refcount redesign so
> I needed some modification on MADV_FREE because split_huge_pmd
> doesn't split a THP page any more and pmd_trans_huge(pmd) is not
> enough to guarantee the page is not THP page.
> As well, for MAVD_FREE lazy-split, THP split should respect
> pmd's dirtiness rather than marking ptes of all subpages dirty
> unconditionally. Please, review last patch in this patchset.

Now that MADV_FREE has been merged, would you be willing to write a
patch to the madvise(2) man page that describes the semantics,
notes limitations and restrictions, and (ideally) has some sentences
describing use cases?

Thanks,

Michael


> 	mm: don't split THP page when syscall is called
> 
> [1] https://lkml.org/lkml/2015/11/17/134
> 
> git: git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> branch: mm/madv_free-v4.4-rc2-mmotm-2015-11-25-17-08-v5r2
> 
> In this stage, I don't think we need to write man page.
> It could be done after solid policy and implementation.
> 
>  * Change from v4
>    * drop lazyfree LRU
>    * drop swapless support
>    * drop lazyfreeness
>    * rebase on recent mmotom with THP refcount redesign
> 
>  * Change from v3
>    * some bug fix
>    * code refactoring
>    * lazyfree reclaim logic change
>    * reordering patch
> 
>  * Change from v2
>    * vm_lazyfreeness tuning knob
>    * add new LRU list - Johannes, Shaohua
>    * support swapless - Johannes
> 
>  * Change from v1
>    * Don't do unnecessary TLB flush - Shaohua
>    * Added Acked-by - Hugh, Michal
>    * Merge deactivate_page and deactivate_file_page
>    * Add pmd_dirty/pmd_mkclean patches for several arches
>    * Add lazy THP split patch
>    * Drop zhangyanfei@cn.fujitsu.com - Delivery Failure
> 
> Chen Gang (1):
>   arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
>     architectures
> 
> Minchan Kim (11):
>   mm: support madvise(MADV_FREE)
>   mm: define MADV_FREE for some arches
>   mm: free swp_entry in madvise_free
>   mm: move lazily freed pages to inactive list
>   mm: mark stable page dirty in KSM
>   x86: add pmd_[dirty|mkclean] for THP
>   sparc: add pmd_[dirty|mkclean] for THP
>   powerpc: add pmd_[dirty|mkclean] for THP
>   arm: add pmd_mkclean for THP
>   arm64: add pmd_mkclean for THP
>   mm: don't split THP page when syscall is called
> 
>  arch/alpha/include/uapi/asm/mman.h       |   2 +
>  arch/arm/include/asm/pgtable-3level.h    |   1 +
>  arch/arm64/include/asm/pgtable.h         |   1 +
>  arch/mips/include/uapi/asm/mman.h        |   2 +
>  arch/parisc/include/uapi/asm/mman.h      |   2 +
>  arch/powerpc/include/asm/pgtable-ppc64.h |   2 +
>  arch/sparc/include/asm/pgtable_64.h      |   9 ++
>  arch/x86/include/asm/pgtable.h           |   5 +
>  arch/xtensa/include/uapi/asm/mman.h      |   2 +
>  include/linux/huge_mm.h                  |   3 +
>  include/linux/rmap.h                     |   1 +
>  include/linux/swap.h                     |   1 +
>  include/linux/vm_event_item.h            |   1 +
>  include/uapi/asm-generic/mman-common.h   |   1 +
>  mm/huge_memory.c                         |  87 +++++++++++++-
>  mm/ksm.c                                 |   6 +
>  mm/madvise.c                             | 199 +++++++++++++++++++++++++++++++
>  mm/rmap.c                                |   8 ++
>  mm/swap.c                                |  44 +++++++
>  mm/swap_state.c                          |   5 +-
>  mm/vmscan.c                              |  10 +-
>  mm/vmstat.c                              |   1 +
>  22 files changed, 383 insertions(+), 10 deletions(-)
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/12] MADV_FREE support
@ 2016-01-29  7:32     ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2016-01-29  7:32 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Andrew Morton, linux-kernel, linux-mm, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski

Hello Michael,

On Thu, Jan 28, 2016 at 08:16:25AM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Minchan,
> 
> On 11/30/2015 07:39 AM, Minchan Kim wrote:
> > In v4, Andrew wanted to settle in old basic MADV_FREE and introduces
> > new stuffs(ie, lazyfree LRU, swapless support and lazyfreeness) later
> > so this version doesn't include them.
> > 
> > I have been tested it on mmotm-2015-11-25-17-08 with additional
> > patch[1] from Kirill to prevent BUG_ON which he didn't send to
> > linux-mm yet as formal patch. With it, I couldn't find any
> > problem so far.
> > 
> > Note that this version is based on THP refcount redesign so
> > I needed some modification on MADV_FREE because split_huge_pmd
> > doesn't split a THP page any more and pmd_trans_huge(pmd) is not
> > enough to guarantee the page is not THP page.
> > As well, for MAVD_FREE lazy-split, THP split should respect
> > pmd's dirtiness rather than marking ptes of all subpages dirty
> > unconditionally. Please, review last patch in this patchset.
> 
> Now that MADV_FREE has been merged, would you be willing to write a
> patch to the madvise(2) man page that describes the semantics,
> notes limitations and restrictions, and (ideally) has some sentences
> describing use cases?

I will try next week.
Thanks for the heads up.

> 
> Thanks,
> 
> Michael
> 
> 
> > 	mm: don't split THP page when syscall is called
> > 
> > [1] https://lkml.org/lkml/2015/11/17/134
> > 
> > git: git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
> > branch: mm/madv_free-v4.4-rc2-mmotm-2015-11-25-17-08-v5r2
> > 
> > In this stage, I don't think we need to write man page.
> > It could be done after solid policy and implementation.
> > 
> >  * Change from v4
> >    * drop lazyfree LRU
> >    * drop swapless support
> >    * drop lazyfreeness
> >    * rebase on recent mmotom with THP refcount redesign
> > 
> >  * Change from v3
> >    * some bug fix
> >    * code refactoring
> >    * lazyfree reclaim logic change
> >    * reordering patch
> > 
> >  * Change from v2
> >    * vm_lazyfreeness tuning knob
> >    * add new LRU list - Johannes, Shaohua
> >    * support swapless - Johannes
> > 
> >  * Change from v1
> >    * Don't do unnecessary TLB flush - Shaohua
> >    * Added Acked-by - Hugh, Michal
> >    * Merge deactivate_page and deactivate_file_page
> >    * Add pmd_dirty/pmd_mkclean patches for several arches
> >    * Add lazy THP split patch
> >    * Drop zhangyanfei@cn.fujitsu.com - Delivery Failure
> > 
> > Chen Gang (1):
> >   arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
> >     architectures
> > 
> > Minchan Kim (11):
> >   mm: support madvise(MADV_FREE)
> >   mm: define MADV_FREE for some arches
> >   mm: free swp_entry in madvise_free
> >   mm: move lazily freed pages to inactive list
> >   mm: mark stable page dirty in KSM
> >   x86: add pmd_[dirty|mkclean] for THP
> >   sparc: add pmd_[dirty|mkclean] for THP
> >   powerpc: add pmd_[dirty|mkclean] for THP
> >   arm: add pmd_mkclean for THP
> >   arm64: add pmd_mkclean for THP
> >   mm: don't split THP page when syscall is called
> > 
> >  arch/alpha/include/uapi/asm/mman.h       |   2 +
> >  arch/arm/include/asm/pgtable-3level.h    |   1 +
> >  arch/arm64/include/asm/pgtable.h         |   1 +
> >  arch/mips/include/uapi/asm/mman.h        |   2 +
> >  arch/parisc/include/uapi/asm/mman.h      |   2 +
> >  arch/powerpc/include/asm/pgtable-ppc64.h |   2 +
> >  arch/sparc/include/asm/pgtable_64.h      |   9 ++
> >  arch/x86/include/asm/pgtable.h           |   5 +
> >  arch/xtensa/include/uapi/asm/mman.h      |   2 +
> >  include/linux/huge_mm.h                  |   3 +
> >  include/linux/rmap.h                     |   1 +
> >  include/linux/swap.h                     |   1 +
> >  include/linux/vm_event_item.h            |   1 +
> >  include/uapi/asm-generic/mman-common.h   |   1 +
> >  mm/huge_memory.c                         |  87 +++++++++++++-
> >  mm/ksm.c                                 |   6 +
> >  mm/madvise.c                             | 199 +++++++++++++++++++++++++++++++
> >  mm/rmap.c                                |   8 ++
> >  mm/swap.c                                |  44 +++++++
> >  mm/swap_state.c                          |   5 +-
> >  mm/vmscan.c                              |  10 +-
> >  mm/vmstat.c                              |   1 +
> >  22 files changed, 383 insertions(+), 10 deletions(-)
> > 
> 
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/12] MADV_FREE support
  2016-02-06 13:32   ` Michael Kerrisk (man-pages)
@ 2016-02-07 12:31     ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2016-02-07 12:31 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Andrew Morton, linux-kernel, linux-mm, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski

On Sat, Feb 06, 2016 at 02:32:02PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Minchan,
> 
> On 02/05/2016 03:15 AM, Minchan Kim wrote:
> > On Thu, Jan 28, 2016 at 08:16:25AM +0100, Michael Kerrisk (man-pages) wrote:
> >> Hello Minchan,
> >>
> >> On 11/30/2015 07:39 AM, Minchan Kim wrote:
> >>> In v4, Andrew wanted to settle in old basic MADV_FREE and introduces
> >>> new stuffs(ie, lazyfree LRU, swapless support and lazyfreeness) later
> >>> so this version doesn't include them.
> >>>
> >>> I have been tested it on mmotm-2015-11-25-17-08 with additional
> >>> patch[1] from Kirill to prevent BUG_ON which he didn't send to
> >>> linux-mm yet as formal patch. With it, I couldn't find any
> >>> problem so far.
> >>>
> >>> Note that this version is based on THP refcount redesign so
> >>> I needed some modification on MADV_FREE because split_huge_pmd
> >>> doesn't split a THP page any more and pmd_trans_huge(pmd) is not
> >>> enough to guarantee the page is not THP page.
> >>> As well, for MAVD_FREE lazy-split, THP split should respect
> >>> pmd's dirtiness rather than marking ptes of all subpages dirty
> >>> unconditionally. Please, review last patch in this patchset.
> >>
> >> Now that MADV_FREE has been merged, would you be willing to write a
> >> patch to the madvise(2) man page that describes the semantics,
> >> notes limitations and restrictions, and (ideally) has some sentences
> >> describing use cases?
> >>
> > 
> > Hello Michael,
> > 
> > Could you review this patch?
> > 
> > Thanks.
> > 
> > From 203372f901f574e991215fdff6907608ba53f932 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Fri, 5 Feb 2016 11:09:54 +0900
> > Subject: [PATCH] madvise.2: Add MADV_FREE
> > 
> > Document the MADV_FREE flags added to madvise() in Linux 4.5
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  man2/madvise.2 | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/man2/madvise.2 b/man2/madvise.2
> > index c1df67c..4704304 100644
> > --- a/man2/madvise.2
> > +++ b/man2/madvise.2
> > @@ -143,6 +143,25 @@ flag are special memory areas that are not managed
> >  by the virtual memory subsystem.
> >  Such pages are typically created by device drivers that
> >  map the pages into user space.)
> > +.TP
> > +.B MADV_FREE " (since Linux 4.5)"
> > +Application is finished with the given range, so kernel can free
> > +resources associated with it but the freeing could be delayed until
> > +memory pressure happens or canceld by write operation by user.
> > +
> > +After a successful MADV_FREE operation, user shouldn't expect kernel
> > +keeps stale data on the page. However, subsequent write of pages
> > +in the range will succeed and then kernel cannot free those dirtied pages
> > +so user can always see just written data. If there was no subsequent
> > +write, kernel can free those clean pages any time. In such case,
> > +user can see zero-fill-on-demand pages.
> > +
> > +Note that, it works only with private anonymous pages (see
> > +.BR mmap (2)).
> > +On swapless system, freeing pages in given range happens instantly
> > +regardless of memory pressure.
> > +
> > +
> >  .\"
> >  .\" ======================================================================
> >  .\"
> > 
> 
> Thanks for the nice text! I reworked it somewhat, trying to fill out a
> few details about how I understand things work, but I may have introduced
> errors, so I would be happy if you would check the following text:

Below looks good to me.
Thanks, Michael

> 
>        MADV_FREE (since Linux 4.5)
>               The  application  no  longer  requires  the pages in the
>               range specified by addr and len.  The  kernel  can  thus
>               free these pages, but the freeing could be delayed until
>               memory pressure occurs.  For each of the pages that  has
>               been  marked to be freed but has not yet been freed, the
>               free operation will be canceled  if  the  caller  writes
>               into  the page.  After a successful MADV_FREE operation,
>               any stale data (i.e., dirty, unwritten  pages)  will  be
>               lost  when  the kernel frees the pages.  However, subse‐
>               quent writes to pages in the range will succeed and then
>               kernel  cannot  free  those  dirtied  pages, so that the
>               caller can always see just written data.  If there is no
>               subsequent  write,  the kernel can free the pages at any
>               time.  Once pages in the  range  have  been  freed,  the
>               caller  will  see  zero-fill-on-demand pages upon subse‐
>               quent page references.
> 
>               The MADV_FREE operation can be applied only  to  private
>               anonymous  pages  (see  mmap(2)).  On a swapless system,
>               freeing  pages  in  a  given  range  happens  instantly,
>               regardless of memory pressure.
> 
> Thanks,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/12] MADV_FREE support
  2016-02-05  2:15 ` Minchan Kim
  (?)
@ 2016-02-06 13:32   ` Michael Kerrisk (man-pages)
  -1 siblings, 0 replies; 48+ messages in thread
From: Michael Kerrisk (man-pages) @ 2016-02-06 13:32 UTC (permalink / raw)
  To: Minchan Kim
  Cc: mtk.manpages, Andrew Morton, linux-kernel, linux-mm, linux-api,
	Hugh Dickins, Johannes Weiner, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Shaohua Li, Michal Hocko, yalin.wang2010, Andy Lutomirski

Hello Minchan,

On 02/05/2016 03:15 AM, Minchan Kim wrote:
> On Thu, Jan 28, 2016 at 08:16:25AM +0100, Michael Kerrisk (man-pages) wrote:
>> Hello Minchan,
>>
>> On 11/30/2015 07:39 AM, Minchan Kim wrote:
>>> In v4, Andrew wanted to settle in old basic MADV_FREE and introduces
>>> new stuffs(ie, lazyfree LRU, swapless support and lazyfreeness) later
>>> so this version doesn't include them.
>>>
>>> I have been tested it on mmotm-2015-11-25-17-08 with additional
>>> patch[1] from Kirill to prevent BUG_ON which he didn't send to
>>> linux-mm yet as formal patch. With it, I couldn't find any
>>> problem so far.
>>>
>>> Note that this version is based on THP refcount redesign so
>>> I needed some modification on MADV_FREE because split_huge_pmd
>>> doesn't split a THP page any more and pmd_trans_huge(pmd) is not
>>> enough to guarantee the page is not THP page.
>>> As well, for MAVD_FREE lazy-split, THP split should respect
>>> pmd's dirtiness rather than marking ptes of all subpages dirty
>>> unconditionally. Please, review last patch in this patchset.
>>
>> Now that MADV_FREE has been merged, would you be willing to write a
>> patch to the madvise(2) man page that describes the semantics,
>> notes limitations and restrictions, and (ideally) has some sentences
>> describing use cases?
>>
> 
> Hello Michael,
> 
> Could you review this patch?
> 
> Thanks.
> 
> From 203372f901f574e991215fdff6907608ba53f932 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Fri, 5 Feb 2016 11:09:54 +0900
> Subject: [PATCH] madvise.2: Add MADV_FREE
> 
> Document the MADV_FREE flags added to madvise() in Linux 4.5
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  man2/madvise.2 | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index c1df67c..4704304 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -143,6 +143,25 @@ flag are special memory areas that are not managed
>  by the virtual memory subsystem.
>  Such pages are typically created by device drivers that
>  map the pages into user space.)
> +.TP
> +.B MADV_FREE " (since Linux 4.5)"
> +Application is finished with the given range, so kernel can free
> +resources associated with it but the freeing could be delayed until
> +memory pressure happens or canceld by write operation by user.
> +
> +After a successful MADV_FREE operation, user shouldn't expect kernel
> +keeps stale data on the page. However, subsequent write of pages
> +in the range will succeed and then kernel cannot free those dirtied pages
> +so user can always see just written data. If there was no subsequent
> +write, kernel can free those clean pages any time. In such case,
> +user can see zero-fill-on-demand pages.
> +
> +Note that, it works only with private anonymous pages (see
> +.BR mmap (2)).
> +On swapless system, freeing pages in given range happens instantly
> +regardless of memory pressure.
> +
> +
>  .\"
>  .\" ======================================================================
>  .\"
> 

Thanks for the nice text! I reworked it somewhat, trying to fill out a
few details about how I understand things work, but I may have introduced
errors, so I would be happy if you would check the following text:

       MADV_FREE (since Linux 4.5)
              The  application  no  longer  requires  the pages in the
              range specified by addr and len.  The  kernel  can  thus
              free these pages, but the freeing could be delayed until
              memory pressure occurs.  For each of the pages that  has
              been  marked to be freed but has not yet been freed, the
              free operation will be canceled  if  the  caller  writes
              into  the page.  After a successful MADV_FREE operation,
              any stale data (i.e., dirty, unwritten  pages)  will  be
              lost  when  the kernel frees the pages.  However, subse‐
              quent writes to pages in the range will succeed and then
              kernel  cannot  free  those  dirtied  pages, so that the
              caller can always see just written data.  If there is no
              subsequent  write,  the kernel can free the pages at any
              time.  Once pages in the  range  have  been  freed,  the
              caller  will  see  zero-fill-on-demand pages upon subse‐
              quent page references.

              The MADV_FREE operation can be applied only  to  private
              anonymous  pages  (see  mmap(2)).  On a swapless system,
              freeing  pages  in  a  given  range  happens  instantly,
              regardless of memory pressure.
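
For concreteness, a minimal userspace sketch of these semantics. This is
only an illustration: the mapping size is arbitrary, error handling is
abbreviated, and the fallback define assumes the asm-generic MADV_FREE
value from this series, for headers that predate Linux 4.5:

	#include <string.h>
	#include <sys/mman.h>

	#ifndef MADV_FREE
	#define MADV_FREE 8	/* asm-generic value, Linux >= 4.5 */
	#endif

	int main(void)
	{
		size_t len = 16 * 4096;

		/* MADV_FREE applies only to private anonymous pages */
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		memset(buf, 0xaa, len);		/* dirty the pages */

		/* contents no longer needed; the pages may now be
		 * freed lazily when memory pressure occurs */
		if (madvise(buf, len, MADV_FREE) != 0)
			return 1;

		/* a later write cancels the pending free for that
		 * page; an untouched page may come back as
		 * zero-fill-on-demand */
		buf[0] = 1;

		munmap(buf, len);
		return 0;
	}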

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 00/12] MADV_FREE support
@ 2016-02-05  2:15 ` Minchan Kim
  0 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2016-02-05  2:15 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Andrew Morton, linux-kernel, linux-mm, linux-api, Hugh Dickins,
	Johannes Weiner, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, Shaohua Li,
	Michal Hocko, yalin.wang2010, Andy Lutomirski

On Thu, Jan 28, 2016 at 08:16:25AM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Minchan,
> 
> On 11/30/2015 07:39 AM, Minchan Kim wrote:
> > In v4, Andrew wanted to settle in old basic MADV_FREE and introduces
> > new stuffs(ie, lazyfree LRU, swapless support and lazyfreeness) later
> > so this version doesn't include them.
> > 
> > I have been tested it on mmotm-2015-11-25-17-08 with additional
> > patch[1] from Kirill to prevent BUG_ON which he didn't send to
> > linux-mm yet as formal patch. With it, I couldn't find any
> > problem so far.
> > 
> > Note that this version is based on THP refcount redesign so
> > I needed some modification on MADV_FREE because split_huge_pmd
> > doesn't split a THP page any more and pmd_trans_huge(pmd) is not
> > enough to guarantee the page is not THP page.
> > As well, for MAVD_FREE lazy-split, THP split should respect
> > pmd's dirtiness rather than marking ptes of all subpages dirty
> > unconditionally. Please, review last patch in this patchset.
> 
> Now that MADV_FREE has been merged, would you be willing to write a
> patch to the madvise(2) man page that describes the semantics,
> notes limitations and restrictions, and (ideally) has some sentences
> describing use cases?
> 

Hello Michael,

Could you review this patch?

Thanks.

From 203372f901f574e991215fdff6907608ba53f932 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 5 Feb 2016 11:09:54 +0900
Subject: [PATCH] madvise.2: Add MADV_FREE

Document the MADV_FREE flag added to madvise() in Linux 4.5.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 man2/madvise.2 | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/man2/madvise.2 b/man2/madvise.2
index c1df67c..4704304 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -143,6 +143,25 @@ flag are special memory areas that are not managed
 by the virtual memory subsystem.
 Such pages are typically created by device drivers that
 map the pages into user space.)
+.TP
+.BR MADV_FREE " (since Linux 4.5)"
+The application is finished with the given range, so the kernel can
+free resources associated with it, but the freeing may be delayed
+until memory pressure occurs or be canceled by a subsequent write
+from the user.
+
+After a successful MADV_FREE operation, the user should not expect
+the kernel to keep stale data in the pages.
+However, a subsequent write to pages in the range will succeed, and
+the kernel then cannot free those dirtied pages, so the user always
+sees just-written data.
+If there was no subsequent write, the kernel can free the clean pages
+at any time; in that case, the user sees zero-fill-on-demand pages.
+
+Note that this operation works only with private anonymous pages (see
+.BR mmap (2)).
+On a swapless system, freeing pages in the given range happens
+instantly, regardless of memory pressure.
 .\"
 .\" ======================================================================
 .\"
-- 
1.9.1
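
One detail that may help the wording: the observable contrast with
MADV_DONTNEED. A hedged sketch (under the semantics above, the read
after MADV_FREE may legitimately return either the old byte or zero,
depending on whether reclaim has run; the read after MADV_DONTNEED
always returns zero):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* assumed asm-generic fallback */
#endif

/* Map one private anonymous page and dirty it. */
static char *dirty_page(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    memset(p, 'x', 4096);
    return p;
}

int main(void)
{
    char *a = dirty_page();
    char *b = dirty_page();
    if (a == NULL || b == NULL)
        return 1;

    /* MADV_DONTNEED drops the contents immediately: the next read
     * faults in a fresh zero page, unconditionally. */
    if (madvise(a, 4096, MADV_DONTNEED) == -1)
        perror("madvise(MADV_DONTNEED)");
    printf("DONTNEED read: %d (always 0)\n", a[0]);

    /* MADV_FREE only marks the page reclaimable: until memory
     * pressure reclaims it, a read still sees 'x'; afterwards it
     * sees 0.  Either outcome is allowed by the semantics. */
    if (madvise(b, 4096, MADV_FREE) == -1)
        perror("madvise(MADV_FREE)");
    printf("FREE read: %d ('x' or 0, timing-dependent)\n", b[0]);

    return 0;
}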


Thread overview: 48+ messages
2015-11-30  6:39 [PATCH v5 00/12] MADV_FREE support Minchan Kim
2015-11-30  6:39 ` [PATCH v5 01/12] mm: support madvise(MADV_FREE) Minchan Kim
2015-11-30  8:20   ` Mika Penttilä
2015-11-30  9:22     ` Minchan Kim
2015-11-30  6:39 ` [PATCH v5 02/12] mm: define MADV_FREE for some arches Minchan Kim
2015-11-30  6:39 ` [PATCH v5 03/12] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures Minchan Kim
2015-11-30  6:39 ` [PATCH v5 04/12] mm: free swp_entry in madvise_free Minchan Kim
2015-11-30  6:39 ` [PATCH v5 05/12] mm: move lazily freed pages to inactive list Minchan Kim
2015-11-30  6:39 ` [PATCH v5 06/12] mm: mark stable page dirty in KSM Minchan Kim
2015-11-30  6:39 ` [PATCH v5 07/12] x86: add pmd_[dirty|mkclean] for THP Minchan Kim
2015-11-30  6:39 ` [PATCH v5 08/12] sparc: " Minchan Kim
2015-11-30  6:39 ` [PATCH v5 09/12] powerpc: " Minchan Kim
2015-11-30  6:39 ` [PATCH v5 10/12] arm: add pmd_mkclean " Minchan Kim
2015-11-30  6:39 ` [PATCH v5 11/12] arm64: " Minchan Kim
2015-11-30  6:39 ` [PATCH v5 12/12] mm: don't split THP page when syscall is called Minchan Kim
2016-01-28  7:16 ` [PATCH v5 00/12] MADV_FREE support Michael Kerrisk (man-pages)
2016-01-29  7:32   ` Minchan Kim
2016-02-05  2:15 Minchan Kim
2016-02-06 13:32 ` Michael Kerrisk (man-pages)
2016-02-07 12:31   ` Minchan Kim
