* [PATCH 0/8] MADV_FREE support
@ 2015-10-30  7:01 ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim

MADV_FREE has been sitting in linux-next for a long time. I think there
were two reasons for that.

1. The MADV_FREE code on the reclaim path was a real mess.

2. Andrew really wanted to hear from userland people who want to use
   the syscall.

A few months ago, Daniel Micay (an active jemalloc contributor) asked me
to make progress on upstreaming, but I was busy at the time, so it took
a long while to revisit the code. I finally cleaned up the mess
recently, which solves issue #1.

In addition, Daniel and Jason (the jemalloc maintainer) recently asked
Andrew for it again and said it would be great to have even though it
currently has a swap dependency, so Andrew decided to take it for v4.4.

When I tested the MADV_FREE patches on a recent mmotm, there were some
problems with the THP refcount redesign, which made long-running tests
hard. There is also an ordering dependency, because the MADV_FREE
patches sit after the THP refcount redesign in mmotm, so I discussed it
with Andrew in the hallway at this Kernel Summit and we decided to send
the patchset based on v4.3-rc7.

I have been testing it on v4.3-rc7 and haven't found any problems so far.

In the meantime, Hugh reviewed all of the code and asked me to tidy up
the many MADV_FREE-related patches in mmotm, so this is the result of
that request.

In this version, I dropped an enhancement patch:

        mm: don't split THP page when syscall is called

It could delay splitting a THP page until the reclaim path, but it made
madvise_free a no-op because all pages (head + sub pages) get marked
PG_dirty and pte_mkdirty when the split happens.

I will look into making the THP split inherit the pmd's dirtiness into
the subpages and removing the unconditional PG_dirty marking in
__split_huge_page_refcount, but that is rather risky to do right now
(i.e., close to the merge window, and Kirill is changing that code a
lot), so I want to do it after the merge window closes. Since we aren't
doing it now, we don't need the following patches either:

	x86-add-pmd_-for-thp.patch
	x86-add-pmd_-for-thp-fix.patch
	sparc-add-pmd_-for-thp.patch
	sparc-add-pmd_-for-thp-fix.patch
	powerpc-add-pmd_-for-thp.patch
	arm-add-pmd_mkclean-for-thp.patch
	arm64-add-pmd_-for-thp.patch

So I dropped those patches in this version as well and will resend them
when I send the lazy THP split patch.

There are three final modifications since I sent the cleanup patchset
(i.e., the MADV_FREE refactoring and the KSM page fix):

 1. Replaced the description and comment of the KSM fix patch with
    Hugh's suggestion
 2. Avoid forcing SetPageDirty in try_to_unmap_one to avoid clean page
    swapout, from Yalin
 3. Added a uapi patch to make the value of MADV_FREE the same for all
    arches, from Chen

About item 3: I included it because I thought it was good and Andrew
simply missed the patch at the time. But reading the quilt series file
now, it seems Shaohua had some problem with it, though I couldn't find
any mail about it in my mailbox. If there is something wrong with it,
please tell us. The note in the series file reads:

#mm-support-madvisemadv_free.patch: other-arch syscall numbering mess ("arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures"). Shaohua Li <shli@kernel.org> testing disasters.

TODO: I will send a man-page patch if this lands in v4.4.

Andrew, you could replace all of the MADV_FREE-related patches with
this series. IOW, these are:

	# MADV_FREE stuff:
	x86-add-pmd_-for-thp.patch
	x86-add-pmd_-for-thp-fix.patch
	sparc-add-pmd_-for-thp.patch
	sparc-add-pmd_-for-thp-fix.patch
	powerpc-add-pmd_-for-thp.patch
	arm-add-pmd_mkclean-for-thp.patch
	arm64-add-pmd_-for-thp.patch

	mm-support-madvisemadv_free.patch
	mm-support-madvisemadv_free-fix.patch
	mm-support-madvisemadv_free-fix-2.patch
	mm-support-madvisemadv_free-fix-3.patch
	mm-support-madvisemadv_free-vs-thp-rename-split_huge_page_pmd-to-split_huge_pmd.patch
	mm-support-madvisemadv_free-fix-5.patch
	mm-support-madvisemadv_free-fix-6.patch
	mm-mark-stable-page-dirty-in-ksm.patch
	mm-dont-split-thp-page-when-syscall-is-called.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix-2.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix-3.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix-4.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix-5.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch
	mm-dont-split-thp-page-when-syscall-is-called-fix-6-fix.patch
	mm-free-swp_entry-in-madvise_free.patch
	mm-move-lazy-free-pages-to-inactive-list.patch
	mm-move-lazy-free-pages-to-inactive-list-fix.patch
	mm-move-lazy-free-pages-to-inactive-list-fix-fix.patch
	mm-move-lazy-free-pages-to-inactive-list-fix-fix-fix.patch

Chen Gang (1):
  arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
    architectures

Minchan Kim (7):
  mm: support madvise(MADV_FREE)
  mm: define MADV_FREE for some arches
  mm: free swp_entry in madvise_free
  mm: move lazily freed pages to inactive list
  mm: lru_deactivate_fn should clear PG_referenced
  mm: clear PG_dirty to mark page freeable
  mm: mark stable page dirty in KSM

 arch/alpha/include/uapi/asm/mman.h     |   1 +
 arch/mips/include/uapi/asm/mman.h      |   1 +
 arch/parisc/include/uapi/asm/mman.h    |   1 +
 arch/xtensa/include/uapi/asm/mman.h    |   1 +
 include/linux/rmap.h                   |   1 +
 include/linux/swap.h                   |   1 +
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/ksm.c                               |   6 ++
 mm/madvise.c                           | 162 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |   7 ++
 mm/swap.c                              |  44 +++++++++
 mm/swap_state.c                        |   5 +-
 mm/vmscan.c                            |  10 +-
 mm/vmstat.c                            |   1 +
 15 files changed, 238 insertions(+), 5 deletions(-)

-- 
1.9.1


* [PATCH 1/8] mm: support madvise(MADV_FREE)
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim

Linux doesn't have the ability to free pages lazily, while other OSes
have long supported it via madvise(MADV_FREE).

The gain is clear: the kernel can discard freed pages rather than
swapping them out or OOMing when memory pressure happens.

Without memory pressure, freed pages can be reused by userspace without
additional overhead (e.g., page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit of the
ptes in the range. If memory pressure happens, the VM checks the dirty
bit in the page table; if it is still "clean", the page is a "lazyfree"
page, so the VM can discard it instead of swapping it out. If there was
a store to the page before the VM picked it for reclaim, the dirty bit
is set, so the VM swaps the page out instead of discarding it.

Initially, the heavy users will be general-purpose allocators (e.g.,
jemalloc and tcmalloc, and hopefully glibc will support it too);
jemalloc and tcmalloc already support the feature on other OSes (e.g.,
FreeBSD).
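
To make the intended usage concrete, here is a minimal userspace sketch
(my illustration, not part of this patch; chunk_release()/chunk_reuse()
are hypothetical allocator hooks, and MADV_FREE is assumed to come from
the installed uapi headers):

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical allocator hook: release a chunk lazily instead of unmapping. */
void chunk_release(void *addr, size_t len)
{
	/*
	 * The range stays mapped; the kernel reclaims these pages only
	 * under memory pressure, and only while their ptes stay clean.
	 */
	if (madvise(addr, len, MADV_FREE) != 0)
		madvise(addr, len, MADV_DONTNEED); /* fallback for kernels without MADV_FREE */
}

/* Hypothetical allocator hook: hand the chunk back out later. */
void chunk_reuse(void *addr, size_t len)
{
	/*
	 * A plain store re-dirties the ptes, so from here on the VM swaps
	 * the pages out instead of discarding them.  Contents are not
	 * preserved across MADV_FREE, so reinitialize before use.
	 */
	memset(addr, 0, len);
}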

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 2
Stepping:              3
CPU MHz:               3200.185
BogoMIPS:              6400.53
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

 vanilla-jemalloc		MADV_free-jemalloc

1 thread
records: 10			    records: 10
avg:	2961.90			    avg:   12069.70
std:	  71.96(2.43%)		    std:     186.68(1.55%)
max:	3070.00			    max:   12385.00
min:	2796.00			    min:   11746.00

2 thread
records: 10			    records: 10
avg:	5020.00			    avg:   17827.00
std:	 264.87(5.28%)		    std:     358.52(2.01%)
max:	5244.00			    max:   18760.00
min:	4251.00			    min:   17382.00

4 thread
records: 10			    records: 10
avg:	8988.80			    avg:   27930.80
std:	1175.33(13.08%)		    std:    3317.33(11.88%)
max:	9508.00			    max:   30879.00
min:	5477.00			    min:   21024.00

8 thread
records: 10			    records: 10
avg:   13036.50			    avg:   33739.40
std:	 170.67(1.31%)		    std:    5146.22(15.25%)
max:   13371.00			    max:   40572.00
min:   12785.00			    min:   24088.00

16 thread
records: 10			    records: 10
avg:   11092.40			    avg:   31424.20
std:	 710.60(6.41%)		    std:    3763.89(11.98%)
max:   12446.00			    max:   36635.00
min:	9949.00			    min:   25669.00

32 thread
records: 10			    records: 10
avg:   11067.00			    avg:   34495.80
std:	 971.06(8.77%)		    std:    2721.36(7.89%)
max:   12010.00			    max:   38598.00
min:	9002.00			    min:   30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/rmap.h                   |   1 +
 include/linux/vm_event_item.h          |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c                           | 128 +++++++++++++++++++++++++++++++++
 mm/rmap.c                              |   7 ++
 mm/swap_state.c                        |   5 +-
 mm/vmscan.c                            |  10 ++-
 mm/vmstat.c                            |   1 +
 8 files changed, 149 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446aeef36e..f4c992826242 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
 	TTU_UNMAP = 1,			/* unmap mode */
 	TTU_MIGRATION = 2,		/* migration mode */
 	TTU_MUNLOCK = 4,		/* munlock mode */
+	TTU_FREE = 8,			/* free mode */
 
 	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
 	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9246d32dc973..2b1cef88b827 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		PGFREE, PGACTIVATE, PGDEACTIVATE,
 		PGFAULT, PGMAJFAULT,
+		PGLAZYFREED,
 		FOR_ALL_ZONES(PGREFILL),
 		FOR_ALL_ZONES(PGSTEAL_KSWAPD),
 		FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..640311704e31 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
 #include <linux/backing-dev.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,121 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	return 0;
 }
 
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+				unsigned long end, struct mm_walk *walk)
+
+{
+	struct mmu_gather *tlb = walk->private;
+	struct mm_struct *mm = tlb->mm;
+	struct vm_area_struct *vma = walk->vma;
+	spinlock_t *ptl;
+	pte_t *pte, ptent;
+	struct page *page;
+
+	split_huge_page_pmd(vma, addr, pmd);
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	arch_enter_lazy_mmu_mode();
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		ptent = *pte;
+
+		if (!pte_present(ptent))
+			continue;
+
+		page = vm_normal_page(vma, addr, ptent);
+		if (!page)
+			continue;
+
+		if (PageSwapCache(page)) {
+			if (!trylock_page(page))
+				continue;
+
+			if (!try_to_free_swap(page)) {
+				unlock_page(page);
+				continue;
+			}
+
+			ClearPageDirty(page);
+			unlock_page(page);
+		}
+
+		/*
+		 * Some of architecture(ex, PPC) don't update TLB
+		 * with set_pte_at and tlb_remove_tlb_entry so for
+		 * the portability, remap the pte with old|clean
+		 * after pte clearing.
+		 */
+		ptent = ptep_get_and_clear_full(mm, addr, pte,
+						tlb->fullmm);
+		ptent = pte_mkold(ptent);
+		ptent = pte_mkclean(ptent);
+		set_pte_at(mm, addr, pte, ptent);
+		tlb_remove_tlb_entry(tlb, pte, addr);
+	}
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+	return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end)
+{
+	struct mm_walk free_walk = {
+		.pmd_entry = madvise_free_pte_range,
+		.mm = vma->vm_mm,
+		.private = tlb,
+	};
+
+	tlb_start_vma(tlb, vma);
+	walk_page_range(addr, end, &free_walk);
+	tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+			unsigned long start_addr, unsigned long end_addr)
+{
+	unsigned long start, end;
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_gather tlb;
+
+	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+		return -EINVAL;
+
+	/* MADV_FREE works for only anon vma at the moment */
+	if (!vma_is_anonymous(vma))
+		return -EINVAL;
+
+	start = max(vma->vm_start, start_addr);
+	if (start >= vma->vm_end)
+		return -EINVAL;
+	end = min(vma->vm_end, end_addr);
+	if (end <= vma->vm_start)
+		return -EINVAL;
+
+	lru_add_drain();
+	tlb_gather_mmu(&tlb, mm, start, end);
+	update_hiwater_rss(mm);
+
+	mmu_notifier_invalidate_range_start(mm, start, end);
+	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range_end(mm, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+
+	return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	return madvise_free_single_vma(vma, start, end);
+}
+
 /*
  * Application no longer needs these pages.  If the pages are dirty,
  * it's OK to just throw them away.  The app will be more careful about
@@ -379,6 +498,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_remove(vma, prev, start, end);
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
+	case MADV_FREE:
+		/*
+		 * XXX: In this implementation, MADV_FREE works like
+		 * MADV_DONTNEED on swapless system or full swap.
+		 */
+		if (get_nr_swap_pages() > 0)
+			return madvise_free(vma, prev, start, end);
+		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
@@ -398,6 +525,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_REMOVE:
 	case MADV_WILLNEED:
 	case MADV_DONTNEED:
+	case MADV_FREE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f3dcd7..9449e91839ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		swp_entry_t entry = { .val = page_private(page) };
 		pte_t swp_pte;
 
+		if (!PageDirty(page) && (flags & TTU_FREE)) {
+			/* It's a freeable page by MADV_FREE */
+			dec_mm_counter(mm, MM_ANONPAGES);
+			goto discard;
+		}
+
 		if (PageSwapCache(page)) {
 			/*
 			 * Store the swap location in the pte.
@@ -1414,6 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	} else
 		dec_mm_counter(mm, MM_FILEPAGES);
 
+discard:
 	page_remove_rmap(page);
 	page_cache_release(page);
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d504adb7fa5f..10f63eded7b7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
 	 * deadlock in the swap out path.
 	 */
 	/*
-	 * Add it to the swap cache and mark it dirty
+	 * Add it to the swap cache.
 	 */
 	err = add_to_swap_cache(page, entry,
 			__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
 
-	if (!err) {	/* Success */
-		SetPageDirty(page);
+	if (!err) {
 		return 1;
 	} else {	/* -ENOMEM radix-tree allocation failure */
 		/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f63a9381f71..7a415b9fdd34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -906,6 +906,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
+		bool freeable = false;
 
 		cond_resched();
 
@@ -1049,6 +1050,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
+			freeable = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1060,8 +1062,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (try_to_unmap(page,
-					ttu_flags|TTU_BATCH_FLUSH)) {
+			switch (try_to_unmap(page, freeable ?
+				(ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
+				(ttu_flags | TTU_BATCH_FLUSH))) {
 			case SWAP_FAIL:
 				goto activate_locked;
 			case SWAP_AGAIN:
@@ -1186,6 +1189,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__clear_page_locked(page);
 free_it:
+		if (freeable && !PageDirty(page))
+			count_vm_event(PGLAZYFREED);
+
 		nr_reclaimed++;
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fbf14485a049..59d45b22355f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -759,6 +759,7 @@ const char * const vmstat_text[] = {
 
 	"pgfault",
 	"pgmajfault",
+	"pglazyfreed",
 
 	TEXTS_FOR_ZONES("pgrefill")
 	TEXTS_FOR_ZONES("pgsteal_kswapd")
-- 
1.9.1


* [PATCH 2/8] mm: define MADV_FREE for some arches
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim,
	Richard Henderson, Ivan Kokshaysky, James E.J. Bottomley,
	Helge Deller, Ralf Baechle, Chris Zankel, Max Filippov,
	kbuild test robot

Most architectures use asm-generic, but alpha, mips, parisc, and xtensa
need their own definitions.

This patch defines MADV_FREE for them, so it should fix the build
breakage on those architectures.

Maybe I should split this up and feed the pieces to the arch
maintainers, but it is included here for mmotm convenience.

Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/alpha/include/uapi/asm/mman.h  | 1 +
 arch/mips/include/uapi/asm/mman.h   | 1 +
 arch/parisc/include/uapi/asm/mman.h | 1 +
 arch/xtensa/include/uapi/asm/mman.h | 1 +
 4 files changed, 4 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b472bc2b..836fbd44f65b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,6 +44,7 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
+#define MADV_FREE	7		/* free pages only if memory pressure */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876cae6b..106e741aa7ee 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,6 +67,7 @@
 #define MADV_SEQUENTIAL 2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251ca7b2..6cb8db76fd4e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,6 +40,7 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 
 /* common/generic parameters */
 #define MADV_REMOVE	9		/* remove these pages & resources */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0e0446..1b19f25bc567 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,6 +80,7 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
+#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_REMOVE	9		/* remove these pages & resources */
-- 
1.9.1


* [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Chen Gang, rth, ink,
	mattst88, Ralf Baechle, jejb, deller, chris, jcmvbkbc,
	Arnd Bergmann, linux-arch, Minchan Kim

From: Chen Gang <gang.chen.5i5j@gmail.com>

For the uapi headers, each macro should have the same value on every
architecture where possible.  MADV_FREE was added to the main branch only
recently, so redefine its value now, before it becomes ABI.

At present, the value '8' can be shared by all architectures, so redefine
MADV_FREE to '8' everywhere.
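
As a rough illustration (not part of this patch), a caller built against
older userspace headers could carry a fallback definition and detect
missing kernel support at run time.  The sketch below assumes such a
fallback define, which simply mirrors the value standardized here:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#ifndef MADV_FREE
#define MADV_FREE	8	/* fallback for pre-MADV_FREE headers */
#endif

int main(void)
{
	size_t len = 4UL << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, len);

	/* Hint that the range may be lazily freed under memory pressure. */
	if (madvise(buf, len, MADV_FREE)) {
		/* EINVAL here most likely means the kernel lacks MADV_FREE. */
		fprintf(stderr, "MADV_FREE unsupported: %s\n", strerror(errno));
		madvise(buf, len, MADV_DONTNEED);	/* conservative fallback */
	}

	munmap(buf, len);
	return 0;
}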

Cc: rth@twiddle.net <rth@twiddle.net>,
Cc: ink@jurassic.park.msu.ru <ink@jurassic.park.msu.ru>
Cc: mattst88@gmail.com <mattst88@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: jejb@parisc-linux.org <jejb@parisc-linux.org>
Cc: deller@gmx.de <deller@gmx.de>
Cc: chris@zankel.net <chris@zankel.net>
Cc: jcmvbkbc@gmail.com <jcmvbkbc@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-arch@vger.kernel.org
Cc: linux-api@vger.kernel.org
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
---
 arch/alpha/include/uapi/asm/mman.h     | 2 +-
 arch/mips/include/uapi/asm/mman.h      | 2 +-
 arch/parisc/include/uapi/asm/mman.h    | 2 +-
 arch/xtensa/include/uapi/asm/mman.h    | 2 +-
 include/uapi/asm-generic/mman-common.h | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 836fbd44f65b..0b8a5de7aee3 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,9 +44,9 @@
 #define MADV_WILLNEED	3		/* will need these pages */
 #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
 #define MADV_DONTNEED	6		/* don't need these pages */
-#define MADV_FREE	7		/* free pages only if memory pressure */
 
 /* common/generic parameters */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 106e741aa7ee..d247f5457944 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,9 +67,9 @@
 #define MADV_SEQUENTIAL 2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
-#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 6cb8db76fd4e..700d83fd9352 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,9 +40,9 @@
 #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
 #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
 #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
-#define MADV_FREE	8		/* free pages only if memory pressure */
 
 /* common/generic parameters */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1b19f25bc567..77eaca434071 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,9 +80,9 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
-#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 7a94102b7a02..869595947873 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,9 +34,9 @@
 #define MADV_SEQUENTIAL	2		/* expect sequential page references */
 #define MADV_WILLNEED	3		/* will need these pages */
 #define MADV_DONTNEED	4		/* don't need these pages */
-#define MADV_FREE	5		/* free pages only if memory pressure */
 
 /* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE	8		/* free pages only if memory pressure */
 #define MADV_REMOVE	9		/* remove these pages & resources */
 #define MADV_DONTFORK	10		/* don't inherit across fork */
 #define MADV_DOFORK	11		/* do inherit across fork */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 4/8] mm: free swp_entry in madvise_free
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim

When I tested the below piece of code with 12 processes (ie, 512M * 12 = 6G
consumed) on my machine (3G ram + 12 cpu + 8G swap), madvise_free was
significantly slower (ie, about 2x) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
        memset(512M);
        madvise(MADV_FREE or MADV_DONTNEED);
}

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find that the hinted pages were already swapped out when the syscall
is called, it's pointless to keep the swapped-out pages (swap entries) in
the ptes.  Instead, let's free them, because swap-in is more expensive than
(page allocation + zeroing).

With this patch, swapin is reduced from 879,585 to 1,878, and the elapsed
time improves accordingly:

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed
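
For reference, below is a more complete sketch of the test program (my
reconstruction, not the exact source used for the numbers above; the size,
loop count and MADV_FREE fallback define are assumptions).  Running 12
instances concurrently, with and without the "dontneed" argument, should
roughly reproduce the swap pressure described above.

#define _GNU_SOURCE
#include <sys/mman.h>
#include <string.h>

#ifndef MADV_FREE
#define MADV_FREE	8	/* assumed fallback for old headers */
#endif

#define SIZE	(512UL << 20)	/* 512M per process, as described above */

int main(int argc, char **argv)
{
	int loop = 5;
	int advice = (argc > 1 && !strcmp(argv[1], "dontneed")) ?
					MADV_DONTNEED : MADV_FREE;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	while (loop--) {
		memset(p, 1, SIZE);		/* dirty every page */
		madvise(p, SIZE, advice);	/* MADV_FREE or MADV_DONTNEED */
	}
	return 0;
}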

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 640311704e31..663bd9fa0ae0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -270,6 +270,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *pte, ptent;
 	struct page *page;
+	swp_entry_t entry;
+	int nr_swap = 0;
 
 	split_huge_page_pmd(vma, addr, pmd);
 	if (pmd_trans_unstable(pmd))
@@ -280,8 +282,22 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (!pte_present(ptent))
+		if (pte_none(ptent))
 			continue;
+		/*
+		 * If the pte has swp_entry, just clear page table to
+		 * prevent swap-in which is more expensive rather than
+		 * (page allocation + zeroing).
+		 */
+		if (!pte_present(ptent)) {
+			entry = pte_to_swp_entry(ptent);
+			if (non_swap_entry(entry))
+				continue;
+			nr_swap--;
+			free_swap_and_cache(entry);
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			continue;
+		}
 
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
@@ -313,6 +329,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		set_pte_at(mm, addr, pte, ptent);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
+
+	if (nr_swap) {
+		if (current->mm == mm)
+			sync_mm_rss(mm);
+
+		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+	}
+
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 	cond_resched();
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 5/8] mm: move lazily freed pages to inactive list
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim, Wang,
	Yalin

MADV_FREE is a hint that it's okay to discard pages under memory pressure;
we rely on the reclaimers (ie, kswapd and direct reclaim) to actually free
them, so there is no value in keeping them on the active anonymous LRU.
This patch moves them to the head of the inactive LRU list instead.

This means MADV_FREE-ed pages, now living on the inactive list, are
reclaimed earlier, because they are more likely to be cold than recently
active pages.

An arguable point of the approach is whether we should put the pages at the
head or the tail of the inactive list.  I chose the head because the kernel
cannot be sure whether they are really cold or warm for every MADV_FREE use
case, but at least we know they are not *hot*, so landing at the head of
the inactive list is a compromise for the various use cases.

This fixes the suboptimal behavior where MADV_FREE-ed pages sitting on the
active list could stay there for a long time even under memory pressure
while the inactive list was reclaimed heavily.  That basically defeats the
whole purpose of MADV_FREE: helping the system free memory which might not
be used again.
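
To make the intended usage concrete, here is a rough sketch (not code from
this series) of how a user-space allocator might lazily release a free run
of pages.  With this patch, the hinted pages land at the head of the
inactive anon LRU, so they are considered for reclaim before anything still
on the active list:

#include <sys/mman.h>
#include <stddef.h>

#ifndef MADV_FREE
#define MADV_FREE	8	/* assumed fallback define */
#endif

/*
 * Called when a whole run of pages becomes free inside the allocator.
 * Unlike MADV_DONTNEED, the mapping keeps its contents unless reclaim
 * actually discards the pages, so reusing the run needs no syscall: a
 * write simply re-dirties the ptes and cancels the hint.
 */
void run_release_lazy(void *run, size_t len)
{
	madvise(run, len, MADV_FREE);
}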

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  1 +
 mm/madvise.c         |  2 ++
 mm/swap.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..f629df4cc13d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 663bd9fa0ae0..9ee9df8c768d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -327,6 +327,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		ptent = pte_mkold(ptent);
 		ptent = pte_mkclean(ptent);
 		set_pte_at(mm, addr, pte, ptent);
+		if (PageActive(page))
+			deactivate_page(page);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
 
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..d0eacc5f62a3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,6 +45,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -799,6 +800,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_event(PGDEACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -825,6 +843,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	activate_page_drain(cpu);
 }
 
@@ -854,6 +876,26 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+/**
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page.  This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -883,6 +925,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 6/8] mm: lru_deactivate_fn should clear PG_referenced
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim

deactivate_page aims to accelerate reclaim by moving pages from the active
list to the inactive list, so clear PG_referenced as well in service of
that goal.

Acked-by: Hugh Dickins <hughd@google.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/swap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/swap.c b/mm/swap.c
index d0eacc5f62a3..4a6aec976ab1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -810,6 +810,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
 
 		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
 		ClearPageActive(page);
+		ClearPageReferenced(page);
 		add_page_to_lru_list(page, lruvec, lru);
 
 		__count_vm_event(PGDEACTIVATE);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 7/8] mm: clear PG_dirty to mark page freeable
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim

Basically, MADV_FREE relies on the dirty bit in the page table entry to
decide whether the VM is allowed to discard the page or not.  IOW, if the
page table entry has the dirty bit set, the VM must not discard the page.

However, if for example a swap-in happens via a read fault, the page table
entry does not have the dirty bit set, so MADV_FREE could wrongly discard
the page.

To avoid that problem, MADV_FREE does additional checks on PageDirty and
PageSwapCache.  That works because a swapped-in page lives in the swap
cache, and once it is removed from the swap cache the page carries the
PG_dirty flag.  Together, the two page-flag checks effectively prevent
wrong discards by MADV_FREE.

However, a problem with the above logic is that a swapped-in page keeps
PG_dirty even after it has been removed from the swap cache, so the VM can
no longer consider the page freeable, even if madvise_free is called on it
again later.

Look at the example below for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. At this point, the VM cannot discard the page because it
    .. still has *PG_dirty* set

To solve the problem, this patch clears PG_dirty when madvise is called,
but only if the page is owned exclusively by the current process.  PG_dirty
represents pte dirtiness across all processes mapping the page, so we can
clear it only when we own the page exclusively.
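
For clarity, the exclusive-ownership test added below boils down to the
following expected reference counts (a simplification which ignores
transient references such as GUP pins or pages parked in a per-cpu
pagevec):

/*
 *	anon page mapped by this pte only, not in swap cache:
 *		page_count(page) == 1
 *	the same page while it also sits in the swap cache:
 *		page_count(page) == 2 == 1 + !!PageSwapCache(page)
 *
 * Any larger count means another mapping or user may still depend on
 * the data, so PG_dirty is left alone.
 */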

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 9ee9df8c768d..fc24104d6b3a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -303,11 +303,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		if (!page)
 			continue;
 
-		if (PageSwapCache(page)) {
+		if (PageSwapCache(page) || PageDirty(page)) {
 			if (!trylock_page(page))
 				continue;
+			/*
+			 * If page is shared with others, we couldn't clear
+			 * PG_dirty of the page.
+			 */
+			if (page_count(page) != 1 + !!PageSwapCache(page)) {
+				unlock_page(page);
+				continue;
+			}
 
-			if (!try_to_free_swap(page)) {
+			if (PageSwapCache(page) && !try_to_free_swap(page)) {
 				unlock_page(page);
 				continue;
 			}
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 8/8] mm: mark stable page dirty in KSM
  2015-10-30  7:01 ` Minchan Kim
@ 2015-10-30  7:01   ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-10-30  7:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michael Kerrisk, linux-api, Hugh Dickins,
	Johannes Weiner, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Minchan Kim

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

[hughd: adjusted comments]
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/ksm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101eaacdf..18d2b7afecff 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1053,6 +1053,12 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 			 */
 			set_page_stable_node(page, NULL);
 			mark_page_accessed(page);
+			/*
+			 * Page reclaim just frees a clean page with no dirty
+			 * ptes: make sure that the ksm page would be swapped.
+			 */
+			if (!PageDirty(page))
+				SetPageDirty(page);
 			err = 0;
 		} else if (pages_identical(page, kpage))
 			err = replace_page(vma, page, kpage, orig_pte);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 4/8] mm: free swp_entry in madvise_free
  2015-10-30  7:01   ` Minchan Kim
@ 2015-10-30 12:28     ` Michal Hocko
  -1 siblings, 0 replies; 97+ messages in thread
From: Michal Hocko @ 2015-10-30 12:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, yalin.wang2010, Shaohua Li

On Fri 30-10-15 16:01:40, Minchan Kim wrote:
> When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
> consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is siginficat
> slower (ie, 2x times) than madvise_dontneed.
> 
> loop = 5;
> mmap(512M);
> while (loop--) {
>         memset(512M);
>         madvise(MADV_FREE or MADV_DONTNEED);
> }
> 
> The reason is lots of swapin.
> 
> 1) dontneed: 1,612 swapin
> 2) madvfree: 879,585 swapin
> 
> If we find hinted pages were already swapped out when syscall is called,
> it's pointless to keep the swapped-out pages in pte.
> Instead, let's free the cold page because swapin is more expensive
> than (alloc page + zeroing).
> 
> With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time
> 
> 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Yes this makes a lot of sense.

Acked-by: Michal Hocko <mhocko@suse.com>

One nit below.

> ---
>  mm/madvise.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 640311704e31..663bd9fa0ae0 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -270,6 +270,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	spinlock_t *ptl;
>  	pte_t *pte, ptent;
>  	struct page *page;
> +	swp_entry_t entry;

This could go into !pte_present if block
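
Concretely, the suggested narrowing would look something like this (a
sketch, not a hunk from the series):

		if (!pte_present(ptent)) {
			swp_entry_t entry = pte_to_swp_entry(ptent);

			if (non_swap_entry(entry))
				continue;
			nr_swap--;
			free_swap_and_cache(entry);
			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
			continue;
		}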

> +	int nr_swap = 0;
>  
>  	split_huge_page_pmd(vma, addr, pmd);
>  	if (pmd_trans_unstable(pmd))
> @@ -280,8 +282,22 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> -		if (!pte_present(ptent))
> +		if (pte_none(ptent))
>  			continue;
> +		/*
> +		 * If the pte has swp_entry, just clear page table to
> +		 * prevent swap-in which is more expensive rather than
> +		 * (page allocation + zeroing).
> +		 */
> +		if (!pte_present(ptent)) {
> +			entry = pte_to_swp_entry(ptent);
> +			if (non_swap_entry(entry))
> +				continue;
> +			nr_swap--;
> +			free_swap_and_cache(entry);
> +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			continue;
> +		}
>  
>  		page = vm_normal_page(vma, addr, ptent);
>  		if (!page)
> @@ -313,6 +329,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		set_pte_at(mm, addr, pte, ptent);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}
> +
> +	if (nr_swap) {
> +		if (current->mm == mm)
> +			sync_mm_rss(mm);
> +
> +		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
> +	}
> +
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  	cond_resched();
> -- 
> 1.9.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 6/8] mm: lru_deactivate_fn should clear PG_referenced
@ 2015-10-30 12:47     ` Michal Hocko
  0 siblings, 0 replies; 97+ messages in thread
From: Michal Hocko @ 2015-10-30 12:47 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, yalin.wang2010, Shaohua Li

On Fri 30-10-15 16:01:42, Minchan Kim wrote:
> deactivate_page aims for accelerate for reclaiming through
> moving pages from active list to inactive list so we should
> clear PG_referenced for the goal.

I might be missing something but aren't we using PG_referenced only for
pagecache (and shmem) pages?

> 
> Acked-by: Hugh Dickins <hughd@google.com>
> Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/swap.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index d0eacc5f62a3..4a6aec976ab1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -810,6 +810,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
>  
>  		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
>  		ClearPageActive(page);
> +		ClearPageReferenced(page);
>  		add_page_to_lru_list(page, lruvec, lru);
>  
>  		__count_vm_event(PGDEACTIVATE);
> -- 
> 1.9.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 7/8] mm: clear PG_dirty to mark page freeable
@ 2015-10-30 12:55     ` Michal Hocko
  0 siblings, 0 replies; 97+ messages in thread
From: Michal Hocko @ 2015-10-30 12:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, yalin.wang2010, Shaohua Li

On Fri 30-10-15 16:01:43, Minchan Kim wrote:
> Basically, MADV_FREE relies on the dirty bit in the page table entry to
> decide whether the VM is allowed to discard the page. IOW, if the page
> table entry has the dirty bit set, the VM shouldn't discard the page.
> 
> However, if, for example, swap-in happens via a read fault, the page table
> entry doesn't have the dirty bit set, so MADV_FREE could wrongly discard
> the page.
> 
> To avoid the problem, MADV_FREE did additional checks on PageDirty
> and PageSwapCache. That worked because a swapped-in page lives in the
> swap cache, and once it is evicted from the swap cache, the page has the
> PG_dirty flag. So both page-flag checks effectively prevent wrong
> discarding by MADV_FREE.
> 
> However, a problem with the above logic is that a swapped-in page still
> has PG_dirty after it is removed from the swap cache, so the VM cannot
> consider the page freeable any more even if madvise_free is called on it
> in the future.
> 
> Look at the example below for detail.
> 
>     ptr = malloc();
>     memset(ptr);
>     ..
>     ..
>     .. heavy memory pressure so all of pages are swapped out
>     ..
>     ..
>     var = *ptr; -> a page swapped-in and could be removed from
>                    swapcache. Then, page table doesn't mark
>                    dirty bit and page descriptor includes PG_dirty
>     ..
>     ..
>     madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
>     ..
>     ..
>     ..
>     .. heavy memory pressure again.
>     .. In this time, VM cannot discard the page because the page
>     .. has *PG_dirty*
> 
> To solve the problem, this patch clears PG_dirty only if the page is owned
> exclusively by the current process when madvise is called, because PG_dirty
> represents the ptes' dirtiness across several processes, so we can clear it
> only when we own the page exclusively.
> 
> Acked-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/madvise.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9ee9df8c768d..fc24104d6b3a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -303,11 +303,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		if (!page)
>  			continue;
>  
> -		if (PageSwapCache(page)) {
> +		if (PageSwapCache(page) || PageDirty(page)) {
>  			if (!trylock_page(page))
>  				continue;
> +			/*
> +			 * If page is shared with others, we couldn't clear
> +			 * PG_dirty of the page.
> +			 */
> +			if (page_count(page) != 1 + !!PageSwapCache(page)) {
> +				unlock_page(page);
> +				continue;
> +			}
>  
> -			if (!try_to_free_swap(page)) {
> +			if (PageSwapCache(page) && !try_to_free_swap(page)) {
>  				unlock_page(page);
>  				continue;
>  			}
> -- 
> 1.9.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 1/8] mm: support madvise(MADV_FREE)
  2015-10-30  7:01   ` Minchan Kim
@ 2015-10-30 16:49     ` Shaohua Li
  -1 siblings, 0 replies; 97+ messages in thread
From: Shaohua Li @ 2015-10-30 16:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010

On Fri, Oct 30, 2015 at 04:01:37PM +0900, Minchan Kim wrote:
> +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> +				unsigned long end, struct mm_walk *walk)
> +
> +{
> +	struct mmu_gather *tlb = walk->private;
> +	struct mm_struct *mm = tlb->mm;
> +	struct vm_area_struct *vma = walk->vma;
> +	spinlock_t *ptl;
> +	pte_t *pte, ptent;
> +	struct page *page;
> +
> +	split_huge_page_pmd(vma, addr, pmd);
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +
> +	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> +	arch_enter_lazy_mmu_mode();
> +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> +		ptent = *pte;
> +
> +		if (!pte_present(ptent))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (!page)
> +			continue;
> +
> +		if (PageSwapCache(page)) {
> +			if (!trylock_page(page))
> +				continue;
> +
> +			if (!try_to_free_swap(page)) {
> +				unlock_page(page);
> +				continue;
> +			}
> +
> +			ClearPageDirty(page);
> +			unlock_page(page);
> +		}
> +
> +		/*
> +		 * Some of architecture(ex, PPC) don't update TLB
> +		 * with set_pte_at and tlb_remove_tlb_entry so for
> +		 * the portability, remap the pte with old|clean
> +		 * after pte clearing.
> +		 */
> +		ptent = ptep_get_and_clear_full(mm, addr, pte,
> +						tlb->fullmm);
> +		ptent = pte_mkold(ptent);
> +		ptent = pte_mkclean(ptent);
> +		set_pte_at(mm, addr, pte, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);

The original ptent might not be dirty. In that case, the tlb_remove_tlb_entry
is unnecessary, so please add a check. In practice, I saw more TLB flushes with
FREE compared to DONTNEED because of this issue.
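
A minimal sketch of the kind of check being suggested, assuming it replaces
the unconditional pte rewrite in madvise_free_pte_range() above (this is an
illustration only, not the code that was merged):

	/*
	 * Only rewrite the pte and queue a TLB flush when the pte was
	 * actually young or dirty; clean and old ptes are left alone so
	 * MADV_FREE does not force extra flushes.
	 */
	if (pte_young(ptent) || pte_dirty(ptent)) {
		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
		ptent = pte_mkold(ptent);
		ptent = pte_mkclean(ptent);
		set_pte_at(mm, addr, pte, ptent);
		tlb_remove_tlb_entry(tlb, pte, addr);
	}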

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-10-30 17:22     ` Shaohua Li
  0 siblings, 0 replies; 97+ messages in thread
From: Shaohua Li @ 2015-10-30 17:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Wang, Yalin

On Fri, Oct 30, 2015 at 04:01:41PM +0900, Minchan Kim wrote:
> MADV_FREE is a hint that it's okay to discard pages if there is memory
> pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them
> so there is no value keeping them in the active anonymous LRU so this
> patch moves them to inactive LRU list's head.
> 
> This means that MADV_FREE-ed pages which were living on the inactive list
> are reclaimed first because they are more likely to be cold rather than
> recently active pages.
> 
> An arguable issue for the approach would be whether we should put the page
> to the head or tail of the inactive list.  I chose head because the kernel
> cannot make sure it's really cold or warm for every MADV_FREE usecase but
> at least we know it's not *hot*, so landing of inactive head would be a
> compromise for various usecases.
> 
> This fixes suboptimal behavior of MADV_FREE when pages living on the
> active list will sit there for a long time even under memory pressure
> while the inactive list is reclaimed heavily.  This basically breaks the
> whole purpose of using MADV_FREE to help the system free memory which
> might not be used.

My main concern is the policy for how we should treat the FREE pages. Moving them
to the inactive lru is definitely a good start, but I'm wondering if it's enough.
MADV_FREE increases memory pressure and causes unnecessary reclaim because of
the lazy memory free. While MADV_FREE is intended to be a better replacement for
MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it frees
memory immediately. So I hope MADV_FREE doesn't have an impact on memory
pressure either. I'm thinking of adding an extra lru list and watermark for this
to make sure FREE pages can be freed before system-wide page reclaim. As you
said, this is arguable, but I hope we can discuss this issue more.

Or do you want to push this first and address the policy issue later?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/8] MADV_FREE support
@ 2015-11-01  4:51   ` David Rientjes
  0 siblings, 0 replies; 97+ messages in thread
From: David Rientjes @ 2015-11-01  4:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Shaohua Li

On Fri, 30 Oct 2015, Minchan Kim wrote:

> MADV_FREE is on linux-next so long time. The reason was two, I think.
> 
> 1. MADV_FREE code on reclaim path was really mess.
> 
> 2. Andrew really want to see voice of userland people who want to use
>    the syscall.
> 
> A few month ago, Daniel Micay(jemalloc active contributor) requested me
> to make progress upstreaming but I was busy at that time so it took
> so long time for me to revist the code and finally, I clean it up the
> mess recently so it solves the #2 issue.
> 
> As well, Daniel and Jason(jemalloc maintainer) requested it to Andrew
> again recently and they said it would be great to have even though
> it has swap dependency now so Andrew decided he will do that for v4.4.
> 

First, thanks very much for refreshing the patchset and reposting after a 
series of changes have been periodically added to -mm, it makes it much 
easier.

For tcmalloc, we can do some things in the allocator itself to increase 
the amount of memory backed by thp.  Specifically, we can prefer to 
release Spans to pageblocks that are already not backed by thp so there is 
no additional split on each scavenge.  This is somewhat easy if all memory 
is organized into hugepage-aligned pageblocks in the allocator itself.  
Second, we can prefer to release Spans of longer length on each scavenge 
so we can delay scavenging for as long as possible in a hope we can find 
more pages to coalesce.  Third, we can discount refaulted released memory 
from the scavenging period.
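
To make that ordering concrete, here is a toy comparator in qsort(3) style;
"struct span_info" and its fields are hypothetical stand-ins for allocator
internals, not actual tcmalloc code:

	struct span_info {
		unsigned long npages;	/* length of the free run */
		int thp_backed;		/* containing pageblock still backed by thp? */
	};

	/* Prefer releasing runs in non-thp pageblocks, then longer runs. */
	static int scavenge_cmp(const struct span_info *a,
				const struct span_info *b)
	{
		if (a->thp_backed != b->thp_backed)
			return a->thp_backed - b->thp_backed;	/* non-thp first */
		if (a->npages != b->npages)
			return a->npages > b->npages ? -1 : 1;	/* longer first */
		return 0;
	}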

That significantly improves the amount of memory backed by thp for 
tcmalloc.  The problem, however, is that tcmalloc uses MADV_DONTNEED to 
release memory to the system and MADV_FREE wouldn't help at all in a 
swapless environment.

To combat that, I've proposed a new MADV bit that simply caches the 
ranges freed by the allocator per vma and places them on both a per-vma 
and per-memcg list.  During reclaim, this list is iterated and ptes are 
freed after thp split period to the normal directed reclaim.  Without 
memory pressure, this backs 100% of the heap with thp with a relatively 
lightweight kernel change (the majority is vma manipulation on split) and 
a couple line change to tcmalloc.  When pulling memory from the returned 
freelists, the memory that we have MADV_DONTNEED'd, we need to use another 
MADV bit to remove it from this cache, so there is a second madvise(2) 
syscall involved but the freeing call is much less expensive since there 
is no pagetable walk without memory pressure or synchronous thp split.

I've been looking at MADV_FREE to see if there is common ground that could 
be shared, but perhaps it's just easier to ask what your proposed strategy 
is so that tcmalloc users, especially those in swapless environments, 
would benefit from any of your work?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/8] MADV_FREE support
  2015-11-01  4:51   ` David Rientjes
  (?)
  (?)
@ 2015-11-01  6:29   ` Daniel Micay
  2015-11-03  2:23       ` Minchan Kim
  2015-11-04 20:19       ` David Rientjes
  -1 siblings, 2 replies; 97+ messages in thread
From: Daniel Micay @ 2015-11-01  6:29 UTC (permalink / raw)
  To: David Rientjes, Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Kirill A. Shutemov, Michal Hocko, yalin.wang2010, Shaohua Li

On 01/11/15 12:51 AM, David Rientjes wrote:
> On Fri, 30 Oct 2015, Minchan Kim wrote:
> 
>> MADV_FREE is on linux-next so long time. The reason was two, I think.
>>
>> 1. MADV_FREE code on reclaim path was really mess.
>>
>> 2. Andrew really want to see voice of userland people who want to use
>>    the syscall.
>>
>> A few month ago, Daniel Micay(jemalloc active contributor) requested me
>> to make progress upstreaming but I was busy at that time so it took
>> so long time for me to revist the code and finally, I clean it up the
>> mess recently so it solves the #2 issue.
>>
>> As well, Daniel and Jason(jemalloc maintainer) requested it to Andrew
>> again recently and they said it would be great to have even though
>> it has swap dependency now so Andrew decided he will do that for v4.4.
>>
> 
> First, thanks very much for refreshing the patchset and reposting after a 
> series of changes have been periodically added to -mm, it makes it much 
> easier.
> 
> For tcmalloc, we can do some things in the allocator itself to increase 
> the amount of memory backed by thp.  Specifically, we can prefer to 
> release Spans to pageblocks that are already not backed by thp so there is 
> no additional split on each scavenge.  This is somewhat easy if all memory 
> is organized into hugepage-aligned pageblocks in the allocator itself.  
> Second, we can prefer to release Spans of longer length on each scavenge 
> so we can delay scavenging for as long as possible in a hope we can find 
> more pages to coalesce.  Third, we can discount refaulted released memory 
> from the scavenging period.
> 
> That significantly improves the amount of memory backed by thp for 
> tcmalloc.  The problem, however, is that tcmalloc uses MADV_DONTNEED to 
> release memory to the system and MADV_FREE wouldn't help at all in a 
> swapless environment.
> 
> To combat that, I've proposed a new MADV bit that simply caches the 
> ranges freed by the allocator per vma and places them on both a per-vma 
> and per-memcg list.  During reclaim, this list is iterated and ptes are 
> freed after thp split period to the normal directed reclaim.  Without 
> memory pressure, this backs 100% of the heap with thp with a relatively 
> lightweight kernel change (the majority is vma manipulation on split) and 
> a couple line change to tcmalloc.  When pulling memory from the returned 
> freelists, the memory that we have MADV_DONTNEED'd, we need to use another 
> MADV bit to remove it from this cache, so there is a second madvise(2) 
> syscall involved but the freeing call is much less expensive since there 
> is no pagetable walk without memory pressure or synchronous thp split.
> 
> I've been looking at MADV_FREE to see if there is common ground that could 
> be shared, but perhaps it's just easier to ask what your proposed strategy 
> is so that tcmalloc users, especially those in swapless environments, 
> would benefit from any of your work?

The current implementation requires swap because the kernel already has
robust infrastructure for swapping out anonymous memory when there's
memory pressure. The MADV_FREE implementation just has to hook in there
and cause pages to be dropped instead of swapped out. There's no reason
it couldn't be extended to work in swapless environments, but it will
take additional design and implementation work. As a stop-gap, I think
zram and friends will work fine as a form of swap for this.

It can definitely be improved to cooperate well with THP too. I've been
following the progress, and most of the problems seem to have been with
the THP and that's a very active area of development. Seems best to deal
with that after a simple, working implementation lands.

The best aspect of MADV_FREE is that it completely avoids page faults
when there's no memory pressure. Making use of the freed memory only
triggers page faults if the pages had to be dropped because the system
ran out of memory. It also avoids needing to zero the pages. The memory
can also still be freed at any time if there's memory pressure again
even if it's handed out as an allocation until it's actually touched.

The call to madvise still has significant overhead, but it's much
cheaper than MADV_DONTNEED. Allocators will be able to lean on the
kernel to make good decisions rather than implementing lazy freeing
entirely on their own. It should improve performance *and* behavior
under memory pressure since allocators can be more aggressive with it
than MADV_DONTNEED.
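
For what it's worth, the userland side is simple. A minimal sketch of an
allocator's release path, assuming addr/len are page-aligned and that the
installed headers define MADV_FREE (older kernels reject it with EINVAL, so
MADV_DONTNEED stays as the fallback):

	#include <errno.h>
	#include <stddef.h>
	#include <sys/mman.h>

	/*
	 * Sketch only: lazily return a page-aligned run to the kernel.
	 * With MADV_FREE the pages stay mapped and are only reclaimed
	 * under memory pressure; MADV_DONTNEED frees them immediately.
	 */
	static void release_run(void *addr, size_t len)
	{
	#ifdef MADV_FREE
		if (madvise(addr, len, MADV_FREE) == 0)
			return;
		if (errno != EINVAL)
			return;
	#endif
		madvise(addr, len, MADV_DONTNEED);
	}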

A nice future improvement would be landing MADV_FREE_UNDO feature to
allow an attempt to pin the pages in memory again. It would make this
work very well for implementing caches that are dropped under memory
pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and
MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I
think browser vendors would be very interested in it.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
@ 2015-11-02  0:08     ` Hugh Dickins
  0 siblings, 0 replies; 97+ messages in thread
From: Hugh Dickins @ 2015-11-02  0:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, David Miller,
	Darrick J. Wong, Roland Dreier, Jason Evans, Daniel Micay,
	Kirill A. Shutemov, Michal Hocko, yalin.wang2010, Shaohua Li,
	Chen Gang, rth, ink, mattst88, Ralf Baechle, jejb, deller, chris,
	jcmvbkbc, Arnd Bergmann, sparclinux, linux-arch

On Fri, 30 Oct 2015, Minchan Kim wrote:
> From: Chen Gang <gang.chen.5i5j@gmail.com>
> 
> For uapi, need try to let all macros have same value, and MADV_FREE is
> added into main branch recently, so need redefine MADV_FREE for it.
> 
> At present, '8' can be shared with all architectures, so redefine it to
> '8'.
> 
> Cc: rth@twiddle.net <rth@twiddle.net>,
> Cc: ink@jurassic.park.msu.ru <ink@jurassic.park.msu.ru>
> Cc: mattst88@gmail.com <mattst88@gmail.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Cc: jejb@parisc-linux.org <jejb@parisc-linux.org>
> Cc: deller@gmx.de <deller@gmx.de>
> Cc: chris@zankel.net <chris@zankel.net>
> Cc: jcmvbkbc@gmail.com <jcmvbkbc@gmail.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: linux-arch@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>

Let me add
Acked-by: Hugh Dickins <hughd@google.com>
to this one too.

But I have extended your mail's Cc list: Darrick pointed out earlier
that dietlibc has a Solaris #define MADV_FREE 0x5 in its mman.h,
and that was in the kernel's sparc mman.h up until 2.6.25.  I doubt
that presents any obstacle nowadays, but Dave Miller should be Cc'ed.

I was a little suspicious that 8 is available for MADV_FREE: why did
the common/generic parameters start at 9 instead of 8 back in 2.6.16?
I think the answer is that we had MADV_REMOVE coming in from one
direction, and MADV_DONTFORK coming from another direction, and when
Roland looked for where to start the commons for MADV_DONTFORK, it
appeared that 8 was occupied - by MADV_REMOVE; then a little later
MADV_REMOVE was shifted to become the first of the commons, at 9.

Hugh
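
(A userland-side sketch, not part of this patchset: since older or non-Linux
headers may still carry the Solaris-style MADV_FREE value that Darrick noted,
an allocator could guard the constant and probe the running kernel at run
time. The value 8 is the one this patch settles on; the helper name and the
4K probe size are made up for illustration.)

#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* value chosen by this patch for every architecture */
#endif

/* Returns 1 if the running kernel accepts MADV_FREE, 0 otherwise. */
static int madv_free_supported(void)
{
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ok;

	if (p == MAP_FAILED)
		return 0;
	ok = madvise(p, 4096, MADV_FREE) == 0;	/* EINVAL on older kernels */
	munmap(p, 4096);
	return ok;
}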

> ---
>  arch/alpha/include/uapi/asm/mman.h     | 2 +-
>  arch/mips/include/uapi/asm/mman.h      | 2 +-
>  arch/parisc/include/uapi/asm/mman.h    | 2 +-
>  arch/xtensa/include/uapi/asm/mman.h    | 2 +-
>  include/uapi/asm-generic/mman-common.h | 2 +-
>  5 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 836fbd44f65b..0b8a5de7aee3 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -44,9 +44,9 @@
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define	MADV_SPACEAVAIL	5		/* ensure resources are available */
>  #define MADV_DONTNEED	6		/* don't need these pages */
> -#define MADV_FREE	7		/* free pages only if memory pressure */
>  
>  /* common/generic parameters */
> +#define MADV_FREE	8		/* free pages only if memory pressure */
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index 106e741aa7ee..d247f5457944 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -67,9 +67,9 @@
>  #define MADV_SEQUENTIAL 2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> -#define MADV_FREE	5		/* free pages only if memory pressure */
>  
>  /* common parameters: try to keep these consistent across architectures */
> +#define MADV_FREE	8		/* free pages only if memory pressure */
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 6cb8db76fd4e..700d83fd9352 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -40,9 +40,9 @@
>  #define MADV_SPACEAVAIL 5               /* insure that resources are reserved */
>  #define MADV_VPS_PURGE  6               /* Purge pages from VM page cache */
>  #define MADV_VPS_INHERIT 7              /* Inherit parents page size */
> -#define MADV_FREE	8		/* free pages only if memory pressure */
>  
>  /* common/generic parameters */
> +#define MADV_FREE	8		/* free pages only if memory pressure */
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1b19f25bc567..77eaca434071 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -80,9 +80,9 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> -#define MADV_FREE	5		/* free pages only if memory pressure */
>  
>  /* common parameters: try to keep these consistent across architectures */
> +#define MADV_FREE	8		/* free pages only if memory pressure */
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 7a94102b7a02..869595947873 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -34,9 +34,9 @@
>  #define MADV_SEQUENTIAL	2		/* expect sequential page references */
>  #define MADV_WILLNEED	3		/* will need these pages */
>  #define MADV_DONTNEED	4		/* don't need these pages */
> -#define MADV_FREE	5		/* free pages only if memory pressure */
>  
>  /* common parameters: try to keep these consistent across architectures */
> +#define MADV_FREE	8		/* free pages only if memory pressure */
>  #define MADV_REMOVE	9		/* remove these pages & resources */
>  #define MADV_DONTFORK	10		/* don't inherit across fork */
>  #define MADV_DOFORK	11		/* do inherit across fork */
> -- 
> 1.9.1
> 
> 


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 1/8] mm: support madvise(MADV_FREE)
@ 2015-11-03  0:10       ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  0:10 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010

On Fri, Oct 30, 2015 at 09:49:37AM -0700, Shaohua Li wrote:
> On Fri, Oct 30, 2015 at 04:01:37PM +0900, Minchan Kim wrote:
> > +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > +				unsigned long end, struct mm_walk *walk)
> > +
> > +{
> > +	struct mmu_gather *tlb = walk->private;
> > +	struct mm_struct *mm = tlb->mm;
> > +	struct vm_area_struct *vma = walk->vma;
> > +	spinlock_t *ptl;
> > +	pte_t *pte, ptent;
> > +	struct page *page;
> > +
> > +	split_huge_page_pmd(vma, addr, pmd);
> > +	if (pmd_trans_unstable(pmd))
> > +		return 0;
> > +
> > +	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > +	arch_enter_lazy_mmu_mode();
> > +	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > +		ptent = *pte;
> > +
> > +		if (!pte_present(ptent))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (!page)
> > +			continue;
> > +
> > +		if (PageSwapCache(page)) {
> > +			if (!trylock_page(page))
> > +				continue;
> > +
> > +			if (!try_to_free_swap(page)) {
> > +				unlock_page(page);
> > +				continue;
> > +			}
> > +
> > +			ClearPageDirty(page);
> > +			unlock_page(page);
> > +		}
> > +
> > +		/*
> > +		 * Some of architecture(ex, PPC) don't update TLB
> > +		 * with set_pte_at and tlb_remove_tlb_entry so for
> > +		 * the portability, remap the pte with old|clean
> > +		 * after pte clearing.
> > +		 */
> > +		ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +						tlb->fullmm);
> > +		ptent = pte_mkold(ptent);
> > +		ptent = pte_mkclean(ptent);
> > +		set_pte_at(mm, addr, pte, ptent);
> > +		tlb_remove_tlb_entry(tlb, pte, addr);
> 
> The orginal ptent might not be dirty. In that case, the tlb_remove_tlb_entry
> is unnecessary, so please add a check. In practice, I saw more TLB flush with
> FREE compared to DONTNEED because of this issue.

Actually, it was on my TODO list but I forgot it. :(
I have fixed it for the new version.
Thanks for pointing it out.
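
For illustration, a sketch of the kind of check being discussed -- a fragment
of the loop quoted above that skips the pte rewrite and TLB entry when the
entry is already old and clean (an assumption about the shape of the fix, not
the exact patch that was queued):

		/*
		 * Only rewrite the pte and queue a TLB flush when the
		 * entry is actually young or dirty; a clean, old entry
		 * can be left as-is, avoiding the extra flushes Shaohua
		 * observed relative to MADV_DONTNEED.
		 */
		if (pte_young(ptent) || pte_dirty(ptent)) {
			ptent = ptep_get_and_clear_full(mm, addr, pte,
							tlb->fullmm);
			ptent = pte_mkold(ptent);
			ptent = pte_mkclean(ptent);
			set_pte_at(mm, addr, pte, ptent);
			tlb_remove_tlb_entry(tlb, pte, addr);
		}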

> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-11-03  0:52       ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  0:52 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Wang, Yalin

On Fri, Oct 30, 2015 at 10:22:12AM -0700, Shaohua Li wrote:
> On Fri, Oct 30, 2015 at 04:01:41PM +0900, Minchan Kim wrote:
> > MADV_FREE is a hint that it's okay to discard pages if there is memory
> > pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them
> > so there is no value keeping them in the active anonymous LRU so this
> > patch moves them to inactive LRU list's head.
> > 
> > This means that MADV_FREE-ed pages which were living on the inactive list
> > are reclaimed first because they are more likely to be cold rather than
> > recently active pages.
> > 
> > An arguable issue for the approach would be whether we should put the page
> > to the head or tail of the inactive list.  I chose head because the kernel
> > cannot make sure it's really cold or warm for every MADV_FREE usecase but
> > at least we know it's not *hot*, so landing of inactive head would be a
> > comprimise for various usecases.
> > 
> > This fixes suboptimal behavior of MADV_FREE when pages living on the
> > active list will sit there for a long time even under memory pressure
> > while the inactive list is reclaimed heavily.  This basically breaks the
> > whole purpose of using MADV_FREE to help the system to free memory which
> > is might not be used.
> 
> My main concern is the policy how we should treat the FREE pages. Moving it to
> inactive lru is definitionly a good start, I'm wondering if it's enough. The
> MADV_FREE increases memory pressure and cause unnecessary reclaim because of
> the lazy memory free. While MADV_FREE is intended to be a better replacement of
> MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
> memory immediately. So I hope the MADV_FREE doesn't have impact on memory
> pressure too. I'm thinking of adding an extra lru list and wartermark for this
> to make sure FREE pages can be freed before system wide page reclaim. As you
> said, this is arguable, but I hope we can discuss about this issue more.

Yes, it's arguable. ;-)

It seems the divergence comes from treating MADV_FREE as a *replacement* for
MADV_DONTNEED. But I don't think so. If we could discard MADV_FREEed pages
*anytime*, I would agree, but that's not true because the pages may be in a
dirty state by the time the VM wants to reclaim them.

I'm also against your suggestion of discarding FREEed pages before system-wide
page reclaim, because the system could have lots of clean cold page-cache or
anonymous pages; in such a case, reclaiming those would be better.
Yes, it's really workload-dependent, so we might need some heuristic, which is
normally what we want to avoid.

Having said that, I agree with you that we could do better than deactivation,
and frankly speaking, I'm thinking of another LRU list (tentatively named the
"ezreclaim LRU list"). What I have in mind is to age (anon|file|ez) fairly.
IOW, I want to percolate ez-LRU reclaiming into get_scan_count.
When MADV_FREE is called, we could move the hinted pages from the anon LRU to
the ez LRU, and then, if the VM finds it cannot discard a page on the ez LRU,
it could promote it to the active anon LRU, which would be a very natural
aging concept because it means someone touched the page recently.

With that, I don't want to bias toward one side or add a knob for tuning the
heuristic; let's rely on the VM's common fair aging scheme.

Another bonus of the new LRU list is that we could support MADV_FREE on
swapless systems.

> 
> Or do you want to push this first and address the policy issue later?

I believe adding a new LRU list would be controversial (ie, not trivial) from
the maintainers' POV even though the code wouldn't be complicated.
So, before diving into that, I want to see problems in *real practice*, not
just in a theoretical test program.
To hear that kind of request, we need to release the syscall.
So, I want to push this first.

> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 4/8] mm: free swp_entry in madvise_free
@ 2015-11-03  0:53       ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  0:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, yalin.wang2010, Shaohua Li

On Fri, Oct 30, 2015 at 01:28:14PM +0100, Michal Hocko wrote:
> On Fri 30-10-15 16:01:40, Minchan Kim wrote:
> > When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
> > consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is siginficat
> > slower (ie, 2x times) than madvise_dontneed.
> > 
> > loop = 5;
> > mmap(512M);
> > while (loop--) {
> >         memset(512M);
> >         madvise(MADV_FREE or MADV_DONTNEED);
> > }
> > 
> > The reason is lots of swapin.
> > 
> > 1) dontneed: 1,612 swapin
> > 2) madvfree: 879,585 swapin
> > 
> > If we find hinted pages were already swapped out when syscall is called,
> > it's pointless to keep the swapped-out pages in pte.
> > Instead, let's free the cold page because swapin is more expensive
> > than (alloc page + zeroing).
> > 
> > With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time
> > 
> > 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> > 
> > Acked-by: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Yes this makes a lot of sense.
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> 
> One nit below.
> 
> > ---
> >  mm/madvise.c | 26 +++++++++++++++++++++++++-
> >  1 file changed, 25 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 640311704e31..663bd9fa0ae0 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -270,6 +270,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  	spinlock_t *ptl;
> >  	pte_t *pte, ptent;
> >  	struct page *page;
> > +	swp_entry_t entry;
> 
> This could go into !pte_present if block

Sure, I fixed it.
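
For reference, the benchmark described in the quoted changelog, written out as
a compilable sketch (the MADV_FREE value is the one set by the uapi patch
earlier in this series; everything else is illustrative):

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8
#endif

#define SIZE	(512UL << 20)	/* 512M, as in the changelog */

int main(void)
{
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int loop = 5;

	if (p == MAP_FAILED)
		return 1;

	while (loop--) {
		memset(p, 1, SIZE);		/* dirty every page */
		madvise(p, SIZE, MADV_FREE);	/* or MADV_DONTNEED */
	}
	return 0;
}

In the quoted test, 12 such processes run concurrently on a 3G RAM / 8G swap
machine, which is what drives the swapin counts above.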

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 6/8] mm: lru_deactivate_fn should clear PG_referenced
  2015-10-30 12:47     ` Michal Hocko
@ 2015-11-03  1:10       ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  1:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, yalin.wang2010, Shaohua Li

On Fri, Oct 30, 2015 at 01:47:11PM +0100, Michal Hocko wrote:
> On Fri 30-10-15 16:01:42, Minchan Kim wrote:
> > deactivate_page aims for accelerate for reclaiming through
> > moving pages from active list to inactive list so we should
> > clear PG_referenced for the goal.
> 
> I might be missing something but aren't we using PG_referenced only for
> pagecache (and shmem) pages?

You aren't missing anything. Pages which are candidates for MADV_FREEing
(ie, normal anonymous pages, not shmem or tmpfs) shouldn't have PG_referenced
set. Even when normal anonymous pages do have it, the VM doesn't respect it.
One thing I suspect is GUP with FOLL_TOUCH, which calls mark_page_accessed
on anonymous pages and will set PG_referenced.
Technically, it's not a problem, but I just wanted to note it here.

The primary reason was that I want to make deactivate_page *general* so it
could be used for file pages as well as anon pages in the future.
But at the moment the only user of deactivate_page is MADV_FREE, so it might
be better to merge the anon-page deactivation logic into
deactivate_file_page and rename that as a general "deactivate_page",
if you think that's better.

> 
> > 
> > Acked-by: Hugh Dickins <hughd@google.com>
> > Suggested-by: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/swap.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index d0eacc5f62a3..4a6aec976ab1 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -810,6 +810,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> >  
> >  		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> >  		ClearPageActive(page);
> > +		ClearPageReferenced(page);
> >  		add_page_to_lru_list(page, lruvec, lru);
> >  
> >  		__count_vm_event(PGDEACTIVATE);
> > -- 
> > 1.9.1
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/8] MADV_FREE support
@ 2015-11-03  2:23       ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  2:23 UTC (permalink / raw)
  To: Daniel Micay
  Cc: David Rientjes, Andrew Morton, linux-kernel, linux-mm,
	Michael Kerrisk, linux-api, Hugh Dickins, Johannes Weiner,
	zhangyanfei, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Shaohua Li

On Sun, Nov 01, 2015 at 01:29:45AM -0500, Daniel Micay wrote:
> On 01/11/15 12:51 AM, David Rientjes wrote:
> > On Fri, 30 Oct 2015, Minchan Kim wrote:
> > 
> >> MADV_FREE is on linux-next so long time. The reason was two, I think.
> >>
> >> 1. MADV_FREE code on reclaim path was really mess.
> >>
> >> 2. Andrew really want to see voice of userland people who want to use
> >>    the syscall.
> >>
> >> A few month ago, Daniel Micay(jemalloc active contributor) requested me
> >> to make progress upstreaming but I was busy at that time so it took
> >> so long time for me to revist the code and finally, I clean it up the
> >> mess recently so it solves the #2 issue.
> >>
> >> As well, Daniel and Jason(jemalloc maintainer) requested it to Andrew
> >> again recently and they said it would be great to have even though
> >> it has swap dependency now so Andrew decided he will do that for v4.4.
> >>
> > 
> > First, thanks very much for refreshing the patchset and reposting after a 
> > series of changes have been periodically added to -mm, it makes it much 
> > easier.
> > 
> > For tcmalloc, we can do some things in the allocator itself to increase 
> > the amount of memory backed by thp.  Specifically, we can prefer to 
> > release Spans to pageblocks that are already not backed by thp so there is 
> > no additional split on each scavenge.  This is somewhat easy if all memory 
> > is organized into hugepage-aligned pageblocks in the allocator itself.  
> > Second, we can prefer to release Spans of longer length on each scavenge 
> > so we can delay scavenging for as long as possible in a hope we can find 
> > more pages to coalesce.  Third, we can discount refaulted released memory 
> > from the scavenging period.
> > 
> > That significantly improves the amount of memory backed by thp for 
> > tcmalloc.  The problem, however, is that tcmalloc uses MADV_DONTNEED to 
> > release memory to the system and MADV_FREE wouldn't help at all in a 
> > swapless environment.
> > 
> > To combat that, I've proposed a new MADV bit that simply caches the 
> > ranges freed by the allocator per vma and places them on both a per-vma 
> > and per-memcg list.  During reclaim, this list is iterated and ptes are 
> > freed after thp split period to the normal directed reclaim.  Without 
> > memory pressure, this backs 100% of the heap with thp with a relatively 
> > lightweight kernel change (the majority is vma manipulation on split) and 
> > a couple line change to tcmalloc.  When pulling memory from the returned 
> > freelists, the memory that we have MADV_DONTNEED'd, we need to use another 
> > MADV bit to remove it from this cache, so there is a second madvise(2) 
> > syscall involved but the freeing call is much less expensive since there 
> > is no pagetable walk without memory pressure or synchronous thp split.
> > 
> > I've been looking at MADV_FREE to see if there is common ground that could 
> > be shared, but perhaps it's just easier to ask what your proposed strategy 
> > is so that tcmalloc users, especially those in swapless environments, 
> > would benefit from any of your work?
> 
> The current implementation requires swap because the kernel already has
> robust infrastructure for swapping out anonymous memory when there's
> memory pressure. The MADV_FREE implementation just has to hook in there
> and cause pages to be dropped instead of swapped out. There's no reason
> it couldn't be extended to work in swapless environments, but it will
> take additional design and implementation work. As a stop-gap, I think

Yes, I have two ideas to support swapless systems.

The first one I sent a few months ago but it didn't receive enough comment:
https://lkml.org/lkml/2015/2/24/71

For the second one, we could add a new LRU list which holds just the
MADV_FREEed hinted pages so the VM can age them fairly against the other LRU
lists. It might be a better policy, but it needs a larger amount of change in
MM, so I want to hear from userland people once they start using the syscall.

> zram and friends will work fine as a form of swap for this.
> 
> It can definitely be improved to cooperate well with THP too. I've been
> following the progress, and most of the problems seem to have been with
> the THP and that's a very active area of development. Seems best to deal
> with that after a simple, working implementation lands.

I already have a patch which splits the THP page lazily in the reclaim path,
not in syscall context. The patch itself is really simple, but THP is
sometimes very subtle and is changing heavily, so I didn't want to make
noise this time. If anyone really needs it now,
I am happy to send it.

> 
> The best aspect of MADV_FREE is that it completely avoids page faults
> when there's no memory pressure. Making use of the freed memory only
> triggers page faults if the pages had to be dropped because the system
> ran out of memory. It also avoids needing to zero the pages. The memory
> can also still be freed at any time if there's memory pressure again
> even if it's handed out as an allocation until it's actually touched.
> 
> The call to madvise still has significant overhead, but it's much
> cheaper than MADV_DONTNEED. Allocators will be able to lean on the
> kernel to make good decisions rather than implementing lazy freeing
> entirely on their own. It should improve performance *and* behavior
> under memory pressure since allocators can be more aggressive with it
> than MADV_DONTNEED.
> 
> A nice future improvement would be landing MADV_FREE_UNDO feature to
> allow an attempt to pin the pages in memory again. It would make this
> work very well for implementing caches that are dropped under memory
> pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and
> MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I
> think browser vendors would be very interested in it.
> 



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/8] MADV_FREE support
@ 2015-11-03  2:23       ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  2:23 UTC (permalink / raw)
  To: Daniel Micay
  Cc: David Rientjes, Andrew Morton,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Michael Kerrisk,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Hugh Dickins, Johannes Weiner,
	zhangyanfei-BthXqXjhjHXQFUHtdCDX3A, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Kirill A. Shutemov, Michal Hocko,
	yalin.wang2010-Re5JQEeQqe8AvxtiuMwx3w, Shaohua Li

On Sun, Nov 01, 2015 at 01:29:45AM -0500, Daniel Micay wrote:
> On 01/11/15 12:51 AM, David Rientjes wrote:
> > On Fri, 30 Oct 2015, Minchan Kim wrote:
> > 
> >> MADV_FREE is on linux-next so long time. The reason was two, I think.
> >>
> >> 1. MADV_FREE code on reclaim path was really mess.
> >>
> >> 2. Andrew really want to see voice of userland people who want to use
> >>    the syscall.
> >>
> >> A few month ago, Daniel Micay(jemalloc active contributor) requested me
> >> to make progress upstreaming but I was busy at that time so it took
> >> so long time for me to revist the code and finally, I clean it up the
> >> mess recently so it solves the #2 issue.
> >>
> >> As well, Daniel and Jason(jemalloc maintainer) requested it to Andrew
> >> again recently and they said it would be great to have even though
> >> it has swap dependency now so Andrew decided he will do that for v4.4.
> >>
> > 
> > First, thanks very much for refreshing the patchset and reposting after a 
> > series of changes have been periodically added to -mm, it makes it much 
> > easier.
> > 
> > For tcmalloc, we can do some things in the allocator itself to increase 
> > the amount of memory backed by thp.  Specifically, we can prefer to 
> > release Spans to pageblocks that are already not backed by thp so there is 
> > no additional split on each scavenge.  This is somewhat easy if all memory 
> > is organized into hugepage-aligned pageblocks in the allocator itself.  
> > Second, we can prefer to release Spans of longer length on each scavenge 
> > so we can delay scavenging for as long as possible in a hope we can find 
> > more pages to coalesce.  Third, we can discount refaulted released memory 
> > from the scavenging period.
> > 
> > That significantly improves the amount of memory backed by thp for 
> > tcmalloc.  The problem, however, is that tcmalloc uses MADV_DONTNEED to 
> > release memory to the system and MADV_FREE wouldn't help at all in a 
> > swapless environment.
> > 
> > To combat that, I've proposed a new MADV bit that simply caches the 
> > ranges freed by the allocator per vma and places them on both a per-vma 
> > and per-memcg list.  During reclaim, this list is iterated and ptes are 
> > freed after thp split period to the normal directed reclaim.  Without 
> > memory pressure, this backs 100% of the heap with thp with a relatively 
> > lightweight kernel change (the majority is vma manipulation on split) and 
> > a couple line change to tcmalloc.  When pulling memory from the returned 
> > freelists, the memory that we have MADV_DONTNEED'd, we need to use another 
> > MADV bit to remove it from this cache, so there is a second madvise(2) 
> > syscall involved but the freeing call is much less expensive since there 
> > is no pagetable walk without memory pressure or synchronous thp split.
> > 
> > I've been looking at MADV_FREE to see if there is common ground that could 
> > be shared, but perhaps it's just easier to ask what your proposed strategy 
> > is so that tcmalloc users, especially those in swapless environments, 
> > would benefit from any of your work?
> 
> The current implementation requires swap because the kernel already has
> robust infrastructure for swapping out anonymous memory when there's
> memory pressure. The MADV_FREE implementation just has to hook in there
> and cause pages to be dropped instead of swapped out. There's no reason
> it couldn't be extended to work in swapless environments, but it will
> take additional design and implementation work. As a stop-gap, I think

Yes, I have two ideas to support swapless systems.

The first one I sent a few months ago, but it didn't receive enough comments.
https://lkml.org/lkml/2015/2/24/71

For the second one, we could add a new LRU list which holds only the
MADV_FREE-hinted pages so the VM can age them fairly against the other
LRU lists. It might be a better policy, but it needs a larger amount of
change in MM, so I want to hear from userland people once they start to
use the syscall.

> zram and friends will work fine as a form of swap for this.
> 
> It can definitely be improved to cooperate well with THP too. I've been
> following the progress, and most of the problems seem to have been with
> the THP and that's a very active area of development. Seems best to deal
> with that after a simple, working implementation lands.

I already have a patch which splits the THP page lazily in the reclaim
path, not in the syscall context. The patch itself is really simple, but
THP is sometimes very subtle and is changing heavily, so I didn't want to
make noise this time. If anyone really needs it now, I am happy to send it.

> 
> The best aspect of MADV_FREE is that it completely avoids page faults
> when there's no memory pressure. Making use of the freed memory only
> triggers page faults if the pages had to be dropped because the system
> ran out of memory. It also avoids needing to zero the pages. The memory
> can also still be freed at any time if there's memory pressure again
> even if it's handed out as an allocation until it's actually touched.
> 
> The call to madvise still has significant overhead, but it's much
> cheaper than MADV_DONTNEED. Allocators will be able to lean on the
> kernel to make good decisions rather than implementing lazy freeing
> entirely on their own. It should improve performance *and* behavior
> under memory pressure since allocators can be more aggressive with it
> than MADV_DONTNEED.
> 
> A nice future improvement would be landing MADV_FREE_UNDO feature to
> allow an attempt to pin the pages in memory again. It would make this
> work very well for implementing caches that are dropped under memory
> pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and
> MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I
> think browser vendors would be very interested in it.
> 
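
To make the usage pattern described above concrete, here is a minimal
userspace sketch, not taken from the patchset: the hard-coded value 8
comes from patch 3/8 and the release_run() helper is made up for
illustration; a real allocator would rely on the libc-provided MADV_FREE
definition once it exists.

#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* assumed value, from patch 3/8 of this series */
#endif

/* Lazily release a run of pages; fall back to eager freeing on old kernels. */
static void release_run(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_FREE) == 0)
                return;                 /* pages dropped only under pressure */
        if (errno == EINVAL)
                madvise(addr, len, MADV_DONTNEED);
}

int main(void)
{
        size_t len = 1UL << 20;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        memset(p, 0xaa, len);   /* dirty the pages */
        release_run(p, len);    /* hint: contents are no longer needed */
        p[0] = 1;               /* reuse: no fault unless reclaim dropped the page */
        printf("reused first byte: %d\n", p[0]);
        munmap(p, len);
        return 0;
}

On a kernel without MADV_FREE the madvise() call fails with EINVAL and
the helper falls back to MADV_DONTNEED, so the sketch behaves safely
either way.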

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
  2015-11-02  0:08     ` Hugh Dickins
@ 2015-11-03  2:32       ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  2:32 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Johannes Weiner, zhangyanfei, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, David Miller, Darrick J. Wong,
	Roland Dreier, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Chen Gang, rth, ink,
	mattst88, Ralf Baechle, jejb, deller, chris, jcmvbkbc,
	Arnd Bergmann, sparclinux, linux-arch

On Sun, Nov 01, 2015 at 04:08:27PM -0800, Hugh Dickins wrote:
> On Fri, 30 Oct 2015, Minchan Kim wrote:
> > From: Chen Gang <gang.chen.5i5j@gmail.com>
> > 
> > For uapi, need try to let all macros have same value, and MADV_FREE is
> > added into main branch recently, so need redefine MADV_FREE for it.
> > 
> > At present, '8' can be shared with all architectures, so redefine it to
> > '8'.
> > 
> > Cc: rth@twiddle.net <rth@twiddle.net>,
> > Cc: ink@jurassic.park.msu.ru <ink@jurassic.park.msu.ru>
> > Cc: mattst88@gmail.com <mattst88@gmail.com>
> > Cc: Ralf Baechle <ralf@linux-mips.org>
> > Cc: jejb@parisc-linux.org <jejb@parisc-linux.org>
> > Cc: deller@gmx.de <deller@gmx.de>
> > Cc: chris@zankel.net <chris@zankel.net>
> > Cc: jcmvbkbc@gmail.com <jcmvbkbc@gmail.com>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: linux-arch@vger.kernel.org
> > Cc: linux-api@vger.kernel.org
> > Acked-by: Minchan Kim <minchan@kernel.org>
> > Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
> 
> Let me add
> Acked-by: Hugh Dickins <hughd@google.com>
> to this one too.
> 
> But I have extended your mail's Cc list: Darrick pointed out earlier
> that dietlibc has a Solaris #define MADV_FREE 0x5 in its mman.h,
> and that was in the kernel's sparc mman.h up until 2.6.25.  I doubt
> that presents any obstacle nowadays, but Dave Miller should be Cc'ed.
> 
> I was a little suspicious that 8 is available for MADV_FREE: why did
> the common/generic parameters start at 9 instead of 8 back in 2.6.16?
> I think the answer is that we had MADV_REMOVE coming in from one
> direction, and MADV_DONTFORK coming from another direction, and when
> Roland looked for where to start the commons for MADV_DONTFORK, it
> appeared that 8 was occupied - by MADV_REMOVE; then a little later
> MADV_REMOVE was shifted to become the first of the commons, at 9.

Thanks for the Ack, for Cc'ing the relevant people, and for the history!
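
For reference, the single definition this patch converges on would read
roughly as follows in the generic uapi header, with the architectures
that carry their own mman.h switched to the same number (an illustrative
excerpt, not the full diff):

/* include/uapi/asm-generic/mman-common.h */
#define MADV_FREE       8               /* free pages only if memory pressure */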

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
  2015-11-03  2:32       ` Minchan Kim
@ 2015-11-03  2:36         ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  2:36 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Johannes Weiner, zhangyanfei, Rik van Riel,
	Mel Gorman, KOSAKI Motohiro, David Miller, Darrick J. Wong,
	Roland Dreier, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Chen Gang, rth, ink,
	mattst88, Ralf Baechle, jejb, deller, chris, jcmvbkbc,
	Arnd Bergmann, sparclinux, linux-arch

On Tue, Nov 03, 2015 at 11:32:51AM +0900, Minchan Kim wrote:
> On Sun, Nov 01, 2015 at 04:08:27PM -0800, Hugh Dickins wrote:
> > On Fri, 30 Oct 2015, Minchan Kim wrote:
> > > From: Chen Gang <gang.chen.5i5j@gmail.com>
> > > 
> > > For uapi, need try to let all macros have same value, and MADV_FREE is
> > > added into main branch recently, so need redefine MADV_FREE for it.
> > > 
> > > At present, '8' can be shared with all architectures, so redefine it to
> > > '8'.
> > > 
> > > Cc: rth@twiddle.net <rth@twiddle.net>,
> > > Cc: ink@jurassic.park.msu.ru <ink@jurassic.park.msu.ru>
> > > Cc: mattst88@gmail.com <mattst88@gmail.com>
> > > Cc: Ralf Baechle <ralf@linux-mips.org>
> > > Cc: jejb@parisc-linux.org <jejb@parisc-linux.org>
> > > Cc: deller@gmx.de <deller@gmx.de>
> > > Cc: chris@zankel.net <chris@zankel.net>
> > > Cc: jcmvbkbc@gmail.com <jcmvbkbc@gmail.com>
> > > Cc: Arnd Bergmann <arnd@arndb.de>
> > > Cc: linux-arch@vger.kernel.org
> > > Cc: linux-api@vger.kernel.org
> > > Acked-by: Minchan Kim <minchan@kernel.org>
> > > Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
> > 
> > Let me add
> > Acked-by: Hugh Dickins <hughd@google.com>
> > to this one too.
> > 
> > But I have extended your mail's Cc list: Darrick pointed out earlier
> > that dietlibc has a Solaris #define MADV_FREE 0x5 in its mman.h,
> > and that was in the kernel's sparc mman.h up until 2.6.25.  I doubt
> > that presents any obstacle nowadays, but Dave Miller should be Cc'ed.

For Dave's convenience, I found this commit.

commit ec98c6b9b47df6df1c1fa6cf3d427414f8c2cf16
Author: David S. Miller <davem@davemloft.net>
Date:   Sun Apr 20 02:14:23 2008 -0700

    [SPARC]: Remove SunOS and Solaris binary support.
    
    As per Documentation/feature-removal-schedule.txt
    
    Signed-off-by: David S. Miller <davem@davemloft.net>

Hello Dave,
Could you confirm it?

Thanks.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
  2015-11-03  2:36         ` Minchan Kim
@ 2015-11-03  3:36           ` David Miller
  -1 siblings, 0 replies; 97+ messages in thread
From: David Miller @ 2015-11-03  3:36 UTC (permalink / raw)
  To: minchan
  Cc: hughd, akpm, linux-kernel, linux-mm, mtk.manpages, linux-api,
	hannes, zhangyanfei, riel, mgorman, kosaki.motohiro,
	darrick.wong, roland, je, danielmicay, kirill, mhocko,
	yalin.wang2010, shli, gang.chen.5i5j, rth, ink, mattst88, ralf,
	jejb, deller, chris, jcmvbkbc, arnd, sparclinux, linux-arch

From: Minchan Kim <minchan@kernel.org>
Date: Tue, 3 Nov 2015 11:36:51 +0900

> For the convenience for Dave, I found this.
> 
> commit ec98c6b9b47df6df1c1fa6cf3d427414f8c2cf16
> Author: David S. Miller <davem@davemloft.net>
> Date:   Sun Apr 20 02:14:23 2008 -0700
> 
>     [SPARC]: Remove SunOS and Solaris binary support.
>     
>     As per Documentation/feature-removal-schedule.txt
>     
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> Hello Dave,
> Could you confirm it?

I don't understand what you want me to confirm.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures
  2015-11-03  3:36           ` David Miller
@ 2015-11-03  4:31             ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-03  4:31 UTC (permalink / raw)
  To: David Miller
  Cc: hughd, akpm, linux-kernel, linux-mm, mtk.manpages, linux-api,
	hannes, zhangyanfei, riel, mgorman, kosaki.motohiro,
	darrick.wong, roland, je, danielmicay, kirill, mhocko,
	yalin.wang2010, shli, gang.chen.5i5j, rth, ink, mattst88, ralf,
	jejb, deller, chris, jcmvbkbc, arnd, sparclinux, linux-arch

On Mon, Nov 02, 2015 at 10:36:52PM -0500, David Miller wrote:
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 3 Nov 2015 11:36:51 +0900
> 
> > For the convenience for Dave, I found this.
> > 
> > commit ec98c6b9b47df6df1c1fa6cf3d427414f8c2cf16
> > Author: David S. Miller <davem@davemloft.net>
> > Date:   Sun Apr 20 02:14:23 2008 -0700
> > 
> >     [SPARC]: Remove SunOS and Solaris binary support.
> >     
> >     As per Documentation/feature-removal-schedule.txt
> >     
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > 
> > Hello Dave,
> > Could you confirm it?
> 
> I don't understand what you want me to confirm.

Sorry for the lack of information.

Is it okay to use the number 8 for the upcoming madvise(addr, len,
MADV_FREE) feature on the sparc arch?

The reason I ask is that Darrick pointed out earlier that dietlibc has
a Solaris #define MADV_FREE 0x5 in its mman.h, and Hugh pointed out that
it was in the kernel's sparc mman.h up until 2.6.25 but has since
disappeared, so I guess it is okay to use the number 8 for MADV_FREE on
sparc, but I want to confirm it with you.

Thanks.
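
Because the old 0x5 definition is still floating around in some userspace
headers, an allocator cannot trust the compile-time constant alone; a
minimal sketch of a one-time runtime probe, assuming the value 8 from
patch 3/8 (probe_madv_free() and have_madv_free are made-up names for
illustration):

#define _DEFAULT_SOURCE
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* value proposed by patch 3/8 */
#endif

static int have_madv_free = -1; /* -1 = unknown, 0 = no, 1 = yes */

/* Probe once: EINVAL means the running kernel does not accept MADV_FREE. */
static int probe_madv_free(void)
{
        if (have_madv_free < 0) {
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                have_madv_free = (p != MAP_FAILED &&
                                  madvise(p, 4096, MADV_FREE) == 0);
                if (p != MAP_FAILED)
                        munmap(p, 4096);
        }
        return have_madv_free;
}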

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
  2015-11-03  0:52       ` Minchan Kim
@ 2015-11-04  8:15         ` Michal Hocko
  -1 siblings, 0 replies; 97+ messages in thread
From: Michal Hocko @ 2015-11-04  8:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Shaohua Li, Andrew Morton, linux-kernel, linux-mm,
	Michael Kerrisk, linux-api, Hugh Dickins, Johannes Weiner,
	zhangyanfei, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Daniel Micay, Kirill A. Shutemov, yalin.wang2010,
	Wang, Yalin

On Tue 03-11-15 09:52:23, Minchan Kim wrote:
[...]
> I believe adding new LRU list would be controversial(ie, not trivial)
> for maintainer POV even though code wouldn't be complicated.
> So, I want to see problems in *real practice*, not any theoritical
> test program before diving into that.
> To see such voice of request, we should release the syscall.
> So, I want to push this first.

Completely agreed. The functionality is useful already and a new LRU
list is not justified yet.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 6/8] mm: lru_deactivate_fn should clear PG_referenced
  2015-11-03  1:10       ` Minchan Kim
@ 2015-11-04  8:22         ` Michal Hocko
  -1 siblings, 0 replies; 97+ messages in thread
From: Michal Hocko @ 2015-11-04  8:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, yalin.wang2010, Shaohua Li

On Tue 03-11-15 10:10:30, Minchan Kim wrote:
> One thing I suspect is GUP with FOLL_TOUCH which calls mark_page_accesssed
> on anonymous page and will mark PG_referenced.

OK, this is what I've missed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
  2015-11-04 17:53         ` Shaohua Li
@ 2015-11-04 18:20           ` Shaohua Li
  -1 siblings, 0 replies; 97+ messages in thread
From: Shaohua Li @ 2015-11-04 18:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Wang, Yalin

On Wed, Nov 04, 2015 at 09:53:42AM -0800, Shaohua Li wrote:
> On Tue, Nov 03, 2015 at 09:52:23AM +0900, Minchan Kim wrote:
> > On Fri, Oct 30, 2015 at 10:22:12AM -0700, Shaohua Li wrote:
> > > On Fri, Oct 30, 2015 at 04:01:41PM +0900, Minchan Kim wrote:
> > > > MADV_FREE is a hint that it's okay to discard pages if there is memory
> > > > pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them
> > > > so there is no value keeping them in the active anonymous LRU so this
> > > > patch moves them to inactive LRU list's head.
> > > > 
> > > > This means that MADV_FREE-ed pages which were living on the inactive list
> > > > are reclaimed first because they are more likely to be cold rather than
> > > > recently active pages.
> > > > 
> > > > An arguable issue for the approach would be whether we should put the page
> > > > to the head or tail of the inactive list.  I chose head because the kernel
> > > > cannot make sure it's really cold or warm for every MADV_FREE usecase but
> > > > at least we know it's not *hot*, so landing of inactive head would be a
> > > > comprimise for various usecases.
> > > > 
> > > > This fixes suboptimal behavior of MADV_FREE when pages living on the
> > > > active list will sit there for a long time even under memory pressure
> > > > while the inactive list is reclaimed heavily.  This basically breaks the
> > > > whole purpose of using MADV_FREE to help the system to free memory which
> > > > is might not be used.
> > > 
> > > My main concern is the policy how we should treat the FREE pages. Moving it to
> > > inactive lru is definitionly a good start, I'm wondering if it's enough. The
> > > MADV_FREE increases memory pressure and cause unnecessary reclaim because of
> > > the lazy memory free. While MADV_FREE is intended to be a better replacement of
> > > MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
> > > memory immediately. So I hope the MADV_FREE doesn't have impact on memory
> > > pressure too. I'm thinking of adding an extra lru list and wartermark for this
> > > to make sure FREE pages can be freed before system wide page reclaim. As you
> > > said, this is arguable, but I hope we can discuss about this issue more.
> > 
> > Yes, it's arguble. ;-)
> > 
> > It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
> > But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
> > but it's not true because the page would be dirty state when VM want to reclaim. 
> 
> There certainly are other usage cases, but even your patch log mainly describes
> the jemalloc usage case, which uses MADV_DONTNEED.
> 
> > I'm also against with your's suggestion which let's discard FREEed page before
> > system wide page reclaim because system would have lots of clean cold page
> > caches or anonymous pages. In such case, reclaiming of them would be better.
> > Yeb, it's really workload-dependent so we might need some heuristic which is
> > normally what we want to avoid.
> > 
> > Having said that, I agree with you we could do better than the deactivation
> > and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
> > "ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
> > fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
> > When the MADV_FREE is called, we could move hinted pages from anon-LRU to
> > ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
> > it could promote it to acive-anon-LRU which would be very natural aging
> > concept because it mean someone touches the page recenlty.
> > 
> > With that, I don't want to bias one side and don't want to add some knob for
> > tuning the heuristic but let's rely on common fair aging scheme of VM.
> > 
> > Another bonus with new LRU list is we could support MADV_FREE on swapless
> > system.
> > 
> > > 
> > > Or do you want to push this first and address the policy issue later?
> > 
> > I believe adding new LRU list would be controversial(ie, not trivial)
> > for maintainer POV even though code wouldn't be complicated.
> > So, I want to see problems in *real practice*, not any theoritical
> > test program before diving into that.
> > To see such voice of request, we should release the syscall.
> > So, I want to push this first.
> 
> The memory pressure issue isn't just in artificial test. In jemalloc, there is
> a knob (lg_dirty_mult) to control the rate memory should be purged (using
> MADV_DONTNEED). We already had several reports in our production environment
> changing the knob can cause extra memory usage (and swap and so on). If
> jemalloc uses MADV_FREE, jemalloc will not purge any memory, which is equivent
> to disable current MADV_DONTNEED (eg, lg_dirty_mult = -1). I'm sure this will
> cause the similar issue, eg (extram memory usage, swap). That said I don't
> object to push this first, but the memory pressue issue can happen in real
> production, I hope it's not ignored.

I think the question is: if an application uses MADV_DONTNEED today, how much
better is replacing it with MADV_FREE compared to simply deleting the
MADV_DONTNEED call, considering that anonymous memory is currently hard to
reclaim.
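
One way to put numbers on that question is a small reuse micro-benchmark
along these lines (illustrative only; the region size is arbitrary and
MADV_FREE is defined by hand for headers that predate it):

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MADV_FREE
#define MADV_FREE 8		/* asm-generic value; adjust if needed */
#endif

#define LEN (64UL << 20)	/* 64 MiB region, arbitrary */

/* Dirty one byte per page and return the elapsed time in seconds. */
static double touch(char *p)
{
	struct timespec a, b;
	size_t i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < LEN; i += 4096)
		p[i] = 1;
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	touch(p);					/* populate */
	printf("reuse, no purge:           %f s\n", touch(p));
	madvise(p, LEN, MADV_DONTNEED);
	printf("reuse after MADV_DONTNEED: %f s\n", touch(p));
	madvise(p, LEN, MADV_FREE);
	printf("reuse after MADV_FREE:     %f s\n", touch(p));
	return 0;
}

With no memory pressure, the MADV_FREE case should look like the
no-purge case (no faults, no zeroing), while the MADV_DONTNEED case pays
a fault plus a zeroed page for every page reused.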

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/8] MADV_FREE support
  2015-11-01  6:29   ` Daniel Micay
@ 2015-11-04 20:19       ` David Rientjes
  1 sibling, 0 replies; 97+ messages in thread
From: David Rientjes @ 2015-11-04 20:19 UTC (permalink / raw)
  To: Daniel Micay
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm,
	Michael Kerrisk, linux-api, Hugh Dickins, Johannes Weiner,
	zhangyanfei, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Jason Evans, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Shaohua Li

On Sun, 1 Nov 2015, Daniel Micay wrote:

> It can definitely be improved to cooperate well with THP too. I've been
> following the progress, and most of the problems seem to have been with
> the THP and that's a very active area of development. Seems best to deal
> with that after a simple, working implementation lands.
> 
> The best aspect of MADV_FREE is that it completely avoids page faults
> when there's no memory pressure. Making use of the freed memory only
> triggers page faults if the pages had to be dropped because the system
> ran out of memory. It also avoids needing to zero the pages. The memory
> can also still be freed at any time if there's memory pressure again
> even if it's handed out as an allocation until it's actually touched.
> 
> The call to madvise still has significant overhead, but it's much
> cheaper than MADV_DONTNEED. Allocators will be able to lean on the
> kernel to make good decisions rather than implementing lazy freeing
> entirely on their own. It should improve performance *and* behavior
> under memory pressure since allocators can be more aggressive with it
> than MADV_DONTNEED.
> 
> A nice future improvement would be landing MADV_FREE_UNDO feature to
> allow an attempt to pin the pages in memory again. It would make this
> work very well for implementing caches that are dropped under memory
> pressure. Windows has this via MEM_RESET (essentially MADV_FREE) and
> MEM_RESET_UNDO. Android has it for ashmem too (pinning/unpinning). I
> think browser vendors would be very interested in it.
> 

This sounds similar to what I was proposing to prevent thp splits when 
there is no memory pressure.

MADV_SPLITTABLE marks ranges of memory as free; the underlying thp is not 
split as long as there is no memory pressure.  Under memory pressure, it acts 
identically to MADV_DONTNEED.  Without memory pressure, the range is 
enqueued on an lru for the memcg that the vma's mm owner belongs to 
(global for !CONFIG_MEMCG).  It is also linked on a per-vma list for the 
range.  Anytime the vma is manipulated, the MADV_SPLITTABLE ranges are 
also fixed up.

On subsequent memory pressure, the memcg hierarchy lru list is iterated 
(global for !CONFIG_MEMCG) and the MADV_SPLITTABLE ranges are actually 
zapped (including thp split if necessary) and the memory is really freed 
to the system.

MADV_UNSPLITTABLE marks ranges of memory that have already been freed 
through MADV_SPLITTABLE as being in use again.  If there was no memory 
pressure and the MADV_SPLITTABLE range was simply enqueued on the lru list, 
it is removed from that list after the range has been zeroed, with the same 
user-facing semantics as MADV_DONTNEED.  Otherwise, nothing is done, since 
the ptes are already zapped and we'll incur a refault.

The change to tcmalloc is simple: use MADV_SPLITTABLE instead of 
MADV_DONTNEED when freeing memory to the system, and use MADV_UNSPLITTABLE 
when reusing memory that has already been freed to the system.
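
Roughly, the hook changes would look like the sketch below.
MADV_SPLITTABLE and MADV_UNSPLITTABLE are hypothetical advice values
from this proposal and exist in no released kernel; the numeric values
and function names are made up purely for illustration:

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical advice values for this proposal -- not in any kernel. */
#ifndef MADV_SPLITTABLE
#define MADV_SPLITTABLE		0x40
#define MADV_UNSPLITTABLE	0x41
#endif

/* A span is returned to the page heap: mark it lazily freeable. */
static void span_release(void *start, size_t len)
{
	/* previously: madvise(start, len, MADV_DONTNEED); */
	madvise(start, len, MADV_SPLITTABLE);
}

/* A previously released span is handed out again: undo the mark. */
static void span_reuse(void *start, size_t len)
{
	madvise(start, len, MADV_UNSPLITTABLE);
}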

This works well in experimentation when 100% of the heap is backed by thp 
with no memory pressure.  This is a type of lazy-free that prevents thp 
memory from being split in the absence of memory pressure.

I was wondering if this could become part of the MADV_FREE behavior, with 
the MADV_FREE_UNDO behavior as the equivalent of my MADV_UNSPLITTABLE.  If 
there is no common ground, mine can just be implemented separately, but 
I'm trying to avoid the additional system calls that would be required for 
malloc implementations.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-11-04 20:55     ` Johannes Weiner
  0 siblings, 0 replies; 97+ messages in thread
From: Johannes Weiner @ 2015-11-04 20:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Daniel Micay, Kirill A. Shutemov,
	Michal Hocko, yalin.wang2010, Shaohua Li, Wang, Yalin

On Fri, Oct 30, 2015 at 04:01:41PM +0900, Minchan Kim wrote:
> MADV_FREE is a hint that it's okay to discard pages if there is memory
> pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them
> so there is no value keeping them in the active anonymous LRU so this
> patch moves them to inactive LRU list's head.
> 
> This means that MADV_FREE-ed pages which were living on the inactive list
> are reclaimed first because they are more likely to be cold rather than
> recently active pages.
> 
> An arguable issue for the approach would be whether we should put the page
> to the head or tail of the inactive list.  I chose head because the kernel
> cannot make sure it's really cold or warm for every MADV_FREE usecase but
> at least we know it's not *hot*, so landing of inactive head would be a
> comprimise for various usecases.

Even if we're wrong about the aging of those MADV_FREE pages, their
contents are invalidated; they can be discarded freely, and restoring
them is a mere GFP_ZERO allocation. All other anonymous pages have to
be written to disk, and potentially be read back.

[ Arguably, MADV_FREE pages should even be reclaimed before inactive
  page cache. It's the same cost to discard both types of pages, but
  restoring page cache involves IO. ]
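
A minimal illustration of the userspace-visible contract being described
(MADV_FREE is defined by hand here for headers that predate it):

#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* asm-generic value; adjust if your headers differ */
#endif

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	p[0] = 42;			/* dirty anonymous page */
	madvise(p, 4096, MADV_FREE);	/* contents may be discarded lazily */
	/*
	 * From here on, reading p[0] yields either 42 (the page survived)
	 * or 0 (it was reclaimed and refaulted as a freshly zeroed page);
	 * unlike ordinary dirty anon memory, no swap-out or swap-in is
	 * involved.  Writing to the page makes it normal dirty anon again.
	 */
	return p[0];
}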

It probably makes sense to stop thinking about them as anonymous pages
entirely at this point when it comes to aging. They're really not. The
LRU lists are split to differentiate access patterns and cost of page
stealing (and restoring). From that angle, MADV_FREE pages really have
nothing in common with in-use anonymous pages, and so they shouldn't
be on the same LRU list.

That would also fix the very unfortunate and unexpected consequence of
tying the lazy free optimization to the availability of swap space.

I would prefer to see this addressed before the code goes upstream.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-11-04 21:48       ` Daniel Micay
  0 siblings, 0 replies; 97+ messages in thread
From: Daniel Micay @ 2015-11-04 21:48 UTC (permalink / raw)
  To: Johannes Weiner, Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, zhangyanfei, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Jason Evans, Kirill A. Shutemov, Michal Hocko,
	yalin.wang2010, Shaohua Li, Wang, Yalin

> Even if we're wrong about the aging of those MADV_FREE pages, their
> contents are invalidated; they can be discarded freely, and restoring
> them is a mere GFP_ZERO allocation. All other anonymous pages have to
> be written to disk, and potentially be read back.
> 
> [ Arguably, MADV_FREE pages should even be reclaimed before inactive
>   page cache. It's the same cost to discard both types of pages, but
>   restoring page cache involves IO. ]

Keep in mind that this is memory the kernel wouldn't be getting back at
all if the allocator weren't going out of its way to purge it, and
allocators aren't going to go out of their way to purge it if it means
the kernel is going to steal the pages when there isn't actually memory
pressure.

An allocator would be using MADV_DONTNEED if it didn't expect the pages
to be used again shortly. MADV_FREE indicates that it has time to inform
the kernel that they're unused, but they could still be very hot.

> It probably makes sense to stop thinking about them as anonymous pages
> entirely at this point when it comes to aging. They're really not. The
> LRU lists are split to differentiate access patterns and cost of page
> stealing (and restoring). From that angle, MADV_FREE pages really have
> nothing in common with in-use anonymous pages, and so they shouldn't
> be on the same LRU list.
> 
> That would also fix the very unfortunate and unexpected consequence of
> tying the lazy free optimization to the availability of swap space.
> 
> I would prefer to see this addressed before the code goes upstream.

I don't think it would be ideal for these potentially very hot pages to
be dropped before very cold pages were swapped out. It's the kind of
tuning that needs to be informed by lots of real world experience and
lots of testing. It wouldn't impact the API.

Whether MADV_FREE is useful as an API vs. something like a pair of
system calls for pinning and unpinning memory is what should be worried
about right now. The internal implementation just needs to be correct
and useful right now, not perfect. Simpler is probably better than more
finely tuned for an initial implementation, too.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-11-04 22:55         ` Johannes Weiner
  0 siblings, 0 replies; 97+ messages in thread
From: Johannes Weiner @ 2015-11-04 22:55 UTC (permalink / raw)
  To: Daniel Micay
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm,
	Michael Kerrisk, linux-api, Hugh Dickins, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Kirill A. Shutemov, Michal Hocko, yalin.wang2010, Shaohua Li,
	Wang, Yalin

On Wed, Nov 04, 2015 at 04:48:17PM -0500, Daniel Micay wrote:
> > Even if we're wrong about the aging of those MADV_FREE pages, their
> > contents are invalidated; they can be discarded freely, and restoring
> > them is a mere GFP_ZERO allocation. All other anonymous pages have to
> > be written to disk, and potentially be read back.
> > 
> > [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> >   page cache. It's the same cost to discard both types of pages, but
> >   restoring page cache involves IO. ]
> 
> Keep in mind that this is memory the kernel wouldn't be getting back at
> all if the allocator wasn't going out of the way to purge it, and they
> aren't going to go out of their way to purge it if it means the kernel
> is going to steal the pages when there isn't actually memory pressure.

Well, obviously you'd still only reclaim them on memory pressure. I'm
only talking about where these pages should go on the LRU hierarchy.

> > It probably makes sense to stop thinking about them as anonymous pages
> > entirely at this point when it comes to aging. They're really not. The
> > LRU lists are split to differentiate access patterns and cost of page
> > stealing (and restoring). From that angle, MADV_FREE pages really have
> > nothing in common with in-use anonymous pages, and so they shouldn't
> > be on the same LRU list.
> > 
> > That would also fix the very unfortunate and unexpected consequence of
> > tying the lazy free optimization to the availability of swap space.
> > 
> > I would prefer to see this addressed before the code goes upstream.
> 
> I don't think it would be ideal for these potentially very hot pages to
> be dropped before very cold pages were swapped out. It's the kind of
> tuning that needs to be informed by lots of real world experience and
> lots of testing. It wouldn't impact the API.

What about them is hot? They contain garbage; you have to write to
them before you can use them. Granted, you might have to refetch
cachelines if you don't do cacheline-aligned populating writes, but
you can do a lot of them before it's more expensive than doing IO.
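
As a rough order-of-magnitude comparison (both figures below are
assumptions, not measurements): zeroing a 4 KiB page at ~10 GB/s of
memory bandwidth costs about

	4096 B / 10 GB/s ~= 0.4 us per page,

while one 4 KiB random read costs on the order of 100 us from a fast SSD
and milliseconds from a disk, so somewhere between a few hundred and
tens of thousands of page zeroings cost about as much as bringing a
single page back with IO.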

> Whether MADV_FREE is useful as an API vs. something like a pair of
> system calls for pinning and unpinning memory is what should be worried
> about right now. The internal implementation just needs to be correct
> and useful right now, not perfect. Simpler is probably better than it
> being more well tuned for an initial implementation too.

Yes, it wouldn't impact the API, but the dependency on swap is very
random from a user's perspective and severely limits the usefulness of
this. It should probably be addressed before this gets released. As
this involves getting the pages off the anon LRU, we need to figure
out where they should go instead.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
  2015-11-04 22:55         ` Johannes Weiner
@ 2015-11-04 23:36         ` Daniel Micay
  2015-11-04 23:49             ` Daniel Micay
  -1 siblings, 1 reply; 97+ messages in thread
From: Daniel Micay @ 2015-11-04 23:36 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm,
	Michael Kerrisk, linux-api, Hugh Dickins, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Kirill A. Shutemov, Michal Hocko, yalin.wang2010, Shaohua Li,
	Wang, Yalin

>>> It probably makes sense to stop thinking about them as anonymous pages
>>> entirely at this point when it comes to aging. They're really not. The
>>> LRU lists are split to differentiate access patterns and cost of page
>>> stealing (and restoring). From that angle, MADV_FREE pages really have
>>> nothing in common with in-use anonymous pages, and so they shouldn't
>>> be on the same LRU list.
>>>
>>> That would also fix the very unfortunate and unexpected consequence of
>>> tying the lazy free optimization to the availability of swap space.
>>>
>>> I would prefer to see this addressed before the code goes upstream.
>>
>> I don't think it would be ideal for these potentially very hot pages to
>> be dropped before very cold pages were swapped out. It's the kind of
>> tuning that needs to be informed by lots of real world experience and
>> lots of testing. It wouldn't impact the API.
> 
> What about them is hot? They contain garbage, you have to write to
> them before you can use them. Granted, you might have to refetch
> cachelines if you don't do cacheline-aligned populating writes, but
> you can do a lot of them before it's more expensive than doing IO.

It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.
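
For reference, a caricature of the kind of allocator-side throttling
logic meant here, modeled loosely on jemalloc's lg_dirty_mult knob (the
structure and field names are illustrative, not jemalloc's actual code):

#include <stddef.h>
#include <sys/mman.h>

/* Illustrative arena bookkeeping, not jemalloc's actual data structures. */
struct arena {
	size_t nactive;		/* pages currently handed out */
	size_t ndirty;		/* unused pages not yet given back */
	int lg_dirty_mult;	/* purge when ndirty > nactive >> lg_dirty_mult */
};

static void maybe_purge(struct arena *a, void *dirty_run, size_t len)
{
	/*
	 * The allocator has to pick a threshold and eagerly hand pages
	 * back with MADV_DONTNEED, paying refaults and zeroing when it
	 * guesses wrong; MADV_FREE is meant to let the kernel make this
	 * decision only under actual memory pressure.
	 */
	if (a->lg_dirty_mult >= 0 &&
	    a->ndirty > (a->nactive >> a->lg_dirty_mult)) {
		madvise(dirty_run, len, MADV_DONTNEED);
		a->ndirty -= len / 4096;
	}
}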

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.

>> Whether MADV_FREE is useful as an API vs. something like a pair of
>> system calls for pinning and unpinning memory is what should be worried
>> about right now. The internal implementation just needs to be correct
>> and useful right now, not perfect. Simpler is probably better than it
>> being more well tuned for an initial implementation too.
> 
> Yes, it wouldn't impact the API, but the dependency on swap is very
> random from a user experience and severely limits the usefulness of
> this. It should probably be addressed before this gets released. As
> this involves getting the pages off the anon LRU, we need to figure
> out where they should go instead.

From a user perspective, it doesn't depend on swap. It's just slower
without swap because it does what MADV_DONTNEED does. The current
implementation can be dropped in where MADV_DONTNEED was previously used.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-11-04 23:49             ` Daniel Micay
  0 siblings, 0 replies; 97+ messages in thread
From: Daniel Micay @ 2015-11-04 23:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm,
	Michael Kerrisk, linux-api, Hugh Dickins, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Kirill A. Shutemov, Michal Hocko, yalin.wang2010, Shaohua Li,
	Wang, Yalin

> From a user perspective, it doesn't depend on swap. It's just slower
> without swap because it does what MADV_DONTNEED does. The current
> implementation can be dropped in where MADV_DONTNEED was previously used.

It just wouldn't replace existing layers of purging logic until that
edge case is fixed and it gains better THP integration.

It's already a very useful API with significant performance wins over
MADV_DONTNEED. The only risk involved in landing it is that a better
feature might come along, the worst-case scenario being that the kernel
ends up with a synonym for MADV_DONTNEED (but I think there will still
be a use case for this even if a pinning/unpinning API existed, as this
is more precise).



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
@ 2015-11-05  1:03           ` Minchan Kim
  0 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-05  1:03 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Wang, Yalin

On Wed, Nov 04, 2015 at 09:53:42AM -0800, Shaohua Li wrote:
> On Tue, Nov 03, 2015 at 09:52:23AM +0900, Minchan Kim wrote:
> > On Fri, Oct 30, 2015 at 10:22:12AM -0700, Shaohua Li wrote:
> > > On Fri, Oct 30, 2015 at 04:01:41PM +0900, Minchan Kim wrote:
> > > > MADV_FREE is a hint that it's okay to discard pages if there is memory
> > > > pressure, and we use the reclaimers (i.e., kswapd and direct reclaim) to
> > > > free them, so there is no value in keeping them on the active anonymous
> > > > LRU; this patch moves them to the head of the inactive LRU list.
> > > > 
> > > > This means that MADV_FREE-ed pages, now living on the inactive list, are
> > > > reclaimed first because they are more likely to be cold than recently
> > > > active pages.
> > > > 
> > > > An arguable issue with the approach is whether we should put the page at
> > > > the head or the tail of the inactive list.  I chose the head because the
> > > > kernel cannot be sure a page is really cold or warm for every MADV_FREE
> > > > use case, but at least we know it's not *hot*, so landing at the inactive
> > > > head is a compromise for the various use cases.
> > > > 
> > > > This fixes the suboptimal behavior of MADV_FREE where pages living on the
> > > > active list would sit there for a long time even under memory pressure
> > > > while the inactive list was reclaimed heavily.  That basically breaks the
> > > > whole purpose of using MADV_FREE to help the system free memory which
> > > > might not be used.
> > > 
> > > My main concern is the policy for how we should treat the FREE pages.
> > > Moving them to the inactive LRU is definitely a good start, but I'm
> > > wondering if it's enough. MADV_FREE increases memory pressure and causes
> > > unnecessary reclaim because of the lazy memory freeing. While MADV_FREE is
> > > intended as a better replacement for MADV_DONTNEED, MADV_DONTNEED doesn't
> > > have the memory pressure issue because it frees memory immediately. So I
> > > hope MADV_FREE doesn't have an impact on memory pressure either. I'm
> > > thinking of adding an extra LRU list and a watermark for this to make sure
> > > FREE pages can be freed before system-wide page reclaim. As you said, this
> > > is arguable, but I hope we can discuss this issue more.
> > 
> > Yes, it's arguable. ;-)
> > 
> > It seems the divergence comes from treating MADV_FREE as a *replacement*
> > for MADV_DONTNEED, but I don't think of it that way. If we could discard a
> > MADV_FREEed page *anytime* I would agree, but that's not true, because the
> > page may be in a dirty state when the VM wants to reclaim it.
> 
> There certainly are other use cases, but even your patch log mainly describes
> the jemalloc use case, which uses MADV_DONTNEED.
> 
> > I'm also against your suggestion to discard FREEed pages before system-wide
> > page reclaim, because the system may have lots of clean, cold page cache or
> > anonymous pages; in that case, reclaiming those would be better. Yep, it's
> > really workload-dependent, so we might need some heuristic, which is
> > normally what we want to avoid.
> > 
> > Having said that, I agree with you that we could do better than the
> > deactivation and, frankly speaking, I'm thinking of another LRU list (e.g.
> > tentatively named the "ezreclaim LRU list"). What I have in mind is to age
> > (anon|file|ez) fairly. IOW, I want to percolate ez-LRU list reclaiming into
> > get_scan_count. When MADV_FREE is called, we could move the hinted pages
> > from the anon LRU to the ez-LRU, and then, if the VM finds it cannot
> > discard a page on the ez-LRU, it could promote it to the active anon LRU,
> > which would be a very natural aging concept because it means someone
> > touched the page recently.
> > 
> > With that, I don't want to bias either side and don't want to add a knob
> > for tuning the heuristic; let's rely on the VM's common fair aging scheme.
> > 
> > Another bonus of a new LRU list is that we could support MADV_FREE on
> > swapless systems.
> > 
> > > 
> > > Or do you want to push this first and address the policy issue later?
> > 
> > I believe adding a new LRU list would be controversial (i.e., not trivial)
> > from a maintainer's POV even though the code wouldn't be complicated.
> > So I want to see problems in *real practice*, not in some theoretical test
> > program, before diving into that. To hear such requests, we should release
> > the syscall. So I want to push this first.
> 
> The memory pressure issue isn't just in artificial tests. In jemalloc, there
> is a knob (lg_dirty_mult) to control the rate at which memory should be
> purged (using MADV_DONTNEED). We already have several reports from our
> production environment that changing the knob can cause extra memory usage
> (and swap and so on). If jemalloc uses MADV_FREE, jemalloc will not purge any
> memory, which is equivalent to disabling the current MADV_DONTNEED purging
> (e.g., lg_dirty_mult = -1). I'm sure this will cause similar issues (extra
> memory usage, swap). That said, I don't object to pushing this first, but the
> memory pressure issue can happen in real production, and I hope it's not
> ignored.

Absolutely, I'm not saying I want to ignore the concern.
Adding a new LRU would churn many parts of MM, so before that,
let's hear the voice from userland and discuss what's best if
this approach runs into trouble.
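
(As an aside on the knob mentioned above: jemalloc's lg_dirty_mult can
normally be set at startup via MALLOC_CONF, e.g. "lg_dirty_mult:-1" to stop
purging entirely. A minimal sketch of tuning it at runtime through the
mallctl interface follows; the mallctl name and value type are taken from my
reading of the jemalloc 4.x docs and are assumptions, not something stated
in this thread.)

#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

int main(void)
{
	/* Assumed jemalloc 4.x knob: new arenas purge once dirty pages exceed
	 * active/2^lg_dirty_mult; -1 disables purging, which is the case
	 * Shaohua compares MADV_FREE against. */
	ssize_t mult = 3;
	int err = mallctl("arenas.lg_dirty_mult", NULL, NULL,
			  &mult, sizeof(mult));

	if (err != 0)
		fprintf(stderr, "mallctl(arenas.lg_dirty_mult): %d\n", err);
	return 0;
}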

> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/8] mm: move lazily freed pages to inactive list
  2015-11-04 18:20           ` Shaohua Li
@ 2015-11-05  1:11             ` Minchan Kim
  -1 siblings, 0 replies; 97+ messages in thread
From: Minchan Kim @ 2015-11-05  1:11 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Andrew Morton, linux-kernel, linux-mm, Michael Kerrisk,
	linux-api, Hugh Dickins, Johannes Weiner, zhangyanfei,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Jason Evans,
	Daniel Micay, Kirill A. Shutemov, Michal Hocko, yalin.wang2010,
	Wang, Yalin

On Wed, Nov 04, 2015 at 10:20:47AM -0800, Shaohua Li wrote:
> On Wed, Nov 04, 2015 at 09:53:42AM -0800, Shaohua Li wrote:
> > On Tue, Nov 03, 2015 at 09:52:23AM +0900, Minchan Kim wrote:
> > > On Fri, Oct 30, 2015 at 10:22:12AM -0700, Shaohua Li wrote:
> > > > On Fri, Oct 30, 2015 at 04:01:41PM +0900, Minchan Kim wrote:
> > > > > MADV_FREE is a hint that it's okay to discard pages if there is memory
> > > > > pressure, and we use the reclaimers (i.e., kswapd and direct reclaim) to
> > > > > free them, so there is no value in keeping them on the active anonymous
> > > > > LRU; this patch moves them to the head of the inactive LRU list.
> > > > > 
> > > > > This means that MADV_FREE-ed pages, now living on the inactive list, are
> > > > > reclaimed first because they are more likely to be cold than recently
> > > > > active pages.
> > > > > 
> > > > > An arguable issue with the approach is whether we should put the page at
> > > > > the head or the tail of the inactive list.  I chose the head because the
> > > > > kernel cannot be sure a page is really cold or warm for every MADV_FREE
> > > > > use case, but at least we know it's not *hot*, so landing at the inactive
> > > > > head is a compromise for the various use cases.
> > > > > 
> > > > > This fixes the suboptimal behavior of MADV_FREE where pages living on the
> > > > > active list would sit there for a long time even under memory pressure
> > > > > while the inactive list was reclaimed heavily.  That basically breaks the
> > > > > whole purpose of using MADV_FREE to help the system free memory which
> > > > > might not be used.
> > > > 
> > > > My main concern is the policy for how we should treat the FREE pages.
> > > > Moving them to the inactive LRU is definitely a good start, but I'm
> > > > wondering if it's enough. MADV_FREE increases memory pressure and causes
> > > > unnecessary reclaim because of the lazy memory freeing. While MADV_FREE is
> > > > intended as a better replacement for MADV_DONTNEED, MADV_DONTNEED doesn't
> > > > have the memory pressure issue because it frees memory immediately. So I
> > > > hope MADV_FREE doesn't have an impact on memory pressure either. I'm
> > > > thinking of adding an extra LRU list and a watermark for this to make sure
> > > > FREE pages can be freed before system-wide page reclaim. As you said, this
> > > > is arguable, but I hope we can discuss this issue more.
> > > 
> > > Yes, it's arguable. ;-)
> > > 
> > > It seems the divergence comes from treating MADV_FREE as a *replacement*
> > > for MADV_DONTNEED, but I don't think of it that way. If we could discard a
> > > MADV_FREEed page *anytime* I would agree, but that's not true, because the
> > > page may be in a dirty state when the VM wants to reclaim it.
> > 
> > There certainly are other use cases, but even your patch log mainly describes
> > the jemalloc use case, which uses MADV_DONTNEED.
> > 
> > > I'm also against your suggestion to discard FREEed pages before system-wide
> > > page reclaim, because the system may have lots of clean, cold page cache or
> > > anonymous pages; in that case, reclaiming those would be better. Yep, it's
> > > really workload-dependent, so we might need some heuristic, which is
> > > normally what we want to avoid.
> > > 
> > > Having said that, I agree with you that we could do better than the
> > > deactivation and, frankly speaking, I'm thinking of another LRU list (e.g.
> > > tentatively named the "ezreclaim LRU list"). What I have in mind is to age
> > > (anon|file|ez) fairly. IOW, I want to percolate ez-LRU list reclaiming into
> > > get_scan_count. When MADV_FREE is called, we could move the hinted pages
> > > from the anon LRU to the ez-LRU, and then, if the VM finds it cannot
> > > discard a page on the ez-LRU, it could promote it to the active anon LRU,
> > > which would be a very natural aging concept because it means someone
> > > touched the page recently.
> > > 
> > > With that, I don't want to bias either side and don't want to add a knob
> > > for tuning the heuristic; let's rely on the VM's common fair aging scheme.
> > > 
> > > Another bonus of a new LRU list is that we could support MADV_FREE on
> > > swapless systems.
> > > 
> > > > 
> > > > Or do you want to push this first and address the policy issue later?
> > > 
> > > I believe adding a new LRU list would be controversial (i.e., not trivial)
> > > from a maintainer's POV even though the code wouldn't be complicated.
> > > So I want to see problems in *real practice*, not in some theoretical test
> > > program, before diving into that. To hear such requests, we should release
> > > the syscall. So I want to push this first.
> > 
> > The memory pressure issue isn't just in artificial tests. In jemalloc, there
> > is a knob (lg_dirty_mult) to control the rate at which memory should be
> > purged (using MADV_DONTNEED). We already have several reports from our
> > production environment that changing the knob can cause extra memory usage
> > (and swap and so on). If jemalloc uses MADV_FREE, jemalloc will not purge any
> > memory, which is equivalent to disabling the current MADV_DONTNEED purging
> > (e.g., lg_dirty_mult = -1). I'm sure this will cause similar issues (extra
> > memory usage, swap). That said, I don't object to pushing this first, but the
> > memory pressure issue can happen in real production, and I hope it's not
> > ignored.
> 
> I think the question is: if an application uses MADV_DONTNEED today, how much
> better is replacing it with MADV_FREE compared to just deleting the
> MADV_DONTNEED call, considering anonymous memory is currently hard to reclaim?

So, the question from my side is: will applications use MADV_FREE as a
replacement for MADV_DONTNEED without any tuning or modification?
At least, I'd like to know whether jemalloc has a plan.
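
(To make the behavioral difference concrete, here's a minimal sketch, not
from the thread, contrasting the two hints on a private anonymous mapping:
after MADV_DONTNEED the next read is guaranteed to see zero-filled pages,
while after MADV_FREE the old contents may remain visible until reclaim
actually discards them, and a later write keeps the page alive.)

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 0xaa, len);
	madvise(p, len, MADV_DONTNEED);
	printf("after DONTNEED: %#x (always zero)\n", p[0] & 0xff);

#ifdef MADV_FREE
	memset(p, 0xbb, len);
	madvise(p, len, MADV_FREE);
	/* Prints 0xbb until the kernel discards the page, 0 afterwards. */
	printf("after FREE:     %#x\n", p[0] & 0xff);
	p[0] = 0xcc;	/* dirtying the page cancels the lazy free */
#endif
	munmap(p, len);
	return 0;
}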

> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2015-11-05  1:11 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-30  7:01 [PATCH 0/8] MADV_FREE support Minchan Kim
2015-10-30  7:01 ` Minchan Kim
2015-10-30  7:01 ` Minchan Kim
2015-10-30  7:01 ` [PATCH 1/8] mm: support madvise(MADV_FREE) Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30 16:49   ` Shaohua Li
2015-10-30 16:49     ` Shaohua Li
2015-11-03  0:10     ` Minchan Kim
2015-11-03  0:10       ` Minchan Kim
2015-11-03  0:10       ` Minchan Kim
2015-10-30  7:01 ` [PATCH 2/8] mm: define MADV_FREE for some arches Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30  7:01 ` [PATCH 3/8] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-11-02  0:08   ` Hugh Dickins
2015-11-02  0:08     ` Hugh Dickins
2015-11-02  0:08     ` Hugh Dickins
2015-11-02  0:08     ` Hugh Dickins
2015-11-03  2:32     ` Minchan Kim
2015-11-03  2:32       ` Minchan Kim
2015-11-03  2:32       ` Minchan Kim
2015-11-03  2:32       ` Minchan Kim
2015-11-03  2:36       ` Minchan Kim
2015-11-03  2:36         ` Minchan Kim
2015-11-03  2:36         ` Minchan Kim
2015-11-03  2:36         ` Minchan Kim
2015-11-03  3:36         ` David Miller
2015-11-03  3:36           ` David Miller
2015-11-03  3:36           ` David Miller
2015-11-03  4:31           ` Minchan Kim
2015-11-03  4:31             ` Minchan Kim
2015-11-03  4:31             ` Minchan Kim
2015-10-30  7:01 ` [PATCH 4/8] mm: free swp_entry in madvise_free Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30 12:28   ` Michal Hocko
2015-10-30 12:28     ` Michal Hocko
2015-11-03  0:53     ` Minchan Kim
2015-11-03  0:53       ` Minchan Kim
2015-11-03  0:53       ` Minchan Kim
2015-10-30  7:01 ` [PATCH 5/8] mm: move lazily freed pages to inactive list Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30 17:22   ` Shaohua Li
2015-10-30 17:22     ` Shaohua Li
2015-10-30 17:22     ` Shaohua Li
2015-11-03  0:52     ` Minchan Kim
2015-11-03  0:52       ` Minchan Kim
2015-11-03  0:52       ` Minchan Kim
2015-11-04  8:15       ` Michal Hocko
2015-11-04  8:15         ` Michal Hocko
2015-11-04 17:53       ` Shaohua Li
2015-11-04 17:53         ` Shaohua Li
2015-11-04 17:53         ` Shaohua Li
2015-11-04 18:20         ` Shaohua Li
2015-11-04 18:20           ` Shaohua Li
2015-11-05  1:11           ` Minchan Kim
2015-11-05  1:11             ` Minchan Kim
2015-11-05  1:03         ` Minchan Kim
2015-11-05  1:03           ` Minchan Kim
2015-11-05  1:03           ` Minchan Kim
2015-11-04 20:55   ` Johannes Weiner
2015-11-04 20:55     ` Johannes Weiner
2015-11-04 20:55     ` Johannes Weiner
2015-11-04 21:48     ` Daniel Micay
2015-11-04 21:48       ` Daniel Micay
2015-11-04 22:55       ` Johannes Weiner
2015-11-04 22:55         ` Johannes Weiner
2015-11-04 22:55         ` Johannes Weiner
2015-11-04 23:36         ` Daniel Micay
2015-11-04 23:49           ` Daniel Micay
2015-11-04 23:49             ` Daniel Micay
2015-10-30  7:01 ` [PATCH 6/8] mm: lru_deactivate_fn should clear PG_referenced Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30 12:47   ` Michal Hocko
2015-10-30 12:47     ` Michal Hocko
2015-10-30 12:47     ` Michal Hocko
2015-11-03  1:10     ` Minchan Kim
2015-11-03  1:10       ` Minchan Kim
2015-11-04  8:22       ` Michal Hocko
2015-11-04  8:22         ` Michal Hocko
2015-11-04  8:22         ` Michal Hocko
2015-10-30  7:01 ` [PATCH 7/8] mm: clear PG_dirty to mark page freeable Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-10-30 12:55   ` Michal Hocko
2015-10-30 12:55     ` Michal Hocko
2015-10-30 12:55     ` Michal Hocko
2015-10-30  7:01 ` [PATCH 8/8] mm: mark stable page dirty in KSM Minchan Kim
2015-10-30  7:01   ` Minchan Kim
2015-11-01  4:51 ` [PATCH 0/8] MADV_FREE support David Rientjes
2015-11-01  4:51   ` David Rientjes
2015-11-01  4:51   ` David Rientjes
2015-11-01  6:29   ` Daniel Micay
2015-11-03  2:23     ` Minchan Kim
2015-11-03  2:23       ` Minchan Kim
2015-11-03  2:23       ` Minchan Kim
2015-11-04 20:19     ` David Rientjes
2015-11-04 20:19       ` David Rientjes
