* [PATCH V2 0/7] mm: fix some MADV_FREE issues
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

Hi,

We are trying to use MADV_FREE in jemalloc. Several issues were found; without
solving them, jemalloc can't use the MADV_FREE feature. (A minimal usage sketch
follows the list below.)
- It doesn't work on systems without swap enabled. If swap is off, we can't
  (or can't efficiently) age anonymous pages, and since MADV_FREE pages are
  mixed with other anonymous pages, we can't reclaim them. In the current
  implementation, MADV_FREE falls back to MADV_DONTNEED when swap is not
  enabled. But in our environment a lot of machines don't enable swap, which
  prevents our setup from using MADV_FREE.
- It increases memory pressure. Page reclaim biases toward reclaiming file
  pages rather than anonymous pages. That bias doesn't make sense for
  MADV_FREE pages, because those pages can be freed easily and refilled with
  very little penalty. Even if page reclaim didn't favor file pages, there
  would still be an issue, because MADV_FREE pages and other anonymous pages
  are mixed together: to reclaim a MADV_FREE page, we probably must scan a lot
  of other anonymous pages, which is inefficient. In our tests we usually see
  OOM with MADV_FREE enabled and none without it.
- RSS accounting. MADV_FREE pages are accounted as normal anon pages and
  reclaimed lazily, so the application's RSS becomes bigger. This confuses our
  workloads. We have a monitoring daemon running, and if it finds that an
  application's RSS has become abnormal, the daemon kills the application even
  though the kernel could reclaim the memory easily. Currently we don't export
  separate RSS accounting for MADV_FREE pages, which also prevents our setup
  from using MADV_FREE.
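
Returning to the jemalloc use case above, here is a minimal, hedged sketch of
how an allocator could mark a freed, page-aligned chunk lazily reclaimable
(illustrative userspace code, not jemalloc's actual implementation; the
release_chunk() helper is hypothetical):

	#include <sys/mman.h>
	#include <errno.h>

	/*
	 * Mark a freed, page-aligned chunk as lazily reclaimable. The kernel
	 * may discard the pages under memory pressure instead of swapping
	 * them out.
	 */
	static int release_chunk(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_FREE) == 0)
			return 0;
		/* Kernels without MADV_FREE reject the advice with EINVAL. */
		if (errno == EINVAL)
			return madvise(addr, len, MADV_DONTNEED);
		return -1;
	}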

To address the first two issues, we can either put MADV_FREE pages into a
separate LRU list (Minchan's previous patches and the V1 patches), or put them
into the LRU_INACTIVE_FILE list (suggested by Johannes). This patchset uses
the second idea. The reason is that the LRU_INACTIVE_FILE list is tiny
nowadays and should be full of used-once file pages, so we can still
efficiently reclaim MADV_FREE pages there without interfering with other anon
and active file pages. Putting the pages into the inactive file list also has
the advantage of letting page reclaim prioritize MADV_FREE pages and used-once
file pages. MADV_FREE pages are put into that LRU list with the SwapBacked
flag cleared, so PageAnon(page) && !PageSwapBacked(page) indicates a MADV_FREE
page. Such pages are freed directly without pageout if they are clean;
otherwise normal swap reclaims them.
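
In kernel terms this check reduces to a one-line helper; patch 2 below adds it
to include/linux/mm_inline.h as page_is_lazyfree():

	/* A lazyfree (MADV_FREE) page is anonymous but no longer swap backed. */
	static inline bool page_is_lazyfree(struct page *page)
	{
		return PageAnon(page) && !PageSwapBacked(page);
	}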

For the third issue, we add a separate RSS count for MADV_FREE pages. The
count is increased in the madvise syscall and decreased in page reclaim (e.g.,
unmap). There is one limitation: the accounting doesn't work well for shared
pages; please check the last patch. This probably isn't a big issue, because
userspace will write the pages before reusing them, which breaks the page
sharing between the two processes. And if two processes really share a page,
the page can't be lazyfreed anyway.
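
To illustrate the reuse rule this relies on, a hedged userspace sketch (the
reuse_chunk() helper is illustrative, not part of the patchset):

	#include <string.h>

	/*
	 * After MADV_FREE the kernel may discard the pages at any time until
	 * they are written again, so a reuse path must write the memory
	 * before trusting its contents. The write also triggers copy-on-write
	 * on a shared page, which is why a genuinely shared page can't stay
	 * lazyfreed.
	 */
	static void *reuse_chunk(void *addr, size_t len)
	{
		memset(addr, 0, len);	/* write first: ends the lazy-free state */
		return addr;
	}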

Thanks,
Shaohua

V1->V2:
- Put MADV_FREE pages into LRU_INACTIVE_FILE list instead of adding a new lru
  list, suggested by Johannes
- Add RSS support

Minchan's previous patches:
http://marc.info/?l=linux-mm&m=144800657002763&w=2
----------------------
Shaohua Li (7):
  mm: don't assume anonymous pages have SwapBacked flag
  mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  mm: reclaim MADV_FREE pages
  mm: enable MADV_FREE for swapless system
  mm: add vmstat account for MADV_FREE pages
  proc: show MADV_FREE pages info in smaps
  mm: add a separate RSS for MADV_FREE pages

 drivers/base/node.c           |  2 ++
 fs/proc/array.c               |  9 +++++---
 fs/proc/internal.h            |  3 ++-
 fs/proc/meminfo.c             |  1 +
 fs/proc/task_mmu.c            | 17 +++++++++++---
 fs/proc/task_nommu.c          |  4 +++-
 include/linux/mm_inline.h     | 36 +++++++++++++++++++++++++++---
 include/linux/mm_types.h      |  1 +
 include/linux/mmzone.h        |  2 ++
 include/linux/page-flags.h    |  6 +++++
 include/linux/swap.h          |  2 +-
 include/linux/vm_event_item.h |  2 +-
 mm/gup.c                      |  2 ++
 mm/huge_memory.c              | 14 ++++++++----
 mm/khugepaged.c               | 10 ++++-----
 mm/madvise.c                  | 16 ++++++-------
 mm/memory.c                   | 13 +++++++++--
 mm/migrate.c                  |  5 ++++-
 mm/oom_kill.c                 | 10 +++++----
 mm/page_alloc.c               |  7 ++++--
 mm/rmap.c                     | 10 ++++++++-
 mm/swap.c                     | 50 +++++++++++++++++++++++------------------
 mm/vmscan.c                   | 52 +++++++++++++++++++++++++++++++------------
 mm/vmstat.c                   |  3 +++
 24 files changed, 200 insertions(+), 77 deletions(-)

-- 
2.9.3


* [PATCH V2 1/7] mm: don't assume anonymous pages have SwapBacked flag
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

There are a few places where the code assumes anonymous pages have the
SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are going to
add them to the LRU_INACTIVE_FILE list and clear the SwapBacked flag for them.
The assumption no longer holds, so fix those places.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/huge_memory.c | 1 -
 mm/khugepaged.c  | 8 +++-----
 mm/migrate.c     | 3 ++-
 mm/rmap.c        | 3 ++-
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40bd376..ecf569d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2118,7 +2118,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
 	if (PageAnon(head)) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c..a4b499f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -481,8 +481,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 static void release_pte_page(struct page *page)
 {
-	/* 0 stands for page_is_file_cache(page) == false */
-	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -530,7 +529,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
 		/*
 		 * We can do it before isolate_lru_page because the
@@ -577,8 +575,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
-		/* 0 stands for page_is_file_cache(page) == false */
-		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_node_page_state(page,
+				NR_ISOLATED_ANON + page_is_file_cache(page));
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 87f4d0f..eb76f87 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1963,7 +1963,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	/* Prepare a page as a migration target */
 	__SetPageLocked(new_page);
-	__SetPageSwapBacked(new_page);
+	if (PageSwapBacked(page))
+		__SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
diff --git a/mm/rmap.c b/mm/rmap.c
index c48e9c1..c8d6204 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1546,7 +1546,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 * Store the swap location in the pte.
 		 * See handle_pte_fault() ...
 		 */
-		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+		VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
+			page);
 
 		if (!PageDirty(page) && (flags & TTU_LZFREE)) {
 			/* It's a freeable page by MADV_FREE */
-- 
2.9.3


* [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

Userspace indicates that MADV_FREE pages can be freed without pageout, so
they behave much like used-once file pages. We'd like to reclaim such pages
once there is memory pressure. Always reclaiming MADV_FREE pages before
used-once file pages might be unfair, but we definitely want to reclaim them
before other anonymous and active file pages.

To speed up MADV_FREE page reclaim, we put the pages into the
LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list is
tiny nowadays and should be full of used-once file pages, so reclaiming
MADV_FREE pages will not interfere much with anonymous and active file pages.
The inactive file pages and MADV_FREE pages are reclaimed according to their
age, so we also don't reclaim too many MADV_FREE pages. Putting the MADV_FREE
pages into the LRU_INACTIVE_FILE list also means we can reclaim the pages
without swap support. This idea was suggested by Johannes.

We also clear the pages' SwapBacked flag to indicate that they are MADV_FREE
pages.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/mm_inline.h     |  5 +++++
 include/linux/swap.h          |  2 +-
 include/linux/vm_event_item.h |  2 +-
 mm/huge_memory.c              |  5 ++---
 mm/madvise.c                  |  3 +--
 mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
 mm/vmstat.c                   |  1 +
 7 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e030a68..fdded06 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -22,6 +22,11 @@ static inline int page_is_file_cache(struct page *page)
 	return !PageSwapBacked(page);
 }
 
+static inline bool page_is_lazyfree(struct page *page)
+{
+	return PageAnon(page) && !PageSwapBacked(page);
+}
+
 static __always_inline void __update_lru_size(struct lruvec *lruvec,
 				enum lru_list lru, enum zone_type zid,
 				int nr_pages)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd..486494e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 6aa1b6c..94e58da 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		FOR_ALL_ZONES(ALLOCSTALL),
 		FOR_ALL_ZONES(PGSCAN_SKIP),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
 		PGREFILL,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ecf569d..ddb9a94 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1391,9 +1391,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		ClearPageDirty(page);
 	unlock_page(page);
 
-	if (PageActive(page))
-		deactivate_page(page);
-
 	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
 		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
 			tlb->fullmm);
@@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		set_pmd_at(mm, addr, pmd, orig_pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	}
+
+	mark_page_lazyfree(page);
 	ret = true;
 out:
 	spin_unlock(ptl);
diff --git a/mm/madvise.c b/mm/madvise.c
index c867d88..c24549e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -378,10 +378,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			ptent = pte_mkclean(ptent);
 			ptent = pte_wrprotect(ptent);
 			set_pte_at(mm, addr, pte, ptent);
-			if (PageActive(page))
-				deactivate_page(page);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
+		mark_page_lazyfree(page);
 	}
 out:
 	if (nr_swap) {
diff --git a/mm/swap.c b/mm/swap.c
index c4910f1..69a7e9d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,7 +46,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 #endif
@@ -268,6 +268,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
+		if (page_is_lazyfree(page)) {
+			SetPageSwapBacked(page);
+			file = 0;
+			lru = LRU_INACTIVE_ANON;
+		}
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
@@ -561,20 +566,21 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 }
 
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_cache(page);
-		int lru = page_lru_base_type(page);
+	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	    !PageUnevictable(page)) {
+		bool active = PageActive(page);
 
-		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
 		ClearPageActive(page);
 		ClearPageReferenced(page);
-		add_page_to_lru_list(page, lruvec, lru);
+		ClearPageSwapBacked(page);
+		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
 
-		__count_vm_event(PGDEACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 0);
+		update_page_reclaim_stat(lruvec, 1, 0);
+		count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
 	}
 }
 
@@ -604,9 +610,9 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
-	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
 
 	activate_page_drain(cpu);
 }
@@ -638,22 +644,22 @@ void deactivate_file_page(struct page *page)
 }
 
 /**
- * deactivate_page - deactivate a page
+ * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
  *
- * deactivate_page() moves @page to the inactive list if @page was on the active
- * list and was not an unevictable page.  This is done to accelerate the reclaim
- * of @page.
+ * mark_page_lazyfree() moves @page to the inactive file list.
+ * This is done to accelerate the reclaim of @page.
  */
-void deactivate_page(struct page *page)
-{
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+void mark_page_lazyfree(struct page *page)
+{
+	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	    !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
 
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
-		put_cpu_var(lru_deactivate_pvecs);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		put_cpu_var(lru_lazyfree_pvecs);
 	}
 }
 
@@ -704,7 +710,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 69f9aff..7774196 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
 	"pgfree",
 	"pgactivate",
 	"pgdeactivate",
+	"pglazyfree",
 
 	"pgfault",
 	"pgmajfault",
-- 
2.9.3


* [PATCH V2 3/7] mm: reclaim MADV_FREE pages
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

When memory pressure is high, we free MADV_FREE pages. If the pages are not
dirty in the pte, they can be freed immediately; otherwise we can't reclaim
them this way. In that case we put the pages back on the anonymous LRU list
(by setting the SwapBacked flag) and they are reclaimed through the normal
swapout path.

We use the normal page reclaim policy. Since MADV_FREE pages are put into the
inactive file list, such pages and inactive file pages are reclaimed according
to their age. This is expected, because we don't want to reclaim too many
MADV_FREE pages before used-once file pages.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/rmap.c   |  4 ++++
 mm/vmscan.c | 43 +++++++++++++++++++++++++++++++------------
 2 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c8d6204..5f05926 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1554,6 +1554,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			dec_mm_counter(mm, MM_ANONPAGES);
 			rp->lazyfreed++;
 			goto discard;
+		} else if (flags & TTU_LZFREE) {
+			set_pte_at(mm, address, pte, pteval);
+			ret = SWAP_FAIL;
+			goto out_unmap;
 		}
 
 		if (swap_duplicate(entry) < 0) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 947ab6f..b304a84 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -864,7 +864,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_RECLAIM;
 
 	if (referenced_ptes) {
-		if (PageSwapBacked(page))
+		if (PageSwapBacked(page) || PageAnon(page))
 			return PAGEREF_ACTIVATE;
 		/*
 		 * All mapped pages start out with page table
@@ -903,7 +903,7 @@ static enum page_references page_check_references(struct page *page,
 
 /* Check if a page is dirty or under writeback */
 static void page_check_dirty_writeback(struct page *page,
-				       bool *dirty, bool *writeback)
+			bool *dirty, bool *writeback, bool lazyfree)
 {
 	struct address_space *mapping;
 
@@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
 	 * Anonymous pages are not handled by flushers and must be written
 	 * from reclaim context. Do not stall reclaim based on them
 	 */
-	if (!page_is_file_cache(page)) {
+	if (!page_is_file_cache(page) || lazyfree) {
 		*dirty = false;
 		*writeback = false;
 		return;
@@ -971,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
-		bool lazyfree = false;
+		bool lazyfree;
 		int ret = SWAP_SUCCESS;
 
 		cond_resched();
@@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		sc->nr_scanned++;
 
+		lazyfree = page_is_lazyfree(page);
+
 		if (unlikely(!page_evictable(page)))
 			goto cull_mlocked;
 
@@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep_locked;
 
 		/* Double the slab pressure for mapped and swapcache pages */
-		if (page_mapped(page) || PageSwapCache(page))
+		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
 			sc->nr_scanned++;
 
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
@@ -1005,7 +1007,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * will stall and start writing pages if the tail of the LRU
 		 * is all dirty unqueued pages.
 		 */
-		page_check_dirty_writeback(page, &dirty, &writeback);
+		page_check_dirty_writeback(page, &dirty, &writeback, lazyfree);
 		if (dirty || writeback)
 			nr_dirty++;
 
@@ -1107,6 +1109,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			; /* try to reclaim the page below */
 		}
 
+		/* lazyfree page could be freed directly */
+		if (lazyfree) {
+			if (unlikely(PageTransHuge(page)) &&
+			    split_huge_page_to_list(page, page_list))
+				goto keep_locked;
+			goto unmap_page;
+		}
+
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
@@ -1116,7 +1126,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
-			lazyfree = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1128,12 +1137,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		VM_BUG_ON_PAGE(PageTransHuge(page), page);
-
+unmap_page:
 		/*
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && mapping) {
+		if (page_mapped(page) && (mapping || lazyfree)) {
 			switch (ret = try_to_unmap(page, lazyfree ?
 				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
 				(ttu_flags | TTU_BATCH_FLUSH))) {
@@ -1145,7 +1154,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			case SWAP_MLOCK:
 				goto cull_mlocked;
 			case SWAP_LZFREE:
-				goto lazyfree;
+				/* follow __remove_mapping for reference */
+				if (page_ref_freeze(page, 1)) {
+					if (!PageDirty(page))
+						goto lazyfree;
+					else
+						page_ref_unfreeze(page, 1);
+				}
+				goto keep_locked;
 			case SWAP_SUCCESS:
 				; /* try to free the page below */
 			}
@@ -1257,10 +1273,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-lazyfree:
 		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
-
+lazyfree:
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
@@ -1285,6 +1300,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 cull_mlocked:
 		if (PageSwapCache(page))
 			try_to_free_swap(page);
+		if (lazyfree)
+			SetPageSwapBacked(page);
 		unlock_page(page);
 		list_add(&page->lru, &ret_pages);
 		continue;
@@ -1294,6 +1311,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
 			try_to_free_swap(page);
 		VM_BUG_ON_PAGE(PageActive(page), page);
+		if (lazyfree)
+			SetPageSwapBacked(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
-- 
2.9.3


* [PATCH V2 4/7] mm: enable MADV_FREE for swapless system
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

Now MADV_FREE pages can be easily reclaimed even on a swapless system, so we
can safely enable MADV_FREE for all systems.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/madvise.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c24549e..fe40e93 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -579,13 +579,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_FREE:
-		/*
-		 * XXX: In this implementation, MADV_FREE works like
-		 * MADV_DONTNEED on swapless system or full swap.
-		 */
-		if (get_nr_swap_pages() > 0)
-			return madvise_free(vma, prev, start, end);
-		/* passthrough */
+		return madvise_free(vma, prev, start, end);
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
-- 
2.9.3


* [PATCH V2 5/7] mm: add vmstat account for MADV_FREE pages
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

Show MADV_FREE page info in /proc and sysfs files.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/base/node.c       |  2 ++
 fs/proc/meminfo.c         |  1 +
 include/linux/mm_inline.h | 31 ++++++++++++++++++++++++++++---
 include/linux/mmzone.h    |  2 ++
 mm/page_alloc.c           |  7 +++++--
 mm/vmscan.c               |  9 +++++++--
 mm/vmstat.c               |  2 ++
 7 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..9138db8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -71,6 +71,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d Active(file):   %8lu kB\n"
 		       "Node %d Inactive(file): %8lu kB\n"
 		       "Node %d Unevictable:    %8lu kB\n"
+		       "Node %d LazyFree:       %8lu kB\n"
 		       "Node %d Mlocked:        %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
@@ -84,6 +85,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
 		       nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
 		       nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
+		       nid, K(node_page_state(pgdat, NR_LAZYFREE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8a42849..b2e7b31 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -80,6 +80,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Active(file):   ", pages[LRU_ACTIVE_FILE]);
 	show_val_kb(m, "Inactive(file): ", pages[LRU_INACTIVE_FILE]);
 	show_val_kb(m, "Unevictable:    ", pages[LRU_UNEVICTABLE]);
+	show_val_kb(m, "LazyFree:       ", global_node_page_state(NR_LAZYFREE));
 	show_val_kb(m, "Mlocked:        ", global_page_state(NR_MLOCK));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index fdded06..3e496de 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -48,25 +48,50 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
 #endif
 }
 
+static __always_inline void __update_lazyfree_size(struct lruvec *lruvec,
+				enum zone_type zid, int nr_pages)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+
+	__mod_node_page_state(pgdat, NR_LAZYFREE, nr_pages);
+	__mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LAZYFREE,
+				nr_pages);
+}
+
 static __always_inline void add_page_to_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	enum zone_type zid = page_zonenum(page);
+	int nr_pages = hpage_nr_pages(page);
+
+	if (lru == LRU_INACTIVE_FILE && page_is_lazyfree(page))
+		__update_lazyfree_size(lruvec, zid, nr_pages);
+	update_lru_size(lruvec, lru, zid, nr_pages);
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
 static __always_inline void add_page_to_lru_list_tail(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
-	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	enum zone_type zid = page_zonenum(page);
+	int nr_pages = hpage_nr_pages(page);
+
+	if (lru == LRU_INACTIVE_FILE && page_is_lazyfree(page))
+		__update_lazyfree_size(lruvec, zid, nr_pages);
+	update_lru_size(lruvec, lru, zid, nr_pages);
 	list_add_tail(&page->lru, &lruvec->lists[lru]);
 }
 
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
+	enum zone_type zid = page_zonenum(page);
+	int nr_pages = hpage_nr_pages(page);
+
 	list_del(&page->lru);
-	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
+	if (lru == LRU_INACTIVE_FILE && page_is_lazyfree(page))
+		__update_lazyfree_size(lruvec, zid, -nr_pages);
+	update_lru_size(lruvec, lru, zid, -nr_pages);
 }
 
 /**
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 338a786a..78985f1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -118,6 +118,7 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_LAZYFREE,
 	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
@@ -147,6 +148,7 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_LAZYFREE,		/*  "     "     "   "       "         */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 11b4cd4..d0ff8c2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4453,7 +4453,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-		" free:%lu free_pcp:%lu free_cma:%lu\n",
+		" free:%lu free_pcp:%lu free_cma:%lu lazy_free:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
 		global_node_page_state(NR_ISOLATED_ANON),
@@ -4472,7 +4472,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_page_state(NR_BOUNCE),
 		global_page_state(NR_FREE_PAGES),
 		free_pcp,
-		global_page_state(NR_FREE_CMA_PAGES));
+		global_page_state(NR_FREE_CMA_PAGES),
+		global_node_page_state(NR_LAZYFREE));
 
 	for_each_online_pgdat(pgdat) {
 		if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
@@ -4484,6 +4485,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
+			" lazy_free:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" mapped:%lukB"
@@ -4506,6 +4508,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
+			K(node_page_state(pgdat, NR_LAZYFREE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b304a84..1a98467 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1442,7 +1442,8 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
  * be complete before mem_cgroup_update_lru_size due to a santity check.
  */
 static __always_inline void update_lru_sizes(struct lruvec *lruvec,
-			enum lru_list lru, unsigned long *nr_zone_taken)
+			enum lru_list lru, unsigned long *nr_zone_taken,
+			unsigned long *nr_zone_lazyfree)
 {
 	int zid;
 
@@ -1450,6 +1451,7 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 		if (!nr_zone_taken[zid])
 			continue;
 
+		__update_lazyfree_size(lruvec, zid, -nr_zone_lazyfree[zid]);
 		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
 #ifdef CONFIG_MEMCG
 		mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
@@ -1486,6 +1488,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	struct list_head *src = &lruvec->lists[lru];
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
+	unsigned long nr_zone_lazyfree[MAX_NR_ZONES] = { 0 };
 	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
 	unsigned long skipped = 0, total_skipped = 0;
 	unsigned long scan, nr_pages;
@@ -1517,6 +1520,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			nr_pages = hpage_nr_pages(page);
 			nr_taken += nr_pages;
 			nr_zone_taken[page_zonenum(page)] += nr_pages;
+			if (page_is_lazyfree(page))
+				nr_zone_lazyfree[page_zonenum(page)] += nr_pages;
 			list_move(&page->lru, dst);
 			break;
 
@@ -1560,7 +1565,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	*nr_scanned = scan + total_skipped;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				    scan, skipped, nr_taken, mode, lru);
-	update_lru_sizes(lruvec, lru, nr_zone_taken);
+	update_lru_sizes(lruvec, lru, nr_zone_taken, nr_zone_lazyfree);
 	return nr_taken;
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7774196..a70b52d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,7 @@ const char * const vmstat_text[] = {
 	"nr_zone_inactive_file",
 	"nr_zone_active_file",
 	"nr_zone_unevictable",
+	"nr_zone_lazyfree",
 	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
@@ -952,6 +953,7 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
+	"nr_lazyfree",
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"nr_pages_scanned",
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH V2 6/7] proc: show MADV_FREE pages info in smaps
  2017-02-03 23:33 ` Shaohua Li
@ 2017-02-03 23:33   ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/proc/task_mmu.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb2..8f2423f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -440,6 +440,7 @@ struct mem_size_stats {
 	unsigned long private_dirty;
 	unsigned long referenced;
 	unsigned long anonymous;
+	unsigned long lazyfree;
 	unsigned long anonymous_thp;
 	unsigned long shmem_thp;
 	unsigned long swap;
@@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	int i, nr = compound ? 1 << compound_order(page) : 1;
 	unsigned long size = nr * PAGE_SIZE;
 
-	if (PageAnon(page))
+	if (PageAnon(page)) {
 		mss->anonymous += size;
+		if (!PageSwapBacked(page))
+			mss->lazyfree += size;
+	}
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
@@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   "Private_Dirty:  %8lu kB\n"
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
+		   "LazyFree:       %8lu kB\n"
 		   "AnonHugePages:  %8lu kB\n"
 		   "ShmemPmdMapped: %8lu kB\n"
 		   "Shared_Hugetlb: %8lu kB\n"
@@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   mss.private_dirty >> 10,
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
+		   mss.lazyfree >> 10,
 		   mss.anonymous_thp >> 10,
 		   mss.shmem_thp >> 10,
 		   mss.shared_hugetlb >> 10,
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-03 23:33 ` Shaohua Li
@ 2017-02-03 23:33   ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-03 23:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

Add a separate RSS counter for MADV_FREE pages. The pages are charged into
MM_ANONPAGES (because they are mapped anon pages) and also into
MM_LAZYFREEPAGES. /proc/pid/statm gets an extra field to display this RSS,
which userspace can use to determine the RSS excluding MADV_FREE pages.
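
For example, a monitoring daemon could derive the RSS excluding MADV_FREE
pages from the extra field roughly like this (a minimal userspace sketch,
not part of the patch; the field order follows the statm format shown in
the hunk below):

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	unsigned long size, resident, shared, text, lib, data, dt;
	unsigned long lazyfree = 0;	/* stays 0 on kernels without the new field */
	char path[64];
	FILE *f;
	int n;

	snprintf(path, sizeof(path), "/proc/%s/statm",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	/* size resident shared text lib data dt [lazyfree] */
	n = fscanf(f, "%lu %lu %lu %lu %lu %lu %lu %lu",
		   &size, &resident, &shared, &text, &lib, &data, &dt,
		   &lazyfree);
	fclose(f);
	if (n < 7)
		return 1;
	printf("rss excluding lazyfree: %lu kB\n",
	       (resident - lazyfree) * (sysconf(_SC_PAGESIZE) >> 10));
	return 0;
}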

The basic idea is to increment the RSS in madvise and decrement it in unmap
or page reclaim. There is one limitation: if a page is shared by two
processes, madvise only has the mm context of the current process, so it
isn't convenient to charge the RSS for both processes. Therefore we don't
charge the RSS if the mapcount isn't 1. On the other hand, fork can make a
MADV_FREE page shared by two processes. To keep things consistent, we
uncharge the RSS from the source mm in fork.

A new page flag is added to indicate whether a page is accounted into the
RSS. We can't use the SwapBacked flag for this because we can't guarantee
the page has the SwapBacked flag cleared in madvise. We reuse the
mappedtodisk flag, which should never be set for anon pages.

There are a couple of other places where we need to uncharge the RSS:
activate_page and mark_page_accessed. activate_page is used by swap, and
MADV_FREE pages are already out of the lazyfree state before going into
swap. mark_page_accessed is mainly used for file pages, but there are
several places where it's used for anonymous pages. I fixed gup, but not
some GPU drivers and kvm. If those drivers use MADV_FREE, we might have
imprecise RSS accounting.

Please note, the accounting is never going to be precise. A MADV_FREE page
could be written by userspace without notifying the kernel. Such a page
can't be reclaimed like other clean lazyfree pages and isn't a real
lazyfree page, but since the kernel isn't aware of the write, the page is
still accounted as lazyfree, so the accounting can be off.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/proc/array.c            |  9 ++++++---
 fs/proc/internal.h         |  3 ++-
 fs/proc/task_mmu.c         |  9 +++++++--
 fs/proc/task_nommu.c       |  4 +++-
 include/linux/mm_types.h   |  1 +
 include/linux/page-flags.h |  6 ++++++
 mm/gup.c                   |  2 ++
 mm/huge_memory.c           |  8 ++++++++
 mm/khugepaged.c            |  2 ++
 mm/madvise.c               |  5 +++++
 mm/memory.c                | 13 +++++++++++--
 mm/migrate.c               |  2 ++
 mm/oom_kill.c              | 10 ++++++----
 mm/rmap.c                  |  3 +++
 14 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 51a4213..c2281f4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -583,17 +583,19 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
 	unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
+	unsigned long lazyfree = 0;
 	struct mm_struct *mm = get_task_mm(task);
 
 	if (mm) {
-		size = task_statm(mm, &shared, &text, &data, &resident);
+		size = task_statm(mm, &shared, &text, &data, &resident,
+				  &lazyfree);
 		mmput(mm);
 	}
 	/*
 	 * For quick read, open code by putting numbers directly
 	 * expected format is
-	 * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0\n",
-	 *               size, resident, shared, text, data);
+	 * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0 %lu\n",
+	 *               size, resident, shared, text, data, lazyfree);
 	 */
 	seq_put_decimal_ull(m, "", size);
 	seq_put_decimal_ull(m, " ", resident);
@@ -602,6 +604,7 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 	seq_put_decimal_ull(m, " ", 0);
 	seq_put_decimal_ull(m, " ", data);
 	seq_put_decimal_ull(m, " ", 0);
+	seq_put_decimal_ull(m, " ", lazyfree);
 	seq_putc(m, '\n');
 
 	return 0;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index e2c3c46..6587b9c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -290,5 +290,6 @@ extern const struct file_operations proc_pagemap_operations;
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
 				unsigned long *, unsigned long *,
-				unsigned long *, unsigned long *);
+				unsigned long *, unsigned long *,
+				unsigned long *);
 extern void task_mem(struct seq_file *, struct mm_struct *);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8f2423f..f18b568 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -23,9 +23,10 @@
 
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
-	unsigned long text, lib, swap, ptes, pmds, anon, file, shmem;
+	unsigned long text, lib, swap, ptes, pmds, anon, file, shmem, lazyfree;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
+	lazyfree = get_mm_counter(mm, MM_LAZYFREEPAGES);
 	anon = get_mm_counter(mm, MM_ANONPAGES);
 	file = get_mm_counter(mm, MM_FILEPAGES);
 	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
@@ -59,6 +60,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"RssAnon:\t%8lu kB\n"
 		"RssFile:\t%8lu kB\n"
 		"RssShmem:\t%8lu kB\n"
+		"RssLazyfree:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
@@ -75,6 +77,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		anon << (PAGE_SHIFT-10),
 		file << (PAGE_SHIFT-10),
 		shmem << (PAGE_SHIFT-10),
+		lazyfree << (PAGE_SHIFT-10),
 		mm->data_vm << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
@@ -90,7 +93,8 @@ unsigned long task_vsize(struct mm_struct *mm)
 
 unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
-			 unsigned long *data, unsigned long *resident)
+			 unsigned long *data, unsigned long *resident,
+			 unsigned long *lazyfree)
 {
 	*shared = get_mm_counter(mm, MM_FILEPAGES) +
 			get_mm_counter(mm, MM_SHMEMPAGES);
@@ -98,6 +102,7 @@ unsigned long task_statm(struct mm_struct *mm,
 								>> PAGE_SHIFT;
 	*data = mm->data_vm + mm->stack_vm;
 	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+	*lazyfree = get_mm_counter(mm, MM_LAZYFREEPAGES);
 	return mm->total_vm;
 }
 
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 1ef97cf..50426de 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -94,7 +94,8 @@ unsigned long task_vsize(struct mm_struct *mm)
 
 unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
-			 unsigned long *data, unsigned long *resident)
+			 unsigned long *data, unsigned long *resident,
+			 unsigned long *lazyfree)
 {
 	struct vm_area_struct *vma;
 	struct vm_region *region;
@@ -120,6 +121,7 @@ unsigned long task_statm(struct mm_struct *mm,
 	size >>= PAGE_SHIFT;
 	size += *text + *data;
 	*resident = size;
+	*lazyfree = 0;
 	return size;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f6d440..b6a1428 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -376,6 +376,7 @@ enum {
 	MM_ANONPAGES,	/* Resident anonymous pages */
 	MM_SWAPENTS,	/* Anonymous swap entries */
 	MM_SHMEMPAGES,	/* Resident shared memory pages */
+	MM_LAZYFREEPAGES, /* Lazyfree pages, also charged into MM_ANONPAGES */
 	NR_MM_COUNTERS
 };
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..67c732b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,6 +107,8 @@ enum pageflags {
 #endif
 	__NR_PAGEFLAGS,
 
+	PG_lazyfreeaccounted = PG_mappedtodisk, /* only for anon MADV_FREE pages */
+
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -428,6 +430,10 @@ TESTPAGEFLAG_FALSE(Ksm)
 
 u64 stable_page_flags(struct page *page);
 
+PAGEFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
+	TESTSETFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
+	TESTCLEARFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret;
diff --git a/mm/gup.c b/mm/gup.c
index 40abe4c..e64d990 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -171,6 +171,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		 * mark_page_accessed().
 		 */
 		mark_page_accessed(page);
+		if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
+			dec_mm_counter(mm, MM_LAZYFREEPAGES);
 	}
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/* Do not mlock pte-mapped THP */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ddb9a94..951fa34 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -871,6 +871,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
 	page_dup_rmap(src_page, true);
+	if (PageAnon(src_page) && TestClearPageLazyFreeAccounted(src_page))
+		add_mm_counter(src_mm, MM_LAZYFREEPAGES, -HPAGE_PMD_NR);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	atomic_long_inc(&dst_mm->nr_ptes);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -1402,6 +1404,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	}
 
+	if (page_mapcount(page) == 1 && !TestSetPageLazyFreeAccounted(page))
+		add_mm_counter(mm, MM_LAZYFREEPAGES, HPAGE_PMD_NR);
 	mark_page_lazyfree(page);
 	ret = true;
 out:
@@ -1459,6 +1463,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			pte_free(tlb->mm, pgtable);
 			atomic_long_dec(&tlb->mm->nr_ptes);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			if (TestClearPageLazyFreeAccounted(page))
+				add_mm_counter(tlb->mm, MM_LAZYFREEPAGES,
+						-HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
 				zap_deposited_table(tlb->mm, pmd);
@@ -1917,6 +1924,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_swapbacked) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
+			 (1L << PG_lazyfreeaccounted) |
 			 (1L << PG_active) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4b499f..e4668db 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -577,6 +577,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		}
 		inc_node_page_state(page,
 				NR_ISOLATED_ANON + page_is_file_cache(page));
+		if (TestClearPageLazyFreeAccounted(page))
+			dec_mm_counter(vma->vm_mm, MM_LAZYFREEPAGES);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/madvise.c b/mm/madvise.c
index fe40e93..3c90956 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -275,6 +275,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	int nr_swap = 0;
 	unsigned long next;
+	int nr_lazyfree_accounted = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -380,9 +381,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			set_pte_at(mm, addr, pte, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
+		if (page_mapcount(page) == 1 &&
+		    !TestSetPageLazyFreeAccounted(page))
+			nr_lazyfree_accounted++;
 		mark_page_lazyfree(page);
 	}
 out:
+	add_mm_counter(mm, MM_LAZYFREEPAGES, nr_lazyfree_accounted);
 	if (nr_swap) {
 		if (current->mm == mm)
 			sync_mm_rss(mm);
diff --git a/mm/memory.c b/mm/memory.c
index cf97d88..e275de1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -850,7 +850,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 static inline unsigned long
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		unsigned long addr, int *rss, int *rss_src_lazyfree)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
@@ -915,6 +915,9 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
+		if (PageAnon(page) &&
+		    TestClearPageLazyFreeAccounted(page))
+			(*rss_src_lazyfree)++;
 		rss[mm_counter(page)]++;
 	}
 
@@ -932,10 +935,12 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[NR_MM_COUNTERS];
+	int rss_src_lazyfree;
 	swp_entry_t entry = (swp_entry_t){0};
 
 again:
 	init_rss_vec(rss);
+	rss_src_lazyfree = 0;
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -963,13 +968,14 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			continue;
 		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
-							vma, addr, rss);
+					vma, addr, rss, &rss_src_lazyfree);
 		if (entry.val)
 			break;
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	add_mm_counter(src_mm, MM_LAZYFREEPAGES, -rss_src_lazyfree);
 	spin_unlock(src_ptl);
 	pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
@@ -1163,6 +1169,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
+			if (PageAnon(page) &&
+			    TestClearPageLazyFreeAccounted(page))
+				rss[MM_LAZYFREEPAGES]--;
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
diff --git a/mm/migrate.c b/mm/migrate.c
index eb76f87..6e586d2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -642,6 +642,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
 		SetPageMappedToDisk(newpage);
+	if (PageLazyFreeAccounted(page))
+		SetPageLazyFreeAccounted(newpage);
 
 	/* Move dirty on pages not done by migrate_page_move_mapping() */
 	if (PageDirty(page))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 51c0918..54e0604 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -528,11 +528,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 					 NULL);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, lazyfree-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+			K(get_mm_counter(mm, MM_SHMEMPAGES)),
+			K(get_mm_counter(mm, MM_LAZYFREEPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
@@ -878,11 +879,12 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
-	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, lazyfree-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+		K(get_mm_counter(victim->mm, MM_LAZYFREEPAGES)));
 	task_unlock(victim);
 
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 5f05926..86c80d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1585,6 +1585,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	put_page(page);
 
 out_unmap:
+	/* regardless of success or failure, the page isn't lazyfree */
+	if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
+		add_mm_counter(mm, MM_LAZYFREEPAGES, -hpage_nr_pages(page));
 	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
 		mmu_notifier_invalidate_page(mm, address);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-04  6:38     ` Hillf Danton
  -1 siblings, 0 replies; 62+ messages in thread
From: Hillf Danton @ 2017-02-04  6:38 UTC (permalink / raw)
  To: 'Shaohua Li', linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm

On February 04, 2017 7:33 AM Shaohua Li wrote: 
> @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		set_pmd_at(mm, addr, pmd, orig_pmd);
>  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	}
> +
> +	mark_page_lazyfree(page);
>  	ret = true;
>  out:
>  	spin_unlock(ptl);

<snipped>

> -void deactivate_page(struct page *page)
> -{
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +void mark_page_lazyfree(struct page *page)
> + {
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
> 
>  		get_page(page);
>  		if (!pagevec_add(pvec, page) || PageCompound(page))
> -			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> -		put_cpu_var(lru_deactivate_pvecs);
> +			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> +		put_cpu_var(lru_lazyfree_pvecs);
>  	}
>  }

You are not adding it, but would you please try to fix or avoid flipping
the preempt count with the page table lock held?

thanks
Hillf

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-04  6:38     ` Hillf Danton
@ 2017-02-09  6:33       ` Hillf Danton
  -1 siblings, 0 replies; 62+ messages in thread
From: Hillf Danton @ 2017-02-09  6:33 UTC (permalink / raw)
  To: Hillf Danton, 'Shaohua Li', linux-kernel, linux-mm
  Cc: Kernel-team, danielmicay, mhocko, minchan, hughd, hannes, riel,
	mgorman, akpm


On February 04, 2017 2:38 PM Hillf Danton wrote: 
> 
> On February 04, 2017 7:33 AM Shaohua Li wrote:
> > @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		set_pmd_at(mm, addr, pmd, orig_pmd);
> >  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> >  	}
> > +
> > +	mark_page_lazyfree(page);
> >  	ret = true;
> >  out:
> >  	spin_unlock(ptl);
> 
> <snipped>
> 
> > -void deactivate_page(struct page *page)
> > -{
> > -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> > -		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> > +void mark_page_lazyfree(struct page *page)
> > + {
> > +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> > +	    !PageUnevictable(page)) {
> > +		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
> >
> >  		get_page(page);
> >  		if (!pagevec_add(pvec, page) || PageCompound(page))
> > -			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> > -		put_cpu_var(lru_deactivate_pvecs);
> > +			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> > +		put_cpu_var(lru_lazyfree_pvecs);
> >  	}
> >  }
> 
> You are not adding it, but would you please try to fix or avoid flipping
> the preempt count while the page table lock is held?
> 
preempt_disable()/preempt_enable() are embedded in spin_lock()/spin_unlock(),
so please ignore my noise.
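
For the record, a minimal sketch of the nesting (assuming a
non-PREEMPT_RT configuration; the preempt-count values are only
illustrative):

	spin_lock(ptl);				/* preempt_count 0 -> 1 */
	...
	mark_page_lazyfree(page);
	  pvec = &get_cpu_var(lru_lazyfree_pvecs);	/* 1 -> 2 */
	  get_page(page);
	  if (!pagevec_add(pvec, page) || PageCompound(page))
		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
	  put_cpu_var(lru_lazyfree_pvecs);		/* 2 -> 1 */
	...
	spin_unlock(ptl);			/* preempt_count 1 -> 0 */

So preemption is never re-enabled while the page table lock is held;
the count only nests.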

thanks
Hillf

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10  6:50     ` Minchan Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2017-02-10  6:50 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

Hi Shaohua,

On Fri, Feb 03, 2017 at 03:33:18PM -0800, Shaohua Li wrote:
> Userspace indicates MADV_FREE pages could be freed without pageout, so
> it pretty much likes used once file pages. For such pages, we'd like to
> reclaim them once there is memory pressure. Also it might be unfair
> reclaiming MADV_FREE pages always before used once file pages and we
> definitively want to reclaim the pages before other anonymous and file
> pages.
> 
> To speed up MADV_FREE pages reclaim, we put the pages into
> LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
> nowadays and should be full of used once file pages. Reclaiming
> MADV_FREE pages will not have much interfere of anonymous and active
> file pages. And the inactive file pages and MADV_FREE pages will be
> reclaimed according to their age, so we don't reclaim too many MADV_FREE
> pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
> means we can reclaim the pages without swap support. This idea is
> suggested by Johannes.
> 
> We also clear the pages SwapBacked flag to indicate they are MADV_FREE
> pages.

I think this patch should be merged with 3/7. Otherwise, MADV_FREE will
be broken during a bisect.

> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  include/linux/mm_inline.h     |  5 +++++
>  include/linux/swap.h          |  2 +-
>  include/linux/vm_event_item.h |  2 +-
>  mm/huge_memory.c              |  5 ++---
>  mm/madvise.c                  |  3 +--
>  mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
>  mm/vmstat.c                   |  1 +
>  7 files changed, 39 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index e030a68..fdded06 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -22,6 +22,11 @@ static inline int page_is_file_cache(struct page *page)
>  	return !PageSwapBacked(page);
>  }
>  
> +static inline bool page_is_lazyfree(struct page *page)
> +{
> +	return PageAnon(page) && !PageSwapBacked(page);
> +}
> +

trivial:

How about using PageLazyFree for consistency with the other PageXXX helpers?
Likewise, use SetPageLazyFree/ClearPageLazyFree rather than the raw
{Set,Clear}PageSwapBacked calls.
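
Something like the below, perhaps -- only a sketch, assuming lazyfree
state keeps being encoded as anon + !SwapBacked (these helper names are
the suggestion above, not code in the posted series):

static inline bool PageLazyFree(struct page *page)
{
	return PageAnon(page) && !PageSwapBacked(page);
}

/* callers must only pass anon pages, as with the open-coded variant */
static inline void SetPageLazyFree(struct page *page)
{
	ClearPageSwapBacked(page);
}

static inline void ClearPageLazyFree(struct page *page)
{
	SetPageSwapBacked(page);
}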

>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  				enum lru_list lru, enum zone_type zid,
>  				int nr_pages)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd..486494e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> -extern void deactivate_page(struct page *page);
> +extern void mark_page_lazyfree(struct page *page);

trivial:

How about "deactivate_lazyfree_page"? IMO, it would show intention
clear that move the lazy free page to inactive list.

It's just matter of preference so I'm not strong against.

>  extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 6aa1b6c..94e58da 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
>  		FOR_ALL_ZONES(ALLOCSTALL),
>  		FOR_ALL_ZONES(PGSCAN_SKIP),
> -		PGFREE, PGACTIVATE, PGDEACTIVATE,
> +		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
>  		PGFAULT, PGMAJFAULT,
>  		PGLAZYFREED,
>  		PGREFILL,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ecf569d..ddb9a94 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1391,9 +1391,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		ClearPageDirty(page);
>  	unlock_page(page);
>  
> -	if (PageActive(page))
> -		deactivate_page(page);
> -
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
>  		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
>  			tlb->fullmm);
> @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		set_pmd_at(mm, addr, pmd, orig_pmd);
>  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	}
> +
> +	mark_page_lazyfree(page);
>  	ret = true;
>  out:
>  	spin_unlock(ptl);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c867d88..c24549e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -378,10 +378,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			ptent = pte_mkclean(ptent);
>  			ptent = pte_wrprotect(ptent);
>  			set_pte_at(mm, addr, pte, ptent);
> -			if (PageActive(page))
> -				deactivate_page(page);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
>  		}
> +		mark_page_lazyfree(page);
>  	}
>  out:
>  	if (nr_swap) {
> diff --git a/mm/swap.c b/mm/swap.c
> index c4910f1..69a7e9d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -46,7 +46,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> -static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  #endif
> @@ -268,6 +268,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru);
> +		if (page_is_lazyfree(page)) {
> +			SetPageSwapBacked(page);
> +			file = 0;

I don't see why you set file to 0. Could you explain the rationale?

> +			lru = LRU_INACTIVE_ANON;
> +		}
>  		SetPageActive(page);
>  		lru += LRU_ACTIVE;
>  		add_page_to_lru_list(page, lruvec, lru);
> @@ -561,20 +566,21 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  }
>  
>  
> -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		int file = page_is_file_cache(page);
> -		int lru = page_lru_base_type(page);
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		bool active = PageActive(page);
>  
> -		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
>  		ClearPageActive(page);
>  		ClearPageReferenced(page);
> -		add_page_to_lru_list(page, lruvec, lru);
> +		ClearPageSwapBacked(page);
> +		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
>  
> -		__count_vm_event(PGDEACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 0);
> +		update_page_reclaim_stat(lruvec, 1, 0);
> +		count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
>  	}
>  }
>  
> @@ -604,9 +610,9 @@ void lru_add_drain_cpu(int cpu)
>  	if (pagevec_count(pvec))
>  		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
>  
> -	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
>  	if (pagevec_count(pvec))
> -		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
>  
>  	activate_page_drain(cpu);
>  }
> @@ -638,22 +644,22 @@ void deactivate_file_page(struct page *page)
>  }
>  
>  /**
> - * deactivate_page - deactivate a page
> + * mark_page_lazyfree - make an anon page lazyfree
>   * @page: page to deactivate
>   *
> - * deactivate_page() moves @page to the inactive list if @page was on the active
> - * list and was not an unevictable page.  This is done to accelerate the reclaim
> - * of @page.
> + * mark_page_lazyfree() moves @page to the inactive file list.
> + * This is done to accelerate the reclaim of @page.
>   */
> -void deactivate_page(struct page *page)
> -{
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +void mark_page_lazyfree(struct page *page)
> + {
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
>  
>  		get_page(page);
>  		if (!pagevec_add(pvec, page) || PageCompound(page))
> -			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> -		put_cpu_var(lru_deactivate_pvecs);
> +			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> +		put_cpu_var(lru_lazyfree_pvecs);
>  	}
>  }
>  
> @@ -704,7 +710,7 @@ void lru_add_drain_all(void)
>  		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
> -		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
> +		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
>  		    need_activate_page_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
>  			queue_work_on(cpu, lru_add_drain_wq, work);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff..7774196 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
>  	"pgfree",
>  	"pgactivate",
>  	"pgdeactivate",
> +	"pglazyfree",
>  
>  	"pgfault",
>  	"pgmajfault",
> -- 
> 2.9.3
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 3/7] mm: reclaim MADV_FREE pages
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10  6:58     ` Minchan Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2017-02-10  6:58 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 03, 2017 at 03:33:19PM -0800, Shaohua Li wrote:
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in pte, the pages could be freed immediately. Otherwise we
> can't reclaim them. We put the pages back to anonumous LRU list (by
> setting SwapBacked flag) and the pages will be reclaimed in normal
> swapout way.
> 
> We use normal page reclaim policy. Since MADV_FREE pages are put into
> inactive file list, such pages and inactive file pages are reclaimed
> according to their age. This is expected, because we don't want to
> reclaim too many MADV_FREE pages before used once pages.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  mm/rmap.c   |  4 ++++
>  mm/vmscan.c | 43 +++++++++++++++++++++++++++++++------------
>  2 files changed, 35 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c8d6204..5f05926 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1554,6 +1554,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			dec_mm_counter(mm, MM_ANONPAGES);
>  			rp->lazyfreed++;
>  			goto discard;
> +		} else if (flags & TTU_LZFREE) {
> +			set_pte_at(mm, address, pte, pteval);
> +			ret = SWAP_FAIL;
> +			goto out_unmap;

trivial:

How about this?

if (flags & TTU_LZFREE) {
	if (PageDirty(page)) {
		set_pte_at(XXX);
		ret = SWAP_FAIL;
		goto out_unmap;
	} else {
		dec_mm_counter(mm, MM_ANONPAGES);
		rp->lazyfreed++;
		goto discard;
	}
}

>  		}
>  
>  		if (swap_duplicate(entry) < 0) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 947ab6f..b304a84 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -864,7 +864,7 @@ static enum page_references page_check_references(struct page *page,
>  		return PAGEREF_RECLAIM;
>  
>  	if (referenced_ptes) {
> -		if (PageSwapBacked(page))
> +		if (PageSwapBacked(page) || PageAnon(page))

If anyone accesses a MADV_FREEed range with a load op, not a store,
why shouldn't we discard those pages?
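
Seen from userspace, the distinction looks roughly like this (a sketch
of the intended MADV_FREE semantics, not code from the series):

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	...
	madvise(buf, len, MADV_FREE);
	tmp = buf[0];	/* load: page stays clean, still reclaimable */
	buf[0] = 1;	/* store: dirties the page, cancels the lazy free */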

>  			return PAGEREF_ACTIVATE;
>  		/*
>  		 * All mapped pages start out with page table
> @@ -903,7 +903,7 @@ static enum page_references page_check_references(struct page *page,
>  
>  /* Check if a page is dirty or under writeback */
>  static void page_check_dirty_writeback(struct page *page,
> -				       bool *dirty, bool *writeback)
> +			bool *dirty, bool *writeback, bool lazyfree)
>  {
>  	struct address_space *mapping;
>  
> @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
>  	 * Anonymous pages are not handled by flushers and must be written
>  	 * from reclaim context. Do not stall reclaim based on them
>  	 */
> -	if (!page_is_file_cache(page)) {
> +	if (!page_is_file_cache(page) || lazyfree) {

trivial:

We can check it with PageLazyFree here rather than passing a lazyfree
argument. That would be consistent with the page_is_file_cache check here.
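
I.e. keep the current signature and check the page state directly,
something like (a sketch, using page_is_lazyfree() from patch 2/7):

static void page_check_dirty_writeback(struct page *page,
				       bool *dirty, bool *writeback)
{
	struct address_space *mapping;

	/*
	 * Anonymous pages are not handled by flushers, and lazyfree pages
	 * are either dropped or turned back into normal anon pages, so do
	 * not stall reclaim based on either of them.
	 */
	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
		*dirty = false;
		*writeback = false;
		return;
	}
	...
}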

>  		*dirty = false;
>  		*writeback = false;
>  		return;
> @@ -971,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		int may_enter_fs;
>  		enum page_references references = PAGEREF_RECLAIM_CLEAN;
>  		bool dirty, writeback;
> -		bool lazyfree = false;
> +		bool lazyfree;
>  		int ret = SWAP_SUCCESS;
>  
>  		cond_resched();
> @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  		sc->nr_scanned++;
>  
> +		lazyfree = page_is_lazyfree(page);
> +
>  		if (unlikely(!page_evictable(page)))
>  			goto cull_mlocked;
>  
> @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			goto keep_locked;
>  
>  		/* Double the slab pressure for mapped and swapcache pages */
> -		if (page_mapped(page) || PageSwapCache(page))
> +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
>  			sc->nr_scanned++;

At this point, we cannot know whether a lazyfree-marked page is discardable
or not. If it is freeable and mapped, this logic makes sense. But what
if the page is dirty?

>  
>  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> @@ -1005,7 +1007,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 * will stall and start writing pages if the tail of the LRU
>  		 * is all dirty unqueued pages.
>  		 */
> -		page_check_dirty_writeback(page, &dirty, &writeback);
> +		page_check_dirty_writeback(page, &dirty, &writeback, lazyfree);
>  		if (dirty || writeback)
>  			nr_dirty++;
>  
> @@ -1107,6 +1109,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			; /* try to reclaim the page below */
>  		}
>  
> +		/* lazyfree page could be freed directly */
> +		if (lazyfree) {
> +			if (unlikely(PageTransHuge(page)) &&
> +			    split_huge_page_to_list(page, page_list))
> +				goto keep_locked;
> +			goto unmap_page;
> +		}
> +

Maybe we can remove this hunk and instead add a lazyfree check here:

		if (PageAnon(page) && !PageSwapCache(page) && !lazyfree) {
			if (!(sc->gfp_mask & __GFP_IO))

>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
> @@ -1116,7 +1126,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto keep_locked;
>  			if (!add_to_swap(page, page_list))
>  				goto activate_locked;
> -			lazyfree = true;
>  			may_enter_fs = 1;
>  
>  			/* Adding to swap updated mapping */
> @@ -1128,12 +1137,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		VM_BUG_ON_PAGE(PageTransHuge(page), page);
> -
> +unmap_page:
>  		/*
>  		 * The page is mapped into the page tables of one or more
>  		 * processes. Try to unmap it here.
>  		 */
> -		if (page_mapped(page) && mapping) {
> +		if (page_mapped(page) && (mapping || lazyfree)) {
>  			switch (ret = try_to_unmap(page, lazyfree ?
>  				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
>  				(ttu_flags | TTU_BATCH_FLUSH))) {
> @@ -1145,7 +1154,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			case SWAP_MLOCK:
>  				goto cull_mlocked;
>  			case SWAP_LZFREE:
> -				goto lazyfree;
> +				/* follow __remove_mapping for reference */
> +				if (page_ref_freeze(page, 1)) {
> +					if (!PageDirty(page))
> +						goto lazyfree;
> +					else
> +						page_ref_unfreeze(page, 1);
> +				}
> +				goto keep_locked;
>  			case SWAP_SUCCESS:
>  				; /* try to free the page below */
>  			}
> @@ -1257,10 +1273,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -lazyfree:
>  		if (!mapping || !__remove_mapping(mapping, page, true))
>  			goto keep_locked;
> -
> +lazyfree:
>  		/*
>  		 * At this point, we have no other references and there is
>  		 * no way to pick any more up (removed from LRU, removed
> @@ -1285,6 +1300,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  cull_mlocked:
>  		if (PageSwapCache(page))
>  			try_to_free_swap(page);
> +		if (lazyfree)
> +			SetPageSwapBacked(page);
>  		unlock_page(page);
>  		list_add(&page->lru, &ret_pages);
>  		continue;
> @@ -1294,6 +1311,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
>  			try_to_free_swap(page);
>  		VM_BUG_ON_PAGE(PageActive(page), page);
> +		if (lazyfree)
> +			SetPageSwapBacked(page);
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> -- 
> 2.9.3
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10 13:02     ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-10 13:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri 03-02-17 15:33:18, Shaohua Li wrote:
> Userspace indicates MADV_FREE pages could be freed without pageout, so
> it pretty much likes used once file pages. For such pages, we'd like to
> reclaim them once there is memory pressure. Also it might be unfair
> reclaiming MADV_FREE pages always before used once file pages and we
> definitively want to reclaim the pages before other anonymous and file
> pages.
> 
> To speed up MADV_FREE pages reclaim, we put the pages into
> LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
> nowadays and should be full of used once file pages. Reclaiming
> MADV_FREE pages will not have much interfere of anonymous and active
> file pages. And the inactive file pages and MADV_FREE pages will be
> reclaimed according to their age, so we don't reclaim too many MADV_FREE
> pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
> means we can reclaim the pages without swap support. This idea is
> suggested by Johannes.
> 
> We also clear the pages SwapBacked flag to indicate they are MADV_FREE
> pages.

I like this. I had expected it to be more convoluted, but it looks
quite straightforward. I didn't get to do a really deep review and add
my acked-by, but from a quick look there do not seem to be any surprises.
I was worried about vmstat accounting. There are some places which
isolate a page from the LRU, account it based on that LRU, and later use
page_is_file_cache to tell which LRU this was. This should work fine,
though, because you never touch pages which are off-LRU.
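
(A simplified illustration of the pattern I mean, not a quote of any
particular call site:

	lru = page_lru(page);			/* e.g. LRU_INACTIVE_FILE */
	del_page_from_lru_list(page, lruvec, lru);
	/* ... page sits off-LRU for a while ... */
	add_page_to_lru_list(page, lruvec, page_lru(page));

page_lru()/page_is_file_cache() are derived from PageSwapBacked, so the
two updates only balance if that flag is not flipped while the page is
off-LRU -- which is the property above.)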

That being said, I do not see any major issues. There might be some minor
things and this will need a lot of testing, but it is definitely a move
in the right direction. I hope to do the deeper review after I get back
from vacation (20th Feb).

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>

I guess
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

would be appropriate.

> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  include/linux/mm_inline.h     |  5 +++++
>  include/linux/swap.h          |  2 +-
>  include/linux/vm_event_item.h |  2 +-
>  mm/huge_memory.c              |  5 ++---
>  mm/madvise.c                  |  3 +--
>  mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
>  mm/vmstat.c                   |  1 +
>  7 files changed, 39 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index e030a68..fdded06 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -22,6 +22,11 @@ static inline int page_is_file_cache(struct page *page)
>  	return !PageSwapBacked(page);
>  }
>  
> +static inline bool page_is_lazyfree(struct page *page)
> +{
> +	return PageAnon(page) && !PageSwapBacked(page);
> +}
> +
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  				enum lru_list lru, enum zone_type zid,
>  				int nr_pages)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd..486494e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> -extern void deactivate_page(struct page *page);
> +extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 6aa1b6c..94e58da 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
>  		FOR_ALL_ZONES(ALLOCSTALL),
>  		FOR_ALL_ZONES(PGSCAN_SKIP),
> -		PGFREE, PGACTIVATE, PGDEACTIVATE,
> +		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
>  		PGFAULT, PGMAJFAULT,
>  		PGLAZYFREED,
>  		PGREFILL,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ecf569d..ddb9a94 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1391,9 +1391,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		ClearPageDirty(page);
>  	unlock_page(page);
>  
> -	if (PageActive(page))
> -		deactivate_page(page);
> -
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
>  		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
>  			tlb->fullmm);
> @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		set_pmd_at(mm, addr, pmd, orig_pmd);
>  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	}
> +
> +	mark_page_lazyfree(page);
>  	ret = true;
>  out:
>  	spin_unlock(ptl);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c867d88..c24549e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -378,10 +378,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			ptent = pte_mkclean(ptent);
>  			ptent = pte_wrprotect(ptent);
>  			set_pte_at(mm, addr, pte, ptent);
> -			if (PageActive(page))
> -				deactivate_page(page);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
>  		}
> +		mark_page_lazyfree(page);
>  	}
>  out:
>  	if (nr_swap) {
> diff --git a/mm/swap.c b/mm/swap.c
> index c4910f1..69a7e9d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -46,7 +46,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> -static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  #endif
> @@ -268,6 +268,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru);
> +		if (page_is_lazyfree(page)) {
> +			SetPageSwapBacked(page);
> +			file = 0;
> +			lru = LRU_INACTIVE_ANON;
> +		}
>  		SetPageActive(page);
>  		lru += LRU_ACTIVE;
>  		add_page_to_lru_list(page, lruvec, lru);
> @@ -561,20 +566,21 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  }
>  
>  
> -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		int file = page_is_file_cache(page);
> -		int lru = page_lru_base_type(page);
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		bool active = PageActive(page);
>  
> -		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
>  		ClearPageActive(page);
>  		ClearPageReferenced(page);
> -		add_page_to_lru_list(page, lruvec, lru);
> +		ClearPageSwapBacked(page);
> +		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
>  
> -		__count_vm_event(PGDEACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 0);
> +		update_page_reclaim_stat(lruvec, 1, 0);
> +		count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
>  	}
>  }
>  
> @@ -604,9 +610,9 @@ void lru_add_drain_cpu(int cpu)
>  	if (pagevec_count(pvec))
>  		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
>  
> -	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
>  	if (pagevec_count(pvec))
> -		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
>  
>  	activate_page_drain(cpu);
>  }
> @@ -638,22 +644,22 @@ void deactivate_file_page(struct page *page)
>  }
>  
>  /**
> - * deactivate_page - deactivate a page
> + * mark_page_lazyfree - make an anon page lazyfree
>   * @page: page to deactivate
>   *
> - * deactivate_page() moves @page to the inactive list if @page was on the active
> - * list and was not an unevictable page.  This is done to accelerate the reclaim
> - * of @page.
> + * mark_page_lazyfree() moves @page to the inactive file list.
> + * This is done to accelerate the reclaim of @page.
>   */
> -void deactivate_page(struct page *page)
> -{
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +void mark_page_lazyfree(struct page *page)
> + {
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
>  
>  		get_page(page);
>  		if (!pagevec_add(pvec, page) || PageCompound(page))
> -			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> -		put_cpu_var(lru_deactivate_pvecs);
> +			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> +		put_cpu_var(lru_lazyfree_pvecs);
>  	}
>  }
>  
> @@ -704,7 +710,7 @@ void lru_add_drain_all(void)
>  		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
> -		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
> +		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
>  		    need_activate_page_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
>  			queue_work_on(cpu, lru_add_drain_wq, work);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff..7774196 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
>  	"pgfree",
>  	"pgactivate",
>  	"pgdeactivate",
> +	"pglazyfree",
>  
>  	"pgfault",
>  	"pgmajfault",
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
@ 2017-02-10 13:02     ` Michal Hocko
  0 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-10 13:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri 03-02-17 15:33:18, Shaohua Li wrote:
> Userspace indicates MADV_FREE pages could be freed without pageout, so
> it pretty much likes used once file pages. For such pages, we'd like to
> reclaim them once there is memory pressure. Also it might be unfair
> reclaiming MADV_FREE pages always before used once file pages and we
> definitively want to reclaim the pages before other anonymous and file
> pages.
> 
> To speed up MADV_FREE pages reclaim, we put the pages into
> LRU_INACTIVE_FILE list. The rationale is LRU_INACTIVE_FILE list is tiny
> nowadays and should be full of used once file pages. Reclaiming
> MADV_FREE pages will not have much interfere of anonymous and active
> file pages. And the inactive file pages and MADV_FREE pages will be
> reclaimed according to their age, so we don't reclaim too many MADV_FREE
> pages too. Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
> means we can reclaim the pages without swap support. This idea is
> suggested by Johannes.
> 
> We also clear the pages SwapBacked flag to indicate they are MADV_FREE
> pages.

I like this. I had expected it to be more convoluted, but it looks
quite straightforward. I didn't get to do a really deep review and add
my acked-by, but from a quick look there do not seem to be any surprises.
I was worried about vmstat accounting. There are some places which
isolate a page from the LRU, account it based on that LRU, and later use
page_is_file_cache to tell which LRU this was. This should work fine,
though, because you never touch pages which are off-LRU.

That being said, I do not see any major issues. There might be some minor
things and this will need a lot of testing, but it is definitely a move
in the right direction. I hope to do the deeper review after I get back
from vacation (20th Feb).

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>

I guess
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

would be appropriate.

> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  include/linux/mm_inline.h     |  5 +++++
>  include/linux/swap.h          |  2 +-
>  include/linux/vm_event_item.h |  2 +-
>  mm/huge_memory.c              |  5 ++---
>  mm/madvise.c                  |  3 +--
>  mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
>  mm/vmstat.c                   |  1 +
>  7 files changed, 39 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index e030a68..fdded06 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -22,6 +22,11 @@ static inline int page_is_file_cache(struct page *page)
>  	return !PageSwapBacked(page);
>  }
>  
> +static inline bool page_is_lazyfree(struct page *page)
> +{
> +	return PageAnon(page) && !PageSwapBacked(page);
> +}
> +
>  static __always_inline void __update_lru_size(struct lruvec *lruvec,
>  				enum lru_list lru, enum zone_type zid,
>  				int nr_pages)
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd..486494e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> -extern void deactivate_page(struct page *page);
> +extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 6aa1b6c..94e58da 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
>  		FOR_ALL_ZONES(ALLOCSTALL),
>  		FOR_ALL_ZONES(PGSCAN_SKIP),
> -		PGFREE, PGACTIVATE, PGDEACTIVATE,
> +		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
>  		PGFAULT, PGMAJFAULT,
>  		PGLAZYFREED,
>  		PGREFILL,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ecf569d..ddb9a94 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1391,9 +1391,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		ClearPageDirty(page);
>  	unlock_page(page);
>  
> -	if (PageActive(page))
> -		deactivate_page(page);
> -
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
>  		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
>  			tlb->fullmm);
> @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		set_pmd_at(mm, addr, pmd, orig_pmd);
>  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	}
> +
> +	mark_page_lazyfree(page);
>  	ret = true;
>  out:
>  	spin_unlock(ptl);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c867d88..c24549e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -378,10 +378,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			ptent = pte_mkclean(ptent);
>  			ptent = pte_wrprotect(ptent);
>  			set_pte_at(mm, addr, pte, ptent);
> -			if (PageActive(page))
> -				deactivate_page(page);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
>  		}
> +		mark_page_lazyfree(page);
>  	}
>  out:
>  	if (nr_swap) {
> diff --git a/mm/swap.c b/mm/swap.c
> index c4910f1..69a7e9d 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -46,7 +46,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> -static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  #endif
> @@ -268,6 +268,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
>  		int lru = page_lru_base_type(page);
>  
>  		del_page_from_lru_list(page, lruvec, lru);
> +		if (page_is_lazyfree(page)) {
> +			SetPageSwapBacked(page);
> +			file = 0;
> +			lru = LRU_INACTIVE_ANON;
> +		}
>  		SetPageActive(page);
>  		lru += LRU_ACTIVE;
>  		add_page_to_lru_list(page, lruvec, lru);
> @@ -561,20 +566,21 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  }
>  
>  
> -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		int file = page_is_file_cache(page);
> -		int lru = page_lru_base_type(page);
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		bool active = PageActive(page);
>  
> -		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
>  		ClearPageActive(page);
>  		ClearPageReferenced(page);
> -		add_page_to_lru_list(page, lruvec, lru);
> +		ClearPageSwapBacked(page);
> +		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
>  
> -		__count_vm_event(PGDEACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 0);
> +		update_page_reclaim_stat(lruvec, 1, 0);
> +		count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
>  	}
>  }
>  
> @@ -604,9 +610,9 @@ void lru_add_drain_cpu(int cpu)
>  	if (pagevec_count(pvec))
>  		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
>  
> -	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
>  	if (pagevec_count(pvec))
> -		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
>  
>  	activate_page_drain(cpu);
>  }
> @@ -638,22 +644,22 @@ void deactivate_file_page(struct page *page)
>  }
>  
>  /**
> - * deactivate_page - deactivate a page
> + * mark_page_lazyfree - make an anon page lazyfree
>   * @page: page to deactivate
>   *
> - * deactivate_page() moves @page to the inactive list if @page was on the active
> - * list and was not an unevictable page.  This is done to accelerate the reclaim
> - * of @page.
> + * mark_page_lazyfree() moves @page to the inactive file list.
> + * This is done to accelerate the reclaim of @page.
>   */
> -void deactivate_page(struct page *page)
> -{
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +void mark_page_lazyfree(struct page *page)
> + {
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
>  
>  		get_page(page);
>  		if (!pagevec_add(pvec, page) || PageCompound(page))
> -			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> -		put_cpu_var(lru_deactivate_pvecs);
> +			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> +		put_cpu_var(lru_lazyfree_pvecs);
>  	}
>  }
>  
> @@ -704,7 +710,7 @@ void lru_add_drain_all(void)
>  		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
> -		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
> +		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
>  		    need_activate_page_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
>  			queue_work_on(cpu, lru_add_drain_wq, work);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff..7774196 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
>  	"pgfree",
>  	"pgactivate",
>  	"pgdeactivate",
> +	"pglazyfree",
>  
>  	"pgfault",
>  	"pgmajfault",
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 3/7] mm: reclaim MADV_FREE pages
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10 13:23     ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-10 13:23 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri 03-02-17 15:33:19, Shaohua Li wrote:
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in the pte, the pages can be freed immediately. Otherwise we
> can't reclaim them. We put the pages back on the anonymous LRU list (by
> setting the SwapBacked flag) and the pages will be reclaimed via the
> normal swapout path.
> 
> We use the normal page reclaim policy. Since MADV_FREE pages are put
> into the inactive file list, such pages and inactive file pages are
> reclaimed according to their age. This is expected, because we don't
> want to reclaim too many MADV_FREE pages before used-once file pages.

Ohh, so this is where the convoluted part sits ;) I thought we would just
check for the references/dirty bit, make a lazy free page a regular anon
page again and activate it. The lazyfree checks in shrink_page_list seem
quite excessive to me. Maybe I am just oversimplifying it, though.
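
Roughly, the simpler flow I had in mind is something like this sketch (not
taken from the patch; page_was_touched() is a made-up placeholder for the
references/dirty check):

        /* sketch only: lazyfree handling in shrink_page_list() */
        if (page_is_lazyfree(page)) {
                if (page_was_touched(page)) {           /* hypothetical helper */
                        SetPageSwapBacked(page);        /* back to regular anon */
                        goto activate_locked;
                }
                /* clean and untouched: unmap and free it directly */
        }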
 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  mm/rmap.c   |  4 ++++
>  mm/vmscan.c | 43 +++++++++++++++++++++++++++++++------------
>  2 files changed, 35 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c8d6204..5f05926 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1554,6 +1554,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			dec_mm_counter(mm, MM_ANONPAGES);
>  			rp->lazyfreed++;
>  			goto discard;
> +		} else if (flags & TTU_LZFREE) {
> +			set_pte_at(mm, address, pte, pteval);
> +			ret = SWAP_FAIL;
> +			goto out_unmap;
>  		}
>  
>  		if (swap_duplicate(entry) < 0) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 947ab6f..b304a84 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -864,7 +864,7 @@ static enum page_references page_check_references(struct page *page,
>  		return PAGEREF_RECLAIM;
>  
>  	if (referenced_ptes) {
> -		if (PageSwapBacked(page))
> +		if (PageSwapBacked(page) || PageAnon(page))
>  			return PAGEREF_ACTIVATE;
>  		/*
>  		 * All mapped pages start out with page table
> @@ -903,7 +903,7 @@ static enum page_references page_check_references(struct page *page,
>  
>  /* Check if a page is dirty or under writeback */
>  static void page_check_dirty_writeback(struct page *page,
> -				       bool *dirty, bool *writeback)
> +			bool *dirty, bool *writeback, bool lazyfree)
>  {
>  	struct address_space *mapping;
>  
> @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
>  	 * Anonymous pages are not handled by flushers and must be written
>  	 * from reclaim context. Do not stall reclaim based on them
>  	 */
> -	if (!page_is_file_cache(page)) {
> +	if (!page_is_file_cache(page) || lazyfree) {
>  		*dirty = false;
>  		*writeback = false;
>  		return;
> @@ -971,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		int may_enter_fs;
>  		enum page_references references = PAGEREF_RECLAIM_CLEAN;
>  		bool dirty, writeback;
> -		bool lazyfree = false;
> +		bool lazyfree;
>  		int ret = SWAP_SUCCESS;
>  
>  		cond_resched();
> @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  		sc->nr_scanned++;
>  
> +		lazyfree = page_is_lazyfree(page);
> +
>  		if (unlikely(!page_evictable(page)))
>  			goto cull_mlocked;
>  
> @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			goto keep_locked;
>  
>  		/* Double the slab pressure for mapped and swapcache pages */
> -		if (page_mapped(page) || PageSwapCache(page))
> +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
>  			sc->nr_scanned++;
>  
>  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> @@ -1005,7 +1007,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 * will stall and start writing pages if the tail of the LRU
>  		 * is all dirty unqueued pages.
>  		 */
> -		page_check_dirty_writeback(page, &dirty, &writeback);
> +		page_check_dirty_writeback(page, &dirty, &writeback, lazyfree);
>  		if (dirty || writeback)
>  			nr_dirty++;
>  
> @@ -1107,6 +1109,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			; /* try to reclaim the page below */
>  		}
>  
> +		/* lazyfree page could be freed directly */
> +		if (lazyfree) {
> +			if (unlikely(PageTransHuge(page)) &&
> +			    split_huge_page_to_list(page, page_list))
> +				goto keep_locked;
> +			goto unmap_page;
> +		}
> +
>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
> @@ -1116,7 +1126,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto keep_locked;
>  			if (!add_to_swap(page, page_list))
>  				goto activate_locked;
> -			lazyfree = true;
>  			may_enter_fs = 1;
>  
>  			/* Adding to swap updated mapping */
> @@ -1128,12 +1137,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		VM_BUG_ON_PAGE(PageTransHuge(page), page);
> -
> +unmap_page:
>  		/*
>  		 * The page is mapped into the page tables of one or more
>  		 * processes. Try to unmap it here.
>  		 */
> -		if (page_mapped(page) && mapping) {
> +		if (page_mapped(page) && (mapping || lazyfree)) {
>  			switch (ret = try_to_unmap(page, lazyfree ?
>  				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
>  				(ttu_flags | TTU_BATCH_FLUSH))) {
> @@ -1145,7 +1154,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			case SWAP_MLOCK:
>  				goto cull_mlocked;
>  			case SWAP_LZFREE:
> -				goto lazyfree;
> +				/* follow __remove_mapping for reference */
> +				if (page_ref_freeze(page, 1)) {
> +					if (!PageDirty(page))
> +						goto lazyfree;
> +					else
> +						page_ref_unfreeze(page, 1);
> +				}
> +				goto keep_locked;
>  			case SWAP_SUCCESS:
>  				; /* try to free the page below */
>  			}
> @@ -1257,10 +1273,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -lazyfree:
>  		if (!mapping || !__remove_mapping(mapping, page, true))
>  			goto keep_locked;
> -
> +lazyfree:
>  		/*
>  		 * At this point, we have no other references and there is
>  		 * no way to pick any more up (removed from LRU, removed
> @@ -1285,6 +1300,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  cull_mlocked:
>  		if (PageSwapCache(page))
>  			try_to_free_swap(page);
> +		if (lazyfree)
> +			SetPageSwapBacked(page);
>  		unlock_page(page);
>  		list_add(&page->lru, &ret_pages);
>  		continue;
> @@ -1294,6 +1311,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
>  			try_to_free_swap(page);
>  		VM_BUG_ON_PAGE(PageActive(page), page);
> +		if (lazyfree)
> +			SetPageSwapBacked(page);
>  		SetPageActive(page);
>  		pgactivate++;
>  keep_locked:
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 5/7] mm: add vmstat account for MADV_FREE pages
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10 13:27     ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-10 13:27 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri 03-02-17 15:33:21, Shaohua Li wrote:
> Show MADV_FREE pages info in proc/sysfs files.

How are we going to use this information? Why isn't it sufficient to
watch for lazyfree events? I mean, this adds quite some code and it is
not clear (at least from the changelog) why we need this information.
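
For comparison, watching the counter from userspace is trivial once the
patch is in - a sketch, assuming the nr_lazyfree name added by this patch:

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "nr_lazyfree ", 12))
                        printf("%s", line);     /* value is in pages, not kB */
        fclose(f);
        return 0;
}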

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/base/node.c       |  2 ++
>  fs/proc/meminfo.c         |  1 +
>  include/linux/mm_inline.h | 31 ++++++++++++++++++++++++++++---
>  include/linux/mmzone.h    |  2 ++
>  mm/page_alloc.c           |  7 +++++--
>  mm/vmscan.c               |  9 +++++++--
>  mm/vmstat.c               |  2 ++
>  7 files changed, 47 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 5548f96..9138db8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -71,6 +71,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  		       "Node %d Active(file):   %8lu kB\n"
>  		       "Node %d Inactive(file): %8lu kB\n"
>  		       "Node %d Unevictable:    %8lu kB\n"
> +		       "Node %d LazyFree:       %8lu kB\n"
>  		       "Node %d Mlocked:        %8lu kB\n",
>  		       nid, K(i.totalram),
>  		       nid, K(i.freeram),
> @@ -84,6 +85,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  		       nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
>  		       nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
>  		       nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +		       nid, K(node_page_state(pgdat, NR_LAZYFREE)),
>  		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
>  
>  #ifdef CONFIG_HIGHMEM
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 8a42849..b2e7b31 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -80,6 +80,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  	show_val_kb(m, "Active(file):   ", pages[LRU_ACTIVE_FILE]);
>  	show_val_kb(m, "Inactive(file): ", pages[LRU_INACTIVE_FILE]);
>  	show_val_kb(m, "Unevictable:    ", pages[LRU_UNEVICTABLE]);
> +	show_val_kb(m, "LazyFree:       ", global_node_page_state(NR_LAZYFREE));
>  	show_val_kb(m, "Mlocked:        ", global_page_state(NR_MLOCK));
>  
>  #ifdef CONFIG_HIGHMEM
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index fdded06..3e496de 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -48,25 +48,50 @@ static __always_inline void update_lru_size(struct lruvec *lruvec,
>  #endif
>  }
>  
> +static __always_inline void __update_lazyfree_size(struct lruvec *lruvec,
> +				enum zone_type zid, int nr_pages)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +
> +	__mod_node_page_state(pgdat, NR_LAZYFREE, nr_pages);
> +	__mod_zone_page_state(&pgdat->node_zones[zid], NR_ZONE_LAZYFREE,
> +				nr_pages);
> +}
> +
>  static __always_inline void add_page_to_lru_list(struct page *page,
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
> -	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
> +	enum zone_type zid = page_zonenum(page);
> +	int nr_pages = hpage_nr_pages(page);
> +
> +	if (lru == LRU_INACTIVE_FILE && page_is_lazyfree(page))
> +		__update_lazyfree_size(lruvec, zid, nr_pages);
> +	update_lru_size(lruvec, lru, zid, nr_pages);
>  	list_add(&page->lru, &lruvec->lists[lru]);
>  }
>  
>  static __always_inline void add_page_to_lru_list_tail(struct page *page,
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
> -	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
> +	enum zone_type zid = page_zonenum(page);
> +	int nr_pages = hpage_nr_pages(page);
> +
> +	if (lru == LRU_INACTIVE_FILE && page_is_lazyfree(page))
> +		__update_lazyfree_size(lruvec, zid, nr_pages);
> +	update_lru_size(lruvec, lru, zid, nr_pages);
>  	list_add_tail(&page->lru, &lruvec->lists[lru]);
>  }
>  
>  static __always_inline void del_page_from_lru_list(struct page *page,
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
> +	enum zone_type zid = page_zonenum(page);
> +	int nr_pages = hpage_nr_pages(page);
> +
>  	list_del(&page->lru);
> -	update_lru_size(lruvec, lru, page_zonenum(page), -hpage_nr_pages(page));
> +	if (lru == LRU_INACTIVE_FILE && page_is_lazyfree(page))
> +		__update_lazyfree_size(lruvec, zid, -nr_pages);
> +	update_lru_size(lruvec, lru, zid, -nr_pages);
>  }
>  
>  /**
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 338a786a..78985f1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -118,6 +118,7 @@ enum zone_stat_item {
>  	NR_ZONE_INACTIVE_FILE,
>  	NR_ZONE_ACTIVE_FILE,
>  	NR_ZONE_UNEVICTABLE,
> +	NR_ZONE_LAZYFREE,
>  	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
>  	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
>  	NR_SLAB_RECLAIMABLE,
> @@ -147,6 +148,7 @@ enum node_stat_item {
>  	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
>  	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
> +	NR_LAZYFREE,		/*  "     "     "   "       "         */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
>  	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
>  	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 11b4cd4..d0ff8c2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4453,7 +4453,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
>  		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
>  		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
> -		" free:%lu free_pcp:%lu free_cma:%lu\n",
> +		" free:%lu free_pcp:%lu free_cma:%lu lazy_free:%lu\n",
>  		global_node_page_state(NR_ACTIVE_ANON),
>  		global_node_page_state(NR_INACTIVE_ANON),
>  		global_node_page_state(NR_ISOLATED_ANON),
> @@ -4472,7 +4472,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  		global_page_state(NR_BOUNCE),
>  		global_page_state(NR_FREE_PAGES),
>  		free_pcp,
> -		global_page_state(NR_FREE_CMA_PAGES));
> +		global_page_state(NR_FREE_CMA_PAGES),
> +		global_node_page_state(NR_LAZYFREE));
>  
>  	for_each_online_pgdat(pgdat) {
>  		if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
> @@ -4484,6 +4485,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			" active_file:%lukB"
>  			" inactive_file:%lukB"
>  			" unevictable:%lukB"
> +			" lazy_free:%lukB"
>  			" isolated(anon):%lukB"
>  			" isolated(file):%lukB"
>  			" mapped:%lukB"
> @@ -4506,6 +4508,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
>  			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
>  			K(node_page_state(pgdat, NR_UNEVICTABLE)),
> +			K(node_page_state(pgdat, NR_LAZYFREE)),
>  			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
>  			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
>  			K(node_page_state(pgdat, NR_FILE_MAPPED)),
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b304a84..1a98467 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1442,7 +1442,8 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>   * be complete before mem_cgroup_update_lru_size due to a santity check.
>   */
>  static __always_inline void update_lru_sizes(struct lruvec *lruvec,
> -			enum lru_list lru, unsigned long *nr_zone_taken)
> +			enum lru_list lru, unsigned long *nr_zone_taken,
> +			unsigned long *nr_zone_lazyfree)
>  {
>  	int zid;
>  
> @@ -1450,6 +1451,7 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
>  		if (!nr_zone_taken[zid])
>  			continue;
>  
> +		__update_lazyfree_size(lruvec, zid, -nr_zone_lazyfree[zid]);
>  		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
>  #ifdef CONFIG_MEMCG
>  		mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
> @@ -1486,6 +1488,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	struct list_head *src = &lruvec->lists[lru];
>  	unsigned long nr_taken = 0;
>  	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
> +	unsigned long nr_zone_lazyfree[MAX_NR_ZONES] = { 0 };
>  	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
>  	unsigned long skipped = 0, total_skipped = 0;
>  	unsigned long scan, nr_pages;
> @@ -1517,6 +1520,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			nr_pages = hpage_nr_pages(page);
>  			nr_taken += nr_pages;
>  			nr_zone_taken[page_zonenum(page)] += nr_pages;
> +			if (page_is_lazyfree(page))
> +				nr_zone_lazyfree[page_zonenum(page)] += nr_pages;
>  			list_move(&page->lru, dst);
>  			break;
>  
> @@ -1560,7 +1565,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	*nr_scanned = scan + total_skipped;
>  	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>  				    scan, skipped, nr_taken, mode, lru);
> -	update_lru_sizes(lruvec, lru, nr_zone_taken);
> +	update_lru_sizes(lruvec, lru, nr_zone_taken, nr_zone_lazyfree);
>  	return nr_taken;
>  }
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7774196..a70b52d 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -926,6 +926,7 @@ const char * const vmstat_text[] = {
>  	"nr_zone_inactive_file",
>  	"nr_zone_active_file",
>  	"nr_zone_unevictable",
> +	"nr_zone_lazyfree",
>  	"nr_zone_write_pending",
>  	"nr_mlock",
>  	"nr_slab_reclaimable",
> @@ -952,6 +953,7 @@ const char * const vmstat_text[] = {
>  	"nr_inactive_file",
>  	"nr_active_file",
>  	"nr_unevictable",
> +	"nr_lazyfree",
>  	"nr_isolated_anon",
>  	"nr_isolated_file",
>  	"nr_pages_scanned",
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 6/7] proc: show MADV_FREE pages info in smaps
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10 13:30     ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-10 13:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

$DESCRIPTION_OF_YOUR_USECASE_GOES_HERE

Moreover, Documentation/filesystems/proc.txt should be updated as well.

Other than that, the patch looks good to me.
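
For illustration, once the field is there a monitoring tool can sum it per
process with a few lines of userspace C (a sketch that assumes the
"LazyFree:" line added by this patch):

#include <stdio.h>

int main(int argc, char **argv)
{
        char path[64], line[256];
        unsigned long kb, total = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps",
                 argc > 1 ? argv[1] : "self");
        f = fopen(path, "r");
        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "LazyFree: %lu kB", &kb) == 1)
                        total += kb;    /* sum the per-VMA values */
        fclose(f);
        printf("LazyFree total: %lu kB\n", total);
        return 0;
}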

On Fri 03-02-17 15:33:22, Shaohua Li wrote:
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

after the description is added and documentation updated
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/proc/task_mmu.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ee3efb2..8f2423f 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -440,6 +440,7 @@ struct mem_size_stats {
>  	unsigned long private_dirty;
>  	unsigned long referenced;
>  	unsigned long anonymous;
> +	unsigned long lazyfree;
>  	unsigned long anonymous_thp;
>  	unsigned long shmem_thp;
>  	unsigned long swap;
> @@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>  	int i, nr = compound ? 1 << compound_order(page) : 1;
>  	unsigned long size = nr * PAGE_SIZE;
>  
> -	if (PageAnon(page))
> +	if (PageAnon(page)) {
>  		mss->anonymous += size;
> +		if (!PageSwapBacked(page))
> +			mss->lazyfree += size;
> +	}
>  
>  	mss->resident += size;
>  	/* Accumulate the size in pages that have been accessed. */
> @@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   "Private_Dirty:  %8lu kB\n"
>  		   "Referenced:     %8lu kB\n"
>  		   "Anonymous:      %8lu kB\n"
> +		   "LazyFree:       %8lu kB\n"
>  		   "AnonHugePages:  %8lu kB\n"
>  		   "ShmemPmdMapped: %8lu kB\n"
>  		   "Shared_Hugetlb: %8lu kB\n"
> @@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   mss.private_dirty >> 10,
>  		   mss.referenced >> 10,
>  		   mss.anonymous >> 10,
> +		   mss.lazyfree >> 10,
>  		   mss.anonymous_thp >> 10,
>  		   mss.shmem_thp >> 10,
>  		   mss.shared_hugetlb >> 10,
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-10 13:35     ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-10 13:35 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri 03-02-17 15:33:23, Shaohua Li wrote:
> Add a separate RSS for MADV_FREE pages. The pages are charged into
> MM_ANONPAGES (because they are mapped anon pages) and also charged into
> the MM_LAZYFREEPAGES. /proc/pid/statm will have an extra field to
> display the RSS, which userspace can use to determine the RSS excluding
> MADV_FREE pages.
> 
> The basic idea is to increment the RSS in madvise and decrement in unmap
> or page reclaim. There is one limitation. If a page is shared by two
> processes, since madvise only has the mm context of the current process, it isn't
> convenient to charge the RSS for both processes. So we don't charge the
> RSS if the mapcount isn't 1. On the other hand, fork can make a
> MADV_FREE page shared by two processes. To make things consistent, we
> uncharge the RSS from the source mm in fork.
> 
> A new flag is added to indicate if a page is accounted into the RSS. We
> can't use the SwapBacked flag for the determination because we can't
> guarantee the page has the SwapBacked flag cleared in madvise. We are
> reusing the mappedtodisk flag, which should not be set for anon pages.
> 
> There are a couple of other places we need to uncharge the RSS,
> activate_page and mark_page_accessed. activate_page is used by swap,
> where MADV_FREE pages are already not in lazyfree state before going
> into swap. mark_page_accessed is mainly used for file pages, but there
> are several places it's used by anonymous pages. I fixed gup, but not
> some gpu drivers and kvm. If the drivers use MADV_FREE, we might have
> imprecise RSS accounting.
> 
> Please note, the accounting is never going to be precise. A MADV_FREE
> page could be written by userspace without notifying the kernel. Such a
> page can't be reclaimed like other clean lazyfree pages and isn't really
> a lazyfree page anymore. But since the kernel isn't aware of this, the
> page is still accounted as lazyfree, so the accounting can be incorrect.

This is all quite complex and, as you say, imprecise already. From the
description it is not even clear why we need it at all. Why is
/proc/<pid>/smaps insufficient? I am also not a fan of a new page flag -
even though you managed to recycle an existing one, which is a plus.

Thanks
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-10  6:50     ` Minchan Kim
@ 2017-02-10 17:30       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-10 17:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 03:50:22PM +0900, Minchan Kim wrote:
> Hi Shaohua,

Thanks for your time!
 
> On Fri, Feb 03, 2017 at 03:33:18PM -0800, Shaohua Li wrote:
> > Userspace indicates MADV_FREE pages could be freed without pageout, so
> > they behave pretty much like used-once file pages. For such pages, we'd
> > like to reclaim them once there is memory pressure. Also it might be
> > unfair to always reclaim MADV_FREE pages before used-once file pages,
> > and we definitely want to reclaim the pages before other anonymous and
> > file pages.
> > 
> > To speed up MADV_FREE page reclaim, we put the pages into the
> > LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list
> > is tiny nowadays and should be full of used-once file pages. Reclaiming
> > MADV_FREE pages will not interfere much with anonymous and active file
> > pages. And since the inactive file pages and MADV_FREE pages are
> > reclaimed according to their age, we don't reclaim too many MADV_FREE
> > pages either. Putting the MADV_FREE pages into the LRU_INACTIVE_FILE list
> > also means we can reclaim the pages without swap support. This idea is
> > suggested by Johannes.
> > 
> > We also clear the pages' SwapBacked flag to indicate they are MADV_FREE
> > pages.
> 
> I think this patch should be merged with 3/7. Otherwise, MADV_FREE will
> be broken during the bisect.

Maybe I should move patch 3 ahead; then we won't break bisect and the
patches still stay clear.

> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  include/linux/mm_inline.h     |  5 +++++
> >  include/linux/swap.h          |  2 +-
> >  include/linux/vm_event_item.h |  2 +-
> >  mm/huge_memory.c              |  5 ++---
> >  mm/madvise.c                  |  3 +--
> >  mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
> >  mm/vmstat.c                   |  1 +
> >  7 files changed, 39 insertions(+), 29 deletions(-)
> > 
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index e030a68..fdded06 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -22,6 +22,11 @@ static inline int page_is_file_cache(struct page *page)
> >  	return !PageSwapBacked(page);
> >  }
> >  
> > +static inline bool page_is_lazyfree(struct page *page)
> > +{
> > +	return PageAnon(page) && !PageSwapBacked(page);
> > +}
> > +
> 
> trivial:
> 
> How about using PageLazyFree for consistency with the other PageXXX helpers?
> Also, use SetPageLazyFree/ClearPageLazyFree rather than the raw
> {Set,Clear}PageSwapBacked.

So SetPageLazyFree would just be ClearPageSwapBacked, which would be weird. I
personally prefer directly using {Set,Clear}PageSwapBacked, because the reader
immediately knows what's happening. With PageLazyFree, people would always
need to go back to the code and check the relationship between PageLazyFree
and PageSwapBacked.
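
Just to illustrate the point (hypothetical helpers, not something I plan to
add; the names are taken from your suggestion), the wrappers would boil down
to:

static inline void SetPageLazyFree(struct page *page)
{
	/* "set lazyfree" is really just clearing SwapBacked on an anon page */
	ClearPageSwapBacked(page);
}

static inline void ClearPageLazyFree(struct page *page)
{
	SetPageSwapBacked(page);
}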
 
> >  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> >  				enum lru_list lru, enum zone_type zid,
> >  				int nr_pages)
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 45e91dd..486494e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
> >  extern void lru_add_drain_all(void);
> >  extern void rotate_reclaimable_page(struct page *page);
> >  extern void deactivate_file_page(struct page *page);
> > -extern void deactivate_page(struct page *page);
> > +extern void mark_page_lazyfree(struct page *page);
> 
> trivial:
> 
> How about "deactivate_lazyfree_page"? IMO, it would show the intention
> more clearly: move the lazy free page to the inactive list.
> 
> It's just a matter of preference, so I'm not strongly against it.

Yes, I thought about the name a bit. I don't think we should use 'deactivate',
because it sounds like it only works for active pages, while the function
works for both active and inactive pages. I'm open to any suggestions.

> >  extern void swap_setup(void);
> >  
> >  extern void add_page_to_unevictable_list(struct page *page);
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 6aa1b6c..94e58da 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		FOR_ALL_ZONES(PGALLOC),
> >  		FOR_ALL_ZONES(ALLOCSTALL),
> >  		FOR_ALL_ZONES(PGSCAN_SKIP),
> > -		PGFREE, PGACTIVATE, PGDEACTIVATE,
> > +		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
> >  		PGFAULT, PGMAJFAULT,
> >  		PGLAZYFREED,
> >  		PGREFILL,
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index ecf569d..ddb9a94 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1391,9 +1391,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		ClearPageDirty(page);
> >  	unlock_page(page);
> >  
> > -	if (PageActive(page))
> > -		deactivate_page(page);
> > -
> >  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> >  		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> >  			tlb->fullmm);
> > @@ -1404,6 +1401,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >  		set_pmd_at(mm, addr, pmd, orig_pmd);
> >  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> >  	}
> > +
> > +	mark_page_lazyfree(page);
> >  	ret = true;
> >  out:
> >  	spin_unlock(ptl);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index c867d88..c24549e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -378,10 +378,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  			ptent = pte_mkclean(ptent);
> >  			ptent = pte_wrprotect(ptent);
> >  			set_pte_at(mm, addr, pte, ptent);
> > -			if (PageActive(page))
> > -				deactivate_page(page);
> >  			tlb_remove_tlb_entry(tlb, pte, addr);
> >  		}
> > +		mark_page_lazyfree(page);
> >  	}
> >  out:
> >  	if (nr_swap) {
> > diff --git a/mm/swap.c b/mm/swap.c
> > index c4910f1..69a7e9d 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -46,7 +46,7 @@ int page_cluster;
> >  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
> >  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
> >  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> > -static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> > +static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
> >  #ifdef CONFIG_SMP
> >  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
> >  #endif
> > @@ -268,6 +268,11 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
> >  		int lru = page_lru_base_type(page);
> >  
> >  		del_page_from_lru_list(page, lruvec, lru);
> > +		if (page_is_lazyfree(page)) {
> > +			SetPageSwapBacked(page);
> > +			file = 0;
> 
> I don't see why you set file with 0. Could you explain the rationale?

We are moving the page back to the active anonymous list, so I'd like to
charge the recent_scanned and recent_rotated statistics to the anonymous side.
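
For reference, this is roughly what that 'file' value feeds into (paraphrasing
update_page_reclaim_stat() in mm/swap.c from memory, only to make the 0-vs-1
charging explicit; not a verbatim copy):

static void update_page_reclaim_stat(struct lruvec *lruvec,
				     int file, int rotated)
{
	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

	/* file == 0 charges the anon buckets, file == 1 the file buckets */
	reclaim_stat->recent_scanned[file]++;
	if (rotated)
		reclaim_stat->recent_rotated[file]++;
}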

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-10 13:02     ` Michal Hocko
@ 2017-02-10 17:33       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-10 17:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 02:02:36PM +0100, Michal Hocko wrote:
> On Fri 03-02-17 15:33:18, Shaohua Li wrote:
> > Userspace indicates MADV_FREE pages could be freed without pageout, so
> > they are pretty much like used-once file pages. For such pages, we'd
> > like to reclaim them once there is memory pressure. Also it might be
> > unfair to always reclaim MADV_FREE pages before used-once file pages,
> > and we definitely want to reclaim the pages before other anonymous and
> > file pages.
> > 
> > To speed up MADV_FREE page reclaim, we put the pages into the
> > LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
> > list is tiny nowadays and should be full of used-once file pages.
> > Reclaiming MADV_FREE pages will not interfere much with anonymous and
> > active file pages. And the inactive file pages and MADV_FREE pages will
> > be reclaimed according to their age, so we don't reclaim too many
> > MADV_FREE pages either. Putting the MADV_FREE pages into the
> > LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
> > support. This idea was suggested by Johannes.
> > 
> > We also clear the pages SwapBacked flag to indicate they are MADV_FREE
> > pages.
> 
> I like this. I had expected this to be more convoluted, but it looks
> quite straightforward. I didn't get to do a really deep review and add
> my acked-by but from a quick look there do not seem to be any surprises.
> I was worried about vmstat accounting. There are some places which
> isolate page from LRU and account based on the LRU and later use
> page_is_file_cache to tell which LRU this was. This should work fine,
> though, because you never touch pages which are off-lru.
> 
> That being said I do not see any major issues. There might be some minor
> things and this will need a lot of testing but it is definitely a move
> in the right direction. I hope to do a deeper review after I get back
> from vacation (20th Feb).

Sweet! Thanks for your time!

> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> I guess
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> would be appropriate.

Sure, I will add it in the next post, and will also note that the patches are
based on Minchan's patches.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 3/7] mm: reclaim MADV_FREE pages
  2017-02-10  6:58     ` Minchan Kim
@ 2017-02-10 17:43       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-10 17:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 03:58:39PM +0900, Minchan Kim wrote:
> On Fri, Feb 03, 2017 at 03:33:19PM -0800, Shaohua Li wrote:
> > When memory pressure is high, we free MADV_FREE pages. If the pages are
> > not dirty in the pte, the pages can be freed immediately. Otherwise we
> > can't reclaim them. We put the pages back on the anonymous LRU list (by
> > setting the SwapBacked flag) and the pages will be reclaimed in the
> > normal swapout way.
> > 
> > We use normal page reclaim policy. Since MADV_FREE pages are put into
> > inactive file list, such pages and inactive file pages are reclaimed
> > according to their age. This is expected, because we don't want to
> > reclaim too many MADV_FREE pages before used once pages.
> > 
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  mm/rmap.c   |  4 ++++
> >  mm/vmscan.c | 43 +++++++++++++++++++++++++++++++------------
> >  2 files changed, 35 insertions(+), 12 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index c8d6204..5f05926 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1554,6 +1554,10 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			dec_mm_counter(mm, MM_ANONPAGES);
> >  			rp->lazyfreed++;
> >  			goto discard;
> > +		} else if (flags & TTU_LZFREE) {
> > +			set_pte_at(mm, address, pte, pteval);
> > +			ret = SWAP_FAIL;
> > +			goto out_unmap;
> 
> trivial:
> 
> How about this?
> 
> if (flags & TTU_LZFREE) {
> 	if (PageDirty(page)) {
> 		set_pte_at(XXX);
> 		ret = SWAP_FAIL;
> 		goto out_unmap;
> 	} else {
> 		dec_mm_counter(mm, MM_ANONPAGES);
> 		rp->lazyfreed++;
> 		goto discard;
> 	}
> }
ok
 
> >  		}
> >  
> >  		if (swap_duplicate(entry) < 0) {
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 947ab6f..b304a84 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -864,7 +864,7 @@ static enum page_references page_check_references(struct page *page,
> >  		return PAGEREF_RECLAIM;
> >  
> >  	if (referenced_ptes) {
> > -		if (PageSwapBacked(page))
> > +		if (PageSwapBacked(page) || PageAnon(page))
> 
> If anyone accesses an MADV_FREEed range with a load op, not a store,
> why shouldn't we discard those pages?

I don't have a strong opinion about this; userspace probably shouldn't do that.
I'm OK with deleting it if you insist.

> >  			return PAGEREF_ACTIVATE;
> >  		/*
> >  		 * All mapped pages start out with page table
> > @@ -903,7 +903,7 @@ static enum page_references page_check_references(struct page *page,
> >  
> >  /* Check if a page is dirty or under writeback */
> >  static void page_check_dirty_writeback(struct page *page,
> > -				       bool *dirty, bool *writeback)
> > +			bool *dirty, bool *writeback, bool lazyfree)
> >  {
> >  	struct address_space *mapping;
> >  
> > @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
> >  	 * Anonymous pages are not handled by flushers and must be written
> >  	 * from reclaim context. Do not stall reclaim based on them
> >  	 */
> > -	if (!page_is_file_cache(page)) {
> > +	if (!page_is_file_cache(page) || lazyfree) {
> 
> trivial:
> 
> We can check it with PageLazyFree here rather than passing a lazyfree
> argument. That's consistent with page_is_file_cache here.

ok 
> >  		*dirty = false;
> >  		*writeback = false;
> >  		return;
> > @@ -971,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		int may_enter_fs;
> >  		enum page_references references = PAGEREF_RECLAIM_CLEAN;
> >  		bool dirty, writeback;
> > -		bool lazyfree = false;
> > +		bool lazyfree;
> >  		int ret = SWAP_SUCCESS;
> >  
> >  		cond_resched();
> > @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  
> >  		sc->nr_scanned++;
> >  
> > +		lazyfree = page_is_lazyfree(page);
> > +
> >  		if (unlikely(!page_evictable(page)))
> >  			goto cull_mlocked;
> >  
> > @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			goto keep_locked;
> >  
> >  		/* Double the slab pressure for mapped and swapcache pages */
> > -		if (page_mapped(page) || PageSwapCache(page))
> > +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
> >  			sc->nr_scanned++;
> 
> In this phase, we cannot know whether a lazyfree-marked page is discardable
> or not. If it is freeable and mapped, this logic makes sense. However, what
> if the page is dirty?

I think this doesn't matter. If the page is dirty, it will go through reclaim
in the next round and be swapped out; at that time, we will bump nr_scanned
there.

> >  
> >  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > @@ -1005,7 +1007,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		 * will stall and start writing pages if the tail of the LRU
> >  		 * is all dirty unqueued pages.
> >  		 */
> > -		page_check_dirty_writeback(page, &dirty, &writeback);
> > +		page_check_dirty_writeback(page, &dirty, &writeback, lazyfree);
> >  		if (dirty || writeback)
> >  			nr_dirty++;
> >  
> > @@ -1107,6 +1109,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			; /* try to reclaim the page below */
> >  		}
> >  
> > +		/* lazyfree page could be freed directly */
> > +		if (lazyfree) {
> > +			if (unlikely(PageTransHuge(page)) &&
> > +			    split_huge_page_to_list(page, page_list))
> > +				goto keep_locked;
> > +			goto unmap_page;
> > +		}
> > +
> 
> Maybe, we can remove this hunk. Instead add lazyfree check in here.
> 
> 		if (PageAnon(page) && !PageSwapCache(page) && !lazyfree) {
> 			if (!(sc->gfp_mask & __GFP_IO))
ok

Thanks,
Shaohua 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 5/7] mm: add vmstat account for MADV_FREE pages
  2017-02-10 13:27     ` Michal Hocko
@ 2017-02-10 17:50       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-10 17:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 02:27:27PM +0100, Michal Hocko wrote:
> On Fri 03-02-17 15:33:21, Shaohua Li wrote:
> > Show MADV_FREE pages info in proc/sysfs files.
> 
> How are we going to use this information? Why isn't it sufficient to
> watch for the lazyfree events? I mean, this adds quite some code and it is
> not clear (at least from the changelog) why we need this information.

It's just like any other meminfo we have added to let users know what happens
in the system. Users can use the info for monitoring/diagnosing. The
lazyfree/lazyfreed events can't reflect the lazyfree page info, because
'lazyfree - lazyfreed' doesn't equal the current number of lazyfree pages and
the events aren't per-node. I'll add more description in the changelog.
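
To illustrate the drift (a minimal userspace sketch, not part of the series):
a store to an MADV_FREEed page quietly turns it back into a normal anon page
without any corresponding event, so a cumulative 'lazyfree - lazyfreed'
difference over-counts the pages that are currently lazyfree:

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* Linux-specific; older libc headers may not define it */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	memset(p, 1, len);		/* fault the anonymous pages in */
	madvise(p, len, MADV_FREE);	/* pages become lazyfree */
	memset(p, 2, len);		/* pages silently become normal anon again */
	return 0;
}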

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 6/7] proc: show MADV_FREE pages info in smaps
  2017-02-10 13:30     ` Michal Hocko
@ 2017-02-10 17:52       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-10 17:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 02:30:40PM +0100, Michal Hocko wrote:
> $DESCRIPTION_OF_YOUR_USECASE_GOES_HERE
> 
> Moreover Documentation/filesystems/proc.txt should be updated as well.
> 
> Other than that, the patch looks good to me.

OK, I will add more description and document it in proc.txt. I don't have a
solid use case for this though. It's consistent with other info we export to
userspace, and is mostly for diagnostic purposes.
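
For what it's worth, this is the kind of thing a monitoring tool could do with
it - a minimal sketch that sums the LazyFree: lines this patch adds to
/proc/<pid>/smaps:

#include <stdio.h>

static long lazyfree_kb(const char *pid)
{
	char path[64], line[256];
	long total = 0, kb;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "LazyFree: %ld kB", &kb) == 1)
			total += kb;
	fclose(f);
	return total;
}

int main(int argc, char **argv)
{
	printf("LazyFree: %ld kB\n", lazyfree_kb(argc > 1 ? argv[1] : "self"));
	return 0;
}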

Thanks,
Shaohua
 
> On Fri 03-02-17 15:33:22, Shaohua Li wrote:
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> 
> after the description is added and documentation updated
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> > ---
> >  fs/proc/task_mmu.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index ee3efb2..8f2423f 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -440,6 +440,7 @@ struct mem_size_stats {
> >  	unsigned long private_dirty;
> >  	unsigned long referenced;
> >  	unsigned long anonymous;
> > +	unsigned long lazyfree;
> >  	unsigned long anonymous_thp;
> >  	unsigned long shmem_thp;
> >  	unsigned long swap;
> > @@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >  	int i, nr = compound ? 1 << compound_order(page) : 1;
> >  	unsigned long size = nr * PAGE_SIZE;
> >  
> > -	if (PageAnon(page))
> > +	if (PageAnon(page)) {
> >  		mss->anonymous += size;
> > +		if (!PageSwapBacked(page))
> > +			mss->lazyfree += size;
> > +	}
> >  
> >  	mss->resident += size;
> >  	/* Accumulate the size in pages that have been accessed. */
> > @@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
> >  		   "Private_Dirty:  %8lu kB\n"
> >  		   "Referenced:     %8lu kB\n"
> >  		   "Anonymous:      %8lu kB\n"
> > +		   "LazyFree:       %8lu kB\n"
> >  		   "AnonHugePages:  %8lu kB\n"
> >  		   "ShmemPmdMapped: %8lu kB\n"
> >  		   "Shared_Hugetlb: %8lu kB\n"
> > @@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
> >  		   mss.private_dirty >> 10,
> >  		   mss.referenced >> 10,
> >  		   mss.anonymous >> 10,
> > +		   mss.lazyfree >> 10,
> >  		   mss.anonymous_thp >> 10,
> >  		   mss.shmem_thp >> 10,
> >  		   mss.shared_hugetlb >> 10,
> > -- 
> > 2.9.3
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-10 13:35     ` Michal Hocko
@ 2017-02-10 18:01       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-10 18:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 02:35:05PM +0100, Michal Hocko wrote:
> On Fri 03-02-17 15:33:23, Shaohua Li wrote:
> > Add a separate RSS for MADV_FREE pages. The pages are charged into
> > MM_ANONPAGES (because they are mapped anon pages) and also charged into
> > the MM_LAZYFREEPAGES. /proc/pid/statm will have an extra field to
> > display the RSS, which userspace can use to determine the RSS excluding
> > MADV_FREE pages.
> > 
> > The basic idea is to increment the RSS in madvise and decrement in unmap
> > or page reclaim. There is one limitation. If a page is shared by two
> > processes, since madvise only has mm context of current process, it isn't
> > convenient to charge the RSS for both processes. So we don't charge the
> > RSS if the mapcount isn't 1. On the other hand, fork can make a
> > MADV_FREE page shared by two processes. To make things consistent, we
> > uncharge the RSS from the source mm in fork.
> > 
> > A new flag is added to indicate if a page is accounted into the RSS. We
> > can't use SwapBacked flag to do the determination because we can't
> > guarantee the page has SwapBacked flag cleared in madvise. We are
> > reusing mappedtodisk flag which should not be set for Anon pages.
> > 
> > There are a couple of other places we need to uncharge the RSS,
> > activate_page and mark_page_accessed. activate_page is used by swap,
> > where MADV_FREE pages are already not in lazyfree state before going
> > into swap. mark_page_accessed is mainly used for file pages, but there
> > are several places it's used by anonymous pages. I fixed gup, but not
> > some gpu drivers and kvm. If the drivers use MADV_FREE, we might have
> > imprecise RSS accounting.
> > 
> > Please note, the accounting is never going to be precise. MADV_FREE page
> > could be written by userspace without notification to the kernel. The
> > page can't be reclaimed like other clean lazyfree pages. The page isn't
> > real lazyfree page. But since kernel isn't aware of this, the page is
> > still accounted as lazyfree, thus the accounting could be incorrect.
> 
> This is all quite complex and, as you say, already imprecise. From the
> description it is not even clear why we need it at all. Why is
> /proc/<pid>/smaps insufficient? I am also not a fan of a new page flag -
> even though you managed to recycle an existing one, which is a plus.

We have a monitoring app running in the system that checks other apps' RSS and
kills them if the RSS is abnormal. Checking /proc/pid/smaps is too complicated
and slow; I don't think we can go that way. Yes, the accounting isn't precise,
but it should be much better than exporting nothing to userspace.
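
Roughly what our monitor would do with it (a sketch only; it assumes the new
count is appended as an eighth field after the existing seven
/proc/<pid>/statm fields, which the patch description doesn't spell out):

#include <stdio.h>

int main(void)
{
	/* size resident shared text lib data dt [lazyfree], all in pages */
	unsigned long size, resident, shared, text, lib, data, dt;
	unsigned long lazyfree = 0;	/* stays 0 on kernels without the field */
	FILE *f = fopen("/proc/self/statm", "r");

	if (!f)
		return 1;
	fscanf(f, "%lu %lu %lu %lu %lu %lu %lu %lu",
	       &size, &resident, &shared, &text, &lib, &data, &dt, &lazyfree);
	fclose(f);
	printf("rss excluding MADV_FREE: %lu pages\n", resident - lazyfree);
	return 0;
}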

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-10 17:30       ` Shaohua Li
@ 2017-02-13  4:57         ` Minchan Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2017-02-13  4:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

Hi Shaohua,

On Fri, Feb 10, 2017 at 09:30:09AM -0800, Shaohua Li wrote:

< snip >

> > > +static inline bool page_is_lazyfree(struct page *page)
> > > +{
> > > +	return PageAnon(page) && !PageSwapBacked(page);
> > > +}
> > > +
> > 
> > trivial:
> > 
> > How about using PageLazyFree for consistency with other PageXXX?
> > As well, use SetPageLazyFree/ClearPageLazyFree rather than using
> > raw {Set,Clear}PageSwapBacked.
> 
> So SetPageLazyFree would just be ClearPageSwapBacked, which would be weird. I
> personally prefer directly using {Set,Clear}PageSwapBacked, because the reader
> immediately knows what's happening. With PageLazyFree, people would always
> need to go back to the code and check the relationship between PageLazyFree
> and PageSwapBacked.

I was not against it, so I was about to send "No problem" just now, but then I
found your patch 5, which accounts lazyfree pages in the zone/node stats and
handles them in LRU list management. Hmm, now that we don't handle lazyfree
pages with a separate LRU list, that feels a bit awkward to me although it may
work. So my idea is that we can handle it through a wrapper regardless of the
LRU management.

For instance,

void SetLazyFreePage(struct page *page)
{
	if (TestClearPageSwapBacked(page))
		inc_zone_page_state(page, NR_ZONE_LAZYFREE);
}


void ClearLazyFreePage(struct page *page)
{
	if (!TestSetPageSwapBacked(page))
		dec_zone_page_state(page, NR_ZONE_LAZYFREE);
}

madvise_free_pte_range:
	SetLazyFreePage(page);

activate_page,shrink_page_list:
	ClearLazyFreePage(page);

free_pages_prepare:
	if (PageMappingFlags(page)) {
		if (PageLazyFreePage(page))
			dec_zone_page_state(page, NR_ZONE_LAZYFREE);
		page->mapping = NULL;
	}

Surely, it's an orthogonal issue regardless of whether we use a wrapper, but it
might nudge you toward using the wrapper.

>  
> > >  static __always_inline void __update_lru_size(struct lruvec *lruvec,
> > >  				enum lru_list lru, enum zone_type zid,
> > >  				int nr_pages)
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 45e91dd..486494e 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
> > >  extern void lru_add_drain_all(void);
> > >  extern void rotate_reclaimable_page(struct page *page);
> > >  extern void deactivate_file_page(struct page *page);
> > > -extern void deactivate_page(struct page *page);
> > > +extern void mark_page_lazyfree(struct page *page);
> > 
> > trivial:
> > 
> > How about "deactivate_lazyfree_page"? IMO, it would show the intention
> > more clearly: move the lazy free page to the inactive list.
> > 
> > It's just a matter of preference, so I'm not strongly against it.
> 
> Yes, I thought about the name a bit. I don't think we should use 'deactivate',
> because it sounds like it only works for active pages, while the function
> works for both active and inactive pages. I'm open to any suggestions.

Indeed.

I don't have a better idea either, so my last suggestion is
"demote_lazyfree_page". It seems several papers and Wikipedia use *demote* in
LRU management.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 3/7] mm: reclaim MADV_FREE pages
  2017-02-10 17:43       ` Shaohua Li
@ 2017-02-13  5:06         ` Minchan Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2017-02-13  5:06 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 10, 2017 at 09:43:07AM -0800, Shaohua Li wrote:

< snip >

> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 947ab6f..b304a84 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -864,7 +864,7 @@ static enum page_references page_check_references(struct page *page,
> > >  		return PAGEREF_RECLAIM;
> > >  
> > >  	if (referenced_ptes) {
> > > -		if (PageSwapBacked(page))
> > > +		if (PageSwapBacked(page) || PageAnon(page))
> > 
> > If anyone accesses an MADV_FREEed range with a load op, not a store,
> > why shouldn't we discard those pages?
> 
> I don't have a strong opinion about this; userspace probably shouldn't do
> that. I'm OK with deleting it if you insist.

Yes, I prefer removing unnecessary code unless there is some reason for it.

> 
> > >  			return PAGEREF_ACTIVATE;
> > >  		/*
> > >  		 * All mapped pages start out with page table

< snip >

> > > @@ -971,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  		int may_enter_fs;
> > >  		enum page_references references = PAGEREF_RECLAIM_CLEAN;
> > >  		bool dirty, writeback;
> > > -		bool lazyfree = false;
> > > +		bool lazyfree;
> > >  		int ret = SWAP_SUCCESS;
> > >  
> > >  		cond_resched();
> > > @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  
> > >  		sc->nr_scanned++;
> > >  
> > > +		lazyfree = page_is_lazyfree(page);
> > > +
> > >  		if (unlikely(!page_evictable(page)))
> > >  			goto cull_mlocked;
> > >  
> > > @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  			goto keep_locked;
> > >  
> > >  		/* Double the slab pressure for mapped and swapcache pages */
> > > -		if (page_mapped(page) || PageSwapCache(page))
> > > +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
> > >  			sc->nr_scanned++;
> > 
> > In this phase, we cannot know whether a lazyfree-marked page is discardable
> > or not. If it is freeable and mapped, this logic makes sense. However, what
> > if the page is dirty?
> 
> I think this doesn't matter. If the page is dirty, it will go through reclaim
> in the next round and be swapped out; at that time, we will bump nr_scanned
> there.

If the lazyfree page in the LRU comes around to this path again, that's true,
but the page could be freed before that.
Having said that, I don't know how critical it is or what the rationale was
for pushing slab reclaim, so I don't insist on it.

Thanks.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 5/7] mm: add vmstat account for MADV_FREE pages
  2017-02-10 17:50       ` Shaohua Li
@ 2017-02-21  9:43         ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-21  9:43 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

[Sorry for the late reply, I was on vacation last week]

On Fri 10-02-17 09:50:15, Shaohua Li wrote:
> On Fri, Feb 10, 2017 at 02:27:27PM +0100, Michal Hocko wrote:
> > On Fri 03-02-17 15:33:21, Shaohua Li wrote:
> > > Show MADV_FREE pages info in proc/sysfs files.
> > 
> > How are we going to use this information? Why isn't it sufficient to
> > watch for the lazyfree events? I mean, this adds quite some code and it is
> > not clear (at least from the changelog) why we need this information.
> 
> It's just like any other meminfo we added to let users know what happens in
> the system. Users can use the info for monitoring/diagnosing. The
> lazyfree/lazyfreed events can't reflect the lazyfree page info, because
> 'lazyfree - lazyfreed' doesn't equal the number of current lazyfree pages and
> the events aren't per-node. I'll add more description in the changelog.

Well, I would prefer not to add new counters until there is a strong
reason for them. Maybe a tracepoint would be more appropriate for
debugging purposes.
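
Something along these lines would be enough for that - a rough sketch only,
not part of this series: the event name and fields are made up, and the
TRACE_SYSTEM/define_trace.h boilerplate that normally surrounds TRACE_EVENT()
is left out.

#include <linux/tracepoint.h>

/* hypothetical event, fired whenever a page is marked lazyfree */
TRACE_EVENT(mm_lazyfree,

	TP_PROTO(struct page *page),

	TP_ARGS(page),

	TP_STRUCT__entry(
		__field(unsigned long, pfn)
		__field(int, nid)
	),

	TP_fast_assign(
		__entry->pfn = page_to_pfn(page);
		__entry->nid = page_to_nid(page);
	),

	TP_printk("pfn=%lu nid=%d", __entry->pfn, __entry->nid)
);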
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-10 18:01       ` Shaohua Li
@ 2017-02-21  9:45         ` Michal Hocko
  -1 siblings, 0 replies; 62+ messages in thread
From: Michal Hocko @ 2017-02-21  9:45 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, minchan, hughd,
	hannes, riel, mgorman, akpm

On Fri 10-02-17 10:01:02, Shaohua Li wrote:
> On Fri, Feb 10, 2017 at 02:35:05PM +0100, Michal Hocko wrote:
> > On Fri 03-02-17 15:33:23, Shaohua Li wrote:
> > > Add a separate RSS for MADV_FREE pages. The pages are charged into
> > > MM_ANONPAGES (because they are mapped anon pages) and also charged into
> > > the MM_LAZYFREEPAGES. /proc/pid/statm will have an extra field to
> > > display the RSS, which userspace can use to determine the RSS excluding
> > > MADV_FREE pages.
> > > 
> > > The basic idea is to increment the RSS in madvise and decrement in unmap
> > > or page reclaim. There is one limitation. If a page is shared by two
> > > processes, since madvise only has mm context of current process, it isn't
> > > convenient to charge the RSS for both processes. So we don't charge the
> > > RSS if the mapcount isn't 1. On the other hand, fork can make a
> > > MADV_FREE page shared by two processes. To make things consistent, we
> > > uncharge the RSS from the source mm in fork.
> > > 
> > > A new flag is added to indicate if a page is accounted into the RSS. We
> > > can't use SwapBacked flag to do the determination because we can't
> > > guarantee the page has SwapBacked flag cleared in madvise. We are
> > > reusing mappedtodisk flag which should not be set for Anon pages.
> > > 
> > > There are a couple of other places we need to uncharge the RSS,
> > > activate_page and mark_page_accessed. activate_page is used by swap,
> > > where MADV_FREE pages are already not in lazyfree state before going
> > > into swap. mark_page_accessed is mainly used for file pages, but there
> > > are several places it's used by anonymous pages. I fixed gup, but not
> > > some gpu drivers and kvm. If the drivers use MADV_FREE, we might have
> > > imprecise RSS accounting.
> > > 
> > > Please note, the accounting is never going to be precise. MADV_FREE page
> > > could be written by userspace without notification to the kernel. The
> > > page can't be reclaimed like other clean lazyfree pages. The page isn't
> > > real lazyfree page. But since kernel isn't aware of this, the page is
> > > still accounted as lazyfree, thus the accounting could be incorrect.
> > 
> > This is all quite complex and, as you say, imprecise already. From the
> > description it is not even clear why we need it at all. Why is
> > /proc/<pid>/smaps insufficient? I am also not fond of a new page flag -
> > even though you managed to recycle an existing one, which is a plus.
> 
> We have a monitoring app running in the system to check other apps' RSS and
> kill them if the RSS is abnormal. Checking /proc/pid/smaps is too complicated
> and slow; we don't think we can go that way.

Could you be more specific about why "slow" matters?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-22  0:46     ` Minchan Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2017-02-22  0:46 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

Hi Shaohua,

On Fri, Feb 03, 2017 at 03:33:23PM -0800, Shaohua Li wrote:
> Add a separate RSS for MADV_FREE pages. The pages are charged into
> MM_ANONPAGES (because they are mapped anon pages) and also charged into
> the MM_LAZYFREEPAGES. /proc/pid/statm will have an extra field to
> display the RSS, which userspace can use to determine the RSS excluding
> MADV_FREE pages.

I'm not sure statm is the right place. Given the definition of statm and
considering your usecase, it could be the right place, but when I look at
"status", it already shows RssAnon, RssFile and RssShmem, so I thought we
could add RssLazy there. It would be more consistent, if it doesn't add big
overhead.
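
For comparison, the monitor side of the statm variant would just read an
unnamed eighth field. A rough userspace sketch - the field position is only
an assumption from patch 7/7; with /proc/<pid>/status it would be a named
"RssLazy:" line instead:

#include <stdio.h>

/* print rss and lazyfree rss (in pages) from /proc/<pid>/statm */
int main(int argc, char **argv)
{
	char path[64];
	unsigned long v[8] = { 0 };  /* size resident shared text lib data dt lazyfree */
	FILE *f;
	int n;

	snprintf(path, sizeof(path), "/proc/%s/statm", argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	n = fscanf(f, "%lu %lu %lu %lu %lu %lu %lu %lu",
		   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
	fclose(f);
	if (n < 7)
		return 1;
	/* v[7] stays 0 on a kernel without the extra field */
	printf("rss=%lu lazyfree=%lu rss-minus-lazyfree=%lu\n",
	       v[1], v[7], v[1] - v[7]);
	return 0;
}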

> 
> The basic idea is to increment the RSS in madvise and decrement in unmap
> or page reclaim. There is one limitation. If a page is shared by two
> processes, since madvise only has mm context of current process, it isn't
> convenient to charge the RSS for both processes. So we don't charge the
> RSS if the mapcount isn't 1. On the other hand, fork can make a
> MADV_FREE page shared by two processes. To make things consistent, we
> uncharge the RSS from the source mm in fork.

I don't understand why we need a new flag.

What's the problem with handling it like normal anon|file|swapent|shmem?
IOW, we can increase it in madvise context and increase it for the child in
copy_one_pte if the pte is still not dirty, and then decrease it in
zap_pte_range/try_to_unmap_one when they find it dirty or discardable.

Although it's shared by fork, the VM can still discard it if the processes
don't make it dirty.
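
Roughly something like the sketch below - hand-written here to show the
idea, not taken from any patch, and the conditions are the fuzzy part:

/* madvise_free_pte_range(): the page goes lazyfree, charge this mm */
add_mm_counter(mm, MM_LAZYFREEPAGES, 1);

/* copy_one_pte(): charge the child too, but only while the pte is still
 * clean, i.e. while the page is still discardable */
if (PageAnon(page) && !pte_dirty(pte) && !PageDirty(page))
	rss[MM_LAZYFREEPAGES]++;

/* zap_pte_range()/try_to_unmap_one(): drop the charge when the page is
 * found dirty (it stopped being lazyfree) or is discarded; without a
 * per-page flag, knowing whether this mm was ever charged is the
 * fuzzy part */
if (PageAnon(page) && !PageSwapBacked(page))
	rss[MM_LAZYFREEPAGES]--;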

> 
> A new flag is added to indicate if a page is accounted into the RSS. We
> can't use SwapBacked flag to do the determination because we can't
> guarantee the page has SwapBacked flag cleared in madvise. We are
> reusing mappedtodisk flag which should not be set for Anon pages.
> 
> There are a couple of other places we need to uncharge the RSS,
> activate_page and mark_page_accessed. activate_page is used by swap,
> where MADV_FREE pages are already not in lazyfree state before going
> into swap. mark_page_accessed is mainly used for file pages, but there
> are several places it's used by anonymous pages. I fixed gup, but not
> some gpu drivers and kvm. If the drivers use MADV_FREE, we might have
> imprecise RSS accounting.
> 
> Please note, the accounting is never going to be precise. MADV_FREE page
> could be written by userspace without notification to the kernel. The
> page can't be reclaimed like other clean lazyfree pages. The page isn't
> real lazyfree page. But since kernel isn't aware of this, the page is
> still accounted as lazyfree, thus the accounting could be incorrect.

Right. The lazyfree count is not accurate without CoW, where the point would
be to decrease the lazyfree rss count when the store happens, so we might be
tempted to make it CoW at the cost of performance degradation; but even then
it's not accurate without making mark_page_accessed aware of each mm context,
which is the hard part. So, I agree this stat is useful but I don't want to
make it complicated.

Thanks.

> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/proc/array.c            |  9 ++++++---
>  fs/proc/internal.h         |  3 ++-
>  fs/proc/task_mmu.c         |  9 +++++++--
>  fs/proc/task_nommu.c       |  4 +++-
>  include/linux/mm_types.h   |  1 +
>  include/linux/page-flags.h |  6 ++++++
>  mm/gup.c                   |  2 ++
>  mm/huge_memory.c           |  8 ++++++++
>  mm/khugepaged.c            |  2 ++
>  mm/madvise.c               |  5 +++++
>  mm/memory.c                | 13 +++++++++++--
>  mm/migrate.c               |  2 ++
>  mm/oom_kill.c              | 10 ++++++----
>  mm/rmap.c                  |  3 +++
>  14 files changed, 64 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 51a4213..c2281f4 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -583,17 +583,19 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
>  			struct pid *pid, struct task_struct *task)
>  {
>  	unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
> +	unsigned long lazyfree = 0;
>  	struct mm_struct *mm = get_task_mm(task);
>  
>  	if (mm) {
> -		size = task_statm(mm, &shared, &text, &data, &resident);
> +		size = task_statm(mm, &shared, &text, &data, &resident,
> +				  &lazyfree);
>  		mmput(mm);
>  	}
>  	/*
>  	 * For quick read, open code by putting numbers directly
>  	 * expected format is
> -	 * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0\n",
> -	 *               size, resident, shared, text, data);
> +	 * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0 %lu\n",
> +	 *               size, resident, shared, text, data, lazyfree);
>  	 */
>  	seq_put_decimal_ull(m, "", size);
>  	seq_put_decimal_ull(m, " ", resident);
> @@ -602,6 +604,7 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
>  	seq_put_decimal_ull(m, " ", 0);
>  	seq_put_decimal_ull(m, " ", data);
>  	seq_put_decimal_ull(m, " ", 0);
> +	seq_put_decimal_ull(m, " ", lazyfree);
>  	seq_putc(m, '\n');
>  
>  	return 0;
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index e2c3c46..6587b9c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -290,5 +290,6 @@ extern const struct file_operations proc_pagemap_operations;
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
>  				unsigned long *, unsigned long *,
> -				unsigned long *, unsigned long *);
> +				unsigned long *, unsigned long *,
> +				unsigned long *);
>  extern void task_mem(struct seq_file *, struct mm_struct *);
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8f2423f..f18b568 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -23,9 +23,10 @@
>  
>  void task_mem(struct seq_file *m, struct mm_struct *mm)
>  {
> -	unsigned long text, lib, swap, ptes, pmds, anon, file, shmem;
> +	unsigned long text, lib, swap, ptes, pmds, anon, file, shmem, lazyfree;
>  	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>  
> +	lazyfree = get_mm_counter(mm, MM_LAZYFREEPAGES);
>  	anon = get_mm_counter(mm, MM_ANONPAGES);
>  	file = get_mm_counter(mm, MM_FILEPAGES);
>  	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
> @@ -59,6 +60,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  		"RssAnon:\t%8lu kB\n"
>  		"RssFile:\t%8lu kB\n"
>  		"RssShmem:\t%8lu kB\n"
> +		"RssLazyfree:\t%8lu kB\n"
>  		"VmData:\t%8lu kB\n"
>  		"VmStk:\t%8lu kB\n"
>  		"VmExe:\t%8lu kB\n"
> @@ -75,6 +77,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  		anon << (PAGE_SHIFT-10),
>  		file << (PAGE_SHIFT-10),
>  		shmem << (PAGE_SHIFT-10),
> +		lazyfree << (PAGE_SHIFT-10),
>  		mm->data_vm << (PAGE_SHIFT-10),
>  		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
>  		ptes >> 10,
> @@ -90,7 +93,8 @@ unsigned long task_vsize(struct mm_struct *mm)
>  
>  unsigned long task_statm(struct mm_struct *mm,
>  			 unsigned long *shared, unsigned long *text,
> -			 unsigned long *data, unsigned long *resident)
> +			 unsigned long *data, unsigned long *resident,
> +			 unsigned long *lazyfree)
>  {
>  	*shared = get_mm_counter(mm, MM_FILEPAGES) +
>  			get_mm_counter(mm, MM_SHMEMPAGES);
> @@ -98,6 +102,7 @@ unsigned long task_statm(struct mm_struct *mm,
>  								>> PAGE_SHIFT;
>  	*data = mm->data_vm + mm->stack_vm;
>  	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
> +	*lazyfree = get_mm_counter(mm, MM_LAZYFREEPAGES);
>  	return mm->total_vm;
>  }
>  
> diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
> index 1ef97cf..50426de 100644
> --- a/fs/proc/task_nommu.c
> +++ b/fs/proc/task_nommu.c
> @@ -94,7 +94,8 @@ unsigned long task_vsize(struct mm_struct *mm)
>  
>  unsigned long task_statm(struct mm_struct *mm,
>  			 unsigned long *shared, unsigned long *text,
> -			 unsigned long *data, unsigned long *resident)
> +			 unsigned long *data, unsigned long *resident,
> +			 unsigned long *lazyfree)
>  {
>  	struct vm_area_struct *vma;
>  	struct vm_region *region;
> @@ -120,6 +121,7 @@ unsigned long task_statm(struct mm_struct *mm,
>  	size >>= PAGE_SHIFT;
>  	size += *text + *data;
>  	*resident = size;
> +	*lazyfree = 0;
>  	return size;
>  }
>  
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4f6d440..b6a1428 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -376,6 +376,7 @@ enum {
>  	MM_ANONPAGES,	/* Resident anonymous pages */
>  	MM_SWAPENTS,	/* Anonymous swap entries */
>  	MM_SHMEMPAGES,	/* Resident shared memory pages */
> +	MM_LAZYFREEPAGES, /* Lazyfree pages, also charged into MM_ANONPAGES */
>  	NR_MM_COUNTERS
>  };
>  
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6b5818d..67c732b 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -107,6 +107,8 @@ enum pageflags {
>  #endif
>  	__NR_PAGEFLAGS,
>  
> +	PG_lazyfreeaccounted = PG_mappedtodisk, /* only for anon MADV_FREE pages */
> +
>  	/* Filesystems */
>  	PG_checked = PG_owner_priv_1,
>  
> @@ -428,6 +430,10 @@ TESTPAGEFLAG_FALSE(Ksm)
>  
>  u64 stable_page_flags(struct page *page);
>  
> +PAGEFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
> +	TESTSETFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
> +	TESTCLEARFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
> +
>  static inline int PageUptodate(struct page *page)
>  {
>  	int ret;
> diff --git a/mm/gup.c b/mm/gup.c
> index 40abe4c..e64d990 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -171,6 +171,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		 * mark_page_accessed().
>  		 */
>  		mark_page_accessed(page);
> +		if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
> +			dec_mm_counter(mm, MM_LAZYFREEPAGES);
>  	}
>  	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>  		/* Do not mlock pte-mapped THP */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ddb9a94..951fa34 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -871,6 +871,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
>  	get_page(src_page);
>  	page_dup_rmap(src_page, true);
> +	if (PageAnon(src_page) && TestClearPageLazyFreeAccounted(src_page))
> +		add_mm_counter(src_mm, MM_LAZYFREEPAGES, -HPAGE_PMD_NR);
>  	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>  	atomic_long_inc(&dst_mm->nr_ptes);
>  	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -1402,6 +1404,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	}
>  
> +	if (page_mapcount(page) == 1 && !TestSetPageLazyFreeAccounted(page))
> +		add_mm_counter(mm, MM_LAZYFREEPAGES, HPAGE_PMD_NR);
>  	mark_page_lazyfree(page);
>  	ret = true;
>  out:
> @@ -1459,6 +1463,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			pte_free(tlb->mm, pgtable);
>  			atomic_long_dec(&tlb->mm->nr_ptes);
>  			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> +			if (TestClearPageLazyFreeAccounted(page))
> +				add_mm_counter(tlb->mm, MM_LAZYFREEPAGES,
> +						-HPAGE_PMD_NR);
>  		} else {
>  			if (arch_needs_pgtable_deposit())
>  				zap_deposited_table(tlb->mm, pmd);
> @@ -1917,6 +1924,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  			 (1L << PG_swapbacked) |
>  			 (1L << PG_mlocked) |
>  			 (1L << PG_uptodate) |
> +			 (1L << PG_lazyfreeaccounted) |
>  			 (1L << PG_active) |
>  			 (1L << PG_locked) |
>  			 (1L << PG_unevictable) |
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a4b499f..e4668db 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -577,6 +577,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		}
>  		inc_node_page_state(page,
>  				NR_ISOLATED_ANON + page_is_file_cache(page));
> +		if (TestClearPageLazyFreeAccounted(page))
> +			dec_mm_counter(vma->vm_mm, MM_LAZYFREEPAGES);
>  		VM_BUG_ON_PAGE(!PageLocked(page), page);
>  		VM_BUG_ON_PAGE(PageLRU(page), page);
>  
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fe40e93..3c90956 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -275,6 +275,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	struct page *page;
>  	int nr_swap = 0;
>  	unsigned long next;
> +	int nr_lazyfree_accounted = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd))
> @@ -380,9 +381,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			set_pte_at(mm, addr, pte, ptent);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
>  		}
> +		if (page_mapcount(page) == 1 &&
> +		    !TestSetPageLazyFreeAccounted(page))
> +			nr_lazyfree_accounted++;
>  		mark_page_lazyfree(page);
>  	}
>  out:
> +	add_mm_counter(mm, MM_LAZYFREEPAGES, nr_lazyfree_accounted);
>  	if (nr_swap) {
>  		if (current->mm == mm)
>  			sync_mm_rss(mm);
> diff --git a/mm/memory.c b/mm/memory.c
> index cf97d88..e275de1 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -850,7 +850,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  static inline unsigned long
>  copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
> -		unsigned long addr, int *rss)
> +		unsigned long addr, int *rss, int *rss_src_lazyfree)
>  {
>  	unsigned long vm_flags = vma->vm_flags;
>  	pte_t pte = *src_pte;
> @@ -915,6 +915,9 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page, false);
> +		if (PageAnon(page) &&
> +		    TestClearPageLazyFreeAccounted(page))
> +			(*rss_src_lazyfree)++;
>  		rss[mm_counter(page)]++;
>  	}
>  
> @@ -932,10 +935,12 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	spinlock_t *src_ptl, *dst_ptl;
>  	int progress = 0;
>  	int rss[NR_MM_COUNTERS];
> +	int rss_src_lazyfree;
>  	swp_entry_t entry = (swp_entry_t){0};
>  
>  again:
>  	init_rss_vec(rss);
> +	rss_src_lazyfree = 0;
>  
>  	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
>  	if (!dst_pte)
> @@ -963,13 +968,14 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  			continue;
>  		}
>  		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
> -							vma, addr, rss);
> +					vma, addr, rss, &rss_src_lazyfree);
>  		if (entry.val)
>  			break;
>  		progress += 8;
>  	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
>  
>  	arch_leave_lazy_mmu_mode();
> +	add_mm_counter(src_mm, MM_LAZYFREEPAGES, -rss_src_lazyfree);
>  	spin_unlock(src_ptl);
>  	pte_unmap(orig_src_pte);
>  	add_mm_rss_vec(dst_mm, rss);
> @@ -1163,6 +1169,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>  					mark_page_accessed(page);
>  			}
>  			rss[mm_counter(page)]--;
> +			if (PageAnon(page) &&
> +			    TestClearPageLazyFreeAccounted(page))
> +				rss[MM_LAZYFREEPAGES]--;
>  			page_remove_rmap(page, false);
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index eb76f87..6e586d2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -642,6 +642,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>  		SetPageChecked(newpage);
>  	if (PageMappedToDisk(page))
>  		SetPageMappedToDisk(newpage);
> +	if (PageLazyFreeAccounted(page))
> +		SetPageLazyFreeAccounted(newpage);
>  
>  	/* Move dirty on pages not done by migrate_page_move_mapping() */
>  	if (PageDirty(page))
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 51c0918..54e0604 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -528,11 +528,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  					 NULL);
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
> -	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> +	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, lazyfree-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
> -			K(get_mm_counter(mm, MM_SHMEMPAGES)));
> +			K(get_mm_counter(mm, MM_SHMEMPAGES)),
> +			K(get_mm_counter(mm, MM_LAZYFREEPAGES)));
>  	up_read(&mm->mmap_sem);
>  
>  	/*
> @@ -878,11 +879,12 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	 */
>  	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
>  	mark_oom_victim(victim);
> -	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> +	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, lazyfree-rss:%lukB\n",
>  		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> -		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> +		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> +		K(get_mm_counter(victim->mm, MM_LAZYFREEPAGES)));
>  	task_unlock(victim);
>  
>  	/*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5f05926..86c80d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1585,6 +1585,9 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	put_page(page);
>  
>  out_unmap:
> +	/* regardless of success or failure, the page isn't lazyfree */
> +	if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
> +		add_mm_counter(mm, MM_LAZYFREEPAGES, -hpage_nr_pages(page));
>  	pte_unmap_unlock(pte, ptl);
>  	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
>  		mmu_notifier_invalidate_page(mm, address);
> -- 
> 2.9.3
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-22  0:46     ` Minchan Kim
@ 2017-02-22  1:27       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-22  1:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Wed, Feb 22, 2017 at 09:46:05AM +0900, Minchan Kim wrote:
> Hi Shaohua,
> 
> On Fri, Feb 03, 2017 at 03:33:23PM -0800, Shaohua Li wrote:
> > Add a separate RSS for MADV_FREE pages. The pages are charged into
> > MM_ANONPAGES (because they are mapped anon pages) and also charged into
> > the MM_LAZYFREEPAGES. /proc/pid/statm will have an extra field to
> > display the RSS, which userspace can use to determine the RSS excluding
> > MADV_FREE pages.
> 
> I'm not sure statm is right place. With definition of statm and considering
> your usecase, it would be right place but when I look "status", it already
> shows RssAnon, RssFile and RssShmem so I thought we can add RssLazy to it.
> It would be more consistent if you don't have big overhead.
> 
> > 
> > The basic idea is to increment the RSS in madvise and decrement in unmap
> > or page reclaim. There is one limitation. If a page is shared by two
> > processes, since madvise only has mm context of current process, it isn't
> > convenient to charge the RSS for both processes. So we don't charge the
> > RSS if the mapcount isn't 1. On the other hand, fork can make a
> > MADV_FREE page shared by two processes. To make things consistent, we
> > uncharge the RSS from the source mm in fork.
> 
> I don't understand why we need new flag.
> 
> What's the problem like handling it normal anon|file|swapent|shmem?
> IOW, we can increase in madvise context and increase for child in copy_one_pte
> if the pte is still not dirty. And then decrease it in zap_pte_range/
> try_to_unmap_one if it finds it's dirty or discardable.
> 
> Although it's shared by fork, VM can discard it if processes doesn't
> make it dirty.

The thing is we could madvise the same page twice. The madvise context can't
guarantee the page moves to the inactive file list, so we could wrongly
increase the count.
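
That is why the madvise side of the patch only takes the charge on the
first transition, via the new flag - excerpt from madvise_free_pte_range():

if (page_mapcount(page) == 1 &&
    !TestSetPageLazyFreeAccounted(page))
	nr_lazyfree_accounted++;	/* counted once, even if madvised again */
mark_page_lazyfree(page);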

> > 
> > A new flag is added to indicate if a page is accounted into the RSS. We
> > can't use SwapBacked flag to do the determination because we can't
> > guarantee the page has SwapBacked flag cleared in madvise. We are
> > reusing mappedtodisk flag which should not be set for Anon pages.
> > 
> > There are a couple of other places we need to uncharge the RSS,
> > activate_page and mark_page_accessed. activate_page is used by swap,
> > where MADV_FREE pages are already not in lazyfree state before going
> > into swap. mark_page_accessed is mainly used for file pages, but there
> > are several places it's used by anonymous pages. I fixed gup, but not
> > some gpu drivers and kvm. If the drivers use MADV_FREE, we might have
> > imprecise RSS accounting.
> > 
> > Please note, the accounting is never going to be precise. MADV_FREE page
> > could be written by userspace without notification to the kernel. The
> > page can't be reclaimed like other clean lazyfree pages. The page isn't
> > real lazyfree page. But since kernel isn't aware of this, the page is
> > still accounted as lazyfree, thus the accounting could be incorrect.
> 
> Right. Lazyfree is not inaccurate without CoW where it's point to decrease
> lazyfree rss count when the store happens so we might be tempted to make
> it to Cow at the cost of performance degradation but still it's not accurate
> without making mark_page_accessed be aware of each mm context which is
> hard part. So, I agree this stat is useful but don't want to make it
> complicate.

Yes, it could only be accurate with an extra page-fault cost, but apparently
nobody wants to pay for it.

I talked to the jemalloc guys here. They have concerns about the accounting
since it's not accurate. I'll drop the accounting patches in the next post.
The only interface which can export accurate info is /proc/pid/smaps; we'll
probably go that way.
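
For reference, the reader side of the smaps route would be something like
this rough sketch (the "LazyFree:" field name assumes patch 6/7 of this
series); it has to walk and parse every mapping, which is where the cost
comes from:

#include <stdio.h>

/* sum the per-VMA "LazyFree:" lines from /proc/<pid>/smaps */
int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps", argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "LazyFree: %lu kB", &kb) == 1)
			total += kb;
	fclose(f);
	printf("LazyFree: %lu kB\n", total);
	return 0;
}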

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 6/7] proc: show MADV_FREE pages info in smaps
  2017-02-03 23:33   ` Shaohua Li
@ 2017-02-22  2:47     ` Minchan Kim
  -1 siblings, 0 replies; 62+ messages in thread
From: Minchan Kim @ 2017-02-22  2:47 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Fri, Feb 03, 2017 at 03:33:22PM -0800, Shaohua Li wrote:
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  fs/proc/task_mmu.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ee3efb2..8f2423f 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -440,6 +440,7 @@ struct mem_size_stats {
>  	unsigned long private_dirty;
>  	unsigned long referenced;
>  	unsigned long anonymous;
> +	unsigned long lazyfree;
>  	unsigned long anonymous_thp;
>  	unsigned long shmem_thp;
>  	unsigned long swap;
> @@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>  	int i, nr = compound ? 1 << compound_order(page) : 1;
>  	unsigned long size = nr * PAGE_SIZE;
>  
> -	if (PageAnon(page))
> +	if (PageAnon(page)) {
>  		mss->anonymous += size;
> +		if (!PageSwapBacked(page))

How about this?

		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))

> +			mss->lazyfree += size;
> +	}
>  
>  	mss->resident += size;
>  	/* Accumulate the size in pages that have been accessed. */
> @@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   "Private_Dirty:  %8lu kB\n"
>  		   "Referenced:     %8lu kB\n"
>  		   "Anonymous:      %8lu kB\n"
> +		   "LazyFree:       %8lu kB\n"
>  		   "AnonHugePages:  %8lu kB\n"
>  		   "ShmemPmdMapped: %8lu kB\n"
>  		   "Shared_Hugetlb: %8lu kB\n"
> @@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   mss.private_dirty >> 10,
>  		   mss.referenced >> 10,
>  		   mss.anonymous >> 10,
> +		   mss.lazyfree >> 10,
>  		   mss.anonymous_thp >> 10,
>  		   mss.shmem_thp >> 10,
>  		   mss.shared_hugetlb >> 10,
> -- 
> 2.9.3
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH V2 6/7] proc: show MADV_FREE pages info in smaps
  2017-02-22  2:47     ` Minchan Kim
@ 2017-02-22  4:11       ` Shaohua Li
  -1 siblings, 0 replies; 62+ messages in thread
From: Shaohua Li @ 2017-02-22  4:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, Kernel-team, danielmicay, mhocko, hughd,
	hannes, riel, mgorman, akpm

On Wed, Feb 22, 2017 at 11:47:21AM +0900, Minchan Kim wrote:
> On Fri, Feb 03, 2017 at 03:33:22PM -0800, Shaohua Li wrote:
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  fs/proc/task_mmu.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index ee3efb2..8f2423f 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -440,6 +440,7 @@ struct mem_size_stats {
> >  	unsigned long private_dirty;
> >  	unsigned long referenced;
> >  	unsigned long anonymous;
> > +	unsigned long lazyfree;
> >  	unsigned long anonymous_thp;
> >  	unsigned long shmem_thp;
> >  	unsigned long swap;
> > @@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
> >  	int i, nr = compound ? 1 << compound_order(page) : 1;
> >  	unsigned long size = nr * PAGE_SIZE;
> >  
> > -	if (PageAnon(page))
> > +	if (PageAnon(page)) {
> >  		mss->anonymous += size;
> > +		if (!PageSwapBacked(page))
> 
> How about this?
> 
> 		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))

Yes, already fixed like this.
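
For reference, this is roughly what the smaps_account() hunk looks like with
that check folded in (just the quoted patch combined with the suggested
condition, not necessarily the exact code in the next posting):

	if (PageAnon(page)) {
		mss->anonymous += size;
		/* a clean anon page without PG_swapbacked is lazyfree (MADV_FREE) */
		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
			mss->lazyfree += size;
	}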

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2017-02-22  4:12 UTC | newest]

Thread overview: 62+ messages
2017-02-03 23:33 [PATCH V2 0/7] mm: fix some MADV_FREE issues Shaohua Li
2017-02-03 23:33 ` [PATCH V2 1/7] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
2017-02-03 23:33 ` [PATCH V2 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
2017-02-04  6:38   ` Hillf Danton
2017-02-09  6:33     ` Hillf Danton
2017-02-10  6:50   ` Minchan Kim
2017-02-10 17:30     ` Shaohua Li
2017-02-13  4:57       ` Minchan Kim
2017-02-10 13:02   ` Michal Hocko
2017-02-10 17:33     ` Shaohua Li
2017-02-03 23:33 ` [PATCH V2 3/7] mm: reclaim MADV_FREE pages Shaohua Li
2017-02-10  6:58   ` Minchan Kim
2017-02-10 17:43     ` Shaohua Li
2017-02-13  5:06       ` Minchan Kim
2017-02-10 13:23   ` Michal Hocko
2017-02-03 23:33 ` [PATCH V2 4/7] mm: enable MADV_FREE for swapless system Shaohua Li
2017-02-03 23:33 ` [PATCH V2 5/7] mm: add vmstat account for MADV_FREE pages Shaohua Li
2017-02-10 13:27   ` Michal Hocko
2017-02-10 17:50     ` Shaohua Li
2017-02-21  9:43       ` Michal Hocko
2017-02-03 23:33 ` [PATCH V2 6/7] proc: show MADV_FREE pages info in smaps Shaohua Li
2017-02-10 13:30   ` Michal Hocko
2017-02-10 17:52     ` Shaohua Li
2017-02-22  2:47   ` Minchan Kim
2017-02-22  4:11     ` Shaohua Li
2017-02-03 23:33 ` [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages Shaohua Li
2017-02-10 13:35   ` Michal Hocko
2017-02-10 18:01     ` Shaohua Li
2017-02-21  9:45       ` Michal Hocko
2017-02-22  0:46   ` Minchan Kim
2017-02-22  1:27     ` Shaohua Li
