* [PATCH V3 0/7] mm: fix some MADV_FREE issues
@ 2017-02-14 19:36 Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 1/7] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
                   ` (6 more replies)
  0 siblings, 7 replies; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Hi,

We are trying to use MADV_FREE in jemalloc and have run into several issues.
Without solving them, jemalloc can't use the MADV_FREE feature (a minimal
usage sketch follows the list below).
- It doesn't work on systems without swap enabled. If swap is off, we can't
  (or can't efficiently) age anonymous pages, and since MADV_FREE pages are
  mixed with other anonymous pages, we can't reclaim them. In the current
  implementation, MADV_FREE falls back to MADV_DONTNEED when swap is not
  enabled. But in our environment a lot of machines don't enable swap, which
  prevents our setup from using MADV_FREE.
- It increases memory pressure. Page reclaim is biased to reclaim file pages
  rather than anonymous pages. This doesn't make sense for MADV_FREE pages,
  because those pages can be freed easily and refilled with very slight
  penalty. Even if page reclaim weren't biased toward file pages, there would
  still be an issue, because MADV_FREE pages and other anonymous pages are
  mixed together: to reclaim a MADV_FREE page, we probably must scan a lot of
  other anonymous pages, which is inefficient. In our tests, we usually see
  OOMs with MADV_FREE enabled and none without it.
- RSS accounting. MADV_FREE pages are accounted as normal anon pages and
  reclaimed lazily, so an application's RSS becomes bigger. This confuses our
  workloads. We have a monitoring daemon running, and if it finds an
  application's RSS has become abnormal, the daemon kills the application
  even though the kernel could reclaim the memory easily. Currently we don't
  export separate RSS accounting for MADV_FREE pages, which also prevents
  our setup from using MADV_FREE.
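
For context, here is a minimal sketch of the usage pattern (hypothetical
buffer and sizes, not code from jemalloc; assumes a libc that exposes
MADV_FREE, available since Linux 4.5): an allocator keeps freed memory
mapped, tells the kernel it may reclaim it lazily, and reuses pages simply
by writing to them.

  /* hypothetical allocator usage sketch, not jemalloc code */
  #include <sys/mman.h>
  #include <string.h>

  int main(void)
  {
          size_t len = 2 * 1024 * 1024;
          char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (buf == MAP_FAILED)
                  return 1;
          memset(buf, 1, len);            /* pages become resident */
          madvise(buf, len, MADV_FREE);   /* kernel may reclaim them lazily */
          buf[0] = 2;                     /* a write reuses the page and
                                             cancels the hint for it */
          return 0;
  }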

To address the first two issues, we can either put MADV_FREE pages into a
separate LRU list (Minchan's previous patches and the V1 patches), or put
them into the LRU_INACTIVE_FILE list (suggested by Johannes). This patchset
uses the second idea. The reason is that the LRU_INACTIVE_FILE list is tiny
nowadays and should be full of used-once file pages, so we can still
efficiently reclaim MADV_FREE pages there without interfering with other
anonymous and active file pages. Putting the pages into the inactive file
list also has the advantage of letting page reclaim prioritize MADV_FREE
pages alongside used-once file pages. MADV_FREE pages are put into that LRU
list with their SwapBacked flag cleared, so PageAnon(page) &&
!PageSwapBacked(page) indicates a MADV_FREE page. Such pages are freed
directly without pageout if they are clean; otherwise the normal swapout
path reclaims them.
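
Patch 2 below encodes this check in a helper; the predicate is literally:

  /* from include/linux/mm_inline.h in patch 2 */
  static inline bool page_is_lazyfree(struct page *page)
  {
          return PageAnon(page) && !PageSwapBacked(page);
  }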

For the third issue, we add a separate RSS count for MADV_FREE pages. The
count is increased in the madvise syscall and decreased in unmap or page
reclaim. There is one limitation: the accounting doesn't work well for
shared pages; please check the last patch. This probably isn't a big issue,
because userspace will write the pages before reusing them, which breaks the
page sharing between the two processes. And if two processes share a page,
the page can't really be lazyfreed anyway.
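
As an illustration, a monitoring daemon could read the new field like this
(a sketch assuming a kernel with this series applied; unpatched kernels
print only seven statm fields, so the read of the eighth would fail):

  /* sketch: assumes the extra 8th statm field added by the last patch */
  #include <stdio.h>

  int main(void)
  {
          unsigned long size, resident, shared, text, lib, data, dt, lazyfree;
          FILE *f = fopen("/proc/self/statm", "r");

          if (!f)
                  return 1;
          if (fscanf(f, "%lu %lu %lu %lu %lu %lu %lu %lu",
                     &size, &resident, &shared, &text, &lib,
                     &data, &dt, &lazyfree) != 8) {
                  fclose(f);
                  return 1;
          }
          fclose(f);
          /* all values are in pages; lazyfree is included in resident */
          printf("rss excluding lazyfree: %lu pages\n", resident - lazyfree);
          return 0;
  }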

Thanks,
Shaohua

V2->V3:
- rebase to latest -mm tree
- Address several issues pointed out by Minchan
- Add more descriptions

V1->V2:
- Put MADV_FREE pages into LRU_INACTIVE_FILE list instead of adding a new lru
  list, suggested by Johannes
- Add RSS support
http://marc.info/?l=linux-mm&m=148616481928054&w=2

Minchan previous patches:
http://marc.info/?l=linux-mm&m=144800657002763&w=2
----------------------

Shaohua Li (7):
  mm: don't assume anonymous pages have SwapBacked flag
  mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  mm: reclaim MADV_FREE pages
  mm: enable MADV_FREE for swapless system
  mm: add vmstat account for MADV_FREE pages
  proc: show MADV_FREE pages info in smaps
  mm: add a separate RSS for MADV_FREE pages

 drivers/base/node.c           |  2 ++
 fs/proc/array.c               |  9 +++++---
 fs/proc/internal.h            |  3 ++-
 fs/proc/meminfo.c             |  1 +
 fs/proc/task_mmu.c            | 17 ++++++++++++---
 fs/proc/task_nommu.c          |  4 +++-
 include/linux/mm_inline.h     | 29 ++++++++++++++++++++++++
 include/linux/mm_types.h      |  1 +
 include/linux/mmzone.h        |  2 ++
 include/linux/page-flags.h    |  6 +++++
 include/linux/swap.h          |  2 +-
 include/linux/vm_event_item.h |  2 +-
 mm/gup.c                      |  2 ++
 mm/huge_memory.c              | 14 ++++++++----
 mm/khugepaged.c               | 10 ++++-----
 mm/madvise.c                  | 16 ++++++--------
 mm/memory.c                   | 13 +++++++++--
 mm/migrate.c                  |  5 ++++-
 mm/oom_kill.c                 | 10 +++++----
 mm/page_alloc.c               | 13 ++++++++---
 mm/rmap.c                     | 40 ++++++++++++++++++++++-----------
 mm/swap.c                     | 51 ++++++++++++++++++++++++-------------------
 mm/vmscan.c                   | 30 +++++++++++++++++--------
 mm/vmstat.c                   |  3 +++
 24 files changed, 203 insertions(+), 82 deletions(-)

-- 
2.9.3

* [PATCH V3 1/7] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  2017-02-16 17:39   ` Johannes Weiner
  2017-02-14 19:36 ` [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

There are a few places where the code assumes anonymous pages should have
the SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked
flag for them. The assumption no longer holds, so fix those places.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/huge_memory.c | 1 -
 mm/khugepaged.c  | 8 +++-----
 mm/migrate.c     | 3 ++-
 mm/rmap.c        | 3 ++-
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index efddd02..e602265 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2361,7 +2361,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
 	if (PageAnon(head)) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c..a4b499f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -481,8 +481,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 static void release_pte_page(struct page *page)
 {
-	/* 0 stands for page_is_file_cache(page) == false */
-	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -530,7 +529,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
 		/*
 		 * We can do it before isolate_lru_page because the
@@ -577,8 +575,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
-		/* 0 stands for page_is_file_cache(page) == false */
-		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_node_page_state(page,
+				NR_ISOLATED_ANON + page_is_file_cache(page));
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 2c63ac0..7c8df1f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1943,7 +1943,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	/* Prepare a page as a migration target */
 	__SetPageLocked(new_page);
-	__SetPageSwapBacked(new_page);
+	if (PageSwapBacked(page))
+		__SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
diff --git a/mm/rmap.c b/mm/rmap.c
index 8774791..af50eae 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1416,7 +1416,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
 			 */
-			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
+				page);
 
 			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
 				/* It's a freeable page by MADV_FREE */
-- 
2.9.3

* [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 1/7] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  2017-02-16 17:52   ` Johannes Weiner
  2017-02-14 19:36 ` [PATCH V3 3/7] mm: reclaim MADV_FREE pages Shaohua Li
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

MADV_FREE indicates pages are 'lazyfree'. They are still anonymous pages,
but they can be freed without pageout. To distinguish them from normal
anonymous pages, we clear their SwapBacked flag.

Since MADV_FREE pages can be freed without pageout, they behave pretty much
like used-once file pages. For such pages, we'd like to reclaim them once
there is memory pressure. It might also be unfair to always reclaim
MADV_FREE pages before used-once file pages, and we definitely want to
reclaim these pages before other anonymous and active file pages.

To speed up MADV_FREE page reclaim, we put the pages into the
LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list is
tiny nowadays and should be full of used-once file pages, so reclaiming
MADV_FREE pages will not interfere much with anonymous and active file
pages. And since inactive file pages and MADV_FREE pages are reclaimed
according to their age, we don't reclaim too many MADV_FREE pages either.
Putting the MADV_FREE pages into the LRU_INACTIVE_FILE list also means we
can reclaim the pages without swap support. This idea was suggested by
Johannes.

This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list yet,
to avoid bisect failure; the next patch will do it.

The patch is based on Minchan's original patch.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/mm_inline.h     | 20 +++++++++++++++++
 include/linux/swap.h          |  2 +-
 include/linux/vm_event_item.h |  2 +-
 mm/huge_memory.c              |  3 ---
 mm/madvise.c                  |  2 --
 mm/swap.c                     | 51 ++++++++++++++++++++++++-------------------
 mm/vmstat.c                   |  1 +
 7 files changed, 52 insertions(+), 29 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e030a68..e6e3af1 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -126,4 +126,24 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+/*
+ * lazyfree pages are clean anonymous pages. They have the SwapBacked flag
+ * cleared to distinguish them from normal anonymous pages.
+ */
+static inline void set_page_lazyfree(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageAnon(page) || !PageSwapBacked(page), page);
+	ClearPageSwapBacked(page);
+}
+
+static inline void clear_page_lazyfree(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageAnon(page) || PageSwapBacked(page), page);
+	SetPageSwapBacked(page);
+}
+
+static inline bool page_is_lazyfree(struct page *page)
+{
+	return PageAnon(page) && !PageSwapBacked(page);
+}
 #endif
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd..486494e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 6aa1b6c..94e58da 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		FOR_ALL_ZONES(ALLOCSTALL),
 		FOR_ALL_ZONES(PGSCAN_SKIP),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
 		PGREFILL,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e602265..4ddda58 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1562,9 +1562,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		ClearPageDirty(page);
 	unlock_page(page);
 
-	if (PageActive(page))
-		deactivate_page(page);
-
 	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
 		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
 			tlb->fullmm);
diff --git a/mm/madvise.c b/mm/madvise.c
index 11fc65f..639c476 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -410,8 +410,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			ptent = pte_mkold(ptent);
 			ptent = pte_mkclean(ptent);
 			set_pte_at(mm, addr, pte, ptent);
-			if (PageActive(page))
-				deactivate_page(page);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 	}
diff --git a/mm/swap.c b/mm/swap.c
index c4910f1..9305c23 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,7 +46,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 #endif
@@ -268,6 +268,12 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
 		int lru = page_lru_base_type(page);
 
 		del_page_from_lru_list(page, lruvec, lru);
+		if (page_is_lazyfree(page)) {
+			clear_page_lazyfree(page);
+			/* charge to anon scanned/rotated reclaim_stat */
+			file = 0;
+			lru = LRU_INACTIVE_ANON;
+		}
 		SetPageActive(page);
 		lru += LRU_ACTIVE;
 		add_page_to_lru_list(page, lruvec, lru);
@@ -561,20 +567,21 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 }
 
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_cache(page);
-		int lru = page_lru_base_type(page);
+	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	    !PageUnevictable(page)) {
+		bool active = PageActive(page);
 
-		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
 		ClearPageActive(page);
 		ClearPageReferenced(page);
-		add_page_to_lru_list(page, lruvec, lru);
+		set_page_lazyfree(page);
+		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
 
-		__count_vm_event(PGDEACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 0);
+		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, 1, 0);
 	}
 }
 
@@ -604,9 +611,9 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
-	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
 
 	activate_page_drain(cpu);
 }
@@ -638,22 +645,22 @@ void deactivate_file_page(struct page *page)
 }
 
 /**
- * deactivate_page - deactivate a page
+ * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
  *
- * deactivate_page() moves @page to the inactive list if @page was on the active
- * list and was not an unevictable page.  This is done to accelerate the reclaim
- * of @page.
+ * mark_page_lazyfree() moves @page to the inactive file list.
+ * This is done to accelerate the reclaim of @page.
  */
-void deactivate_page(struct page *page)
-{
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+void mark_page_lazyfree(struct page *page)
+{
+	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	    !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
 
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
-		put_cpu_var(lru_deactivate_pvecs);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		put_cpu_var(lru_lazyfree_pvecs);
 	}
 }
 
@@ -704,7 +711,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 69f9aff..7774196 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
 	"pgfree",
 	"pgactivate",
 	"pgdeactivate",
+	"pglazyfree",
 
 	"pgfault",
 	"pgmajfault",
-- 
2.9.3

* [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 1/7] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  2017-02-16 18:40   ` Johannes Weiner
  2017-02-14 19:36 ` [PATCH V3 4/7] mm: enable MADV_FREE for swapless system Shaohua Li
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

When memory pressure is high, we free MADV_FREE pages. If the pages are not
dirty in the pte, they can be freed immediately; otherwise we can't reclaim
them directly. In that case we put the pages back on the anonymous LRU list
(by setting the SwapBacked flag) and the pages will be reclaimed through the
normal swapout path.

We use the normal page reclaim policy. Since MADV_FREE pages are put into
the inactive file list, such pages and inactive file pages are reclaimed
according to their age. This is expected, because we don't want to reclaim
too many MADV_FREE pages before used-once file pages.

Based on Minchan's original patch.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/huge_memory.c |  2 ++
 mm/madvise.c     |  1 +
 mm/rmap.c        | 17 ++++++++++++-----
 mm/vmscan.c      | 30 +++++++++++++++++++++---------
 4 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ddda58..3bb5ad5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1571,6 +1571,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		set_pmd_at(mm, addr, pmd, orig_pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	}
+
+	mark_page_lazyfree(page);
 	ret = true;
 out:
 	spin_unlock(ptl);
diff --git a/mm/madvise.c b/mm/madvise.c
index 639c476..2faed38 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -412,6 +412,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			set_pte_at(mm, addr, pte, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
+		mark_page_lazyfree(page);
 	}
 out:
 	if (nr_swap) {
diff --git a/mm/rmap.c b/mm/rmap.c
index af50eae..2cbdada 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
 				page);
 
-			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
-				/* It's a freeable page by MADV_FREE */
-				dec_mm_counter(mm, MM_ANONPAGES);
-				rp->lazyfreed++;
-				goto discard;
+			if (flags & TTU_LZFREE) {
+				if (!PageDirty(page)) {
+					/* It's a freeable page by MADV_FREE */
+					dec_mm_counter(mm, MM_ANONPAGES);
+					rp->lazyfreed++;
+					goto discard;
+				} else {
+					set_pte_at(mm, address, pvmw.pte, pteval);
+					ret = SWAP_FAIL;
+					page_vma_mapped_walk_done(&pvmw);
+					break;
+				}
 			}
 
 			if (swap_duplicate(entry) < 0) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b40..435149c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
 	 * Anonymous pages are not handled by flushers and must be written
 	 * from reclaim context. Do not stall reclaim based on them
 	 */
-	if (!page_is_file_cache(page)) {
+	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
 		*dirty = false;
 		*writeback = false;
 		return;
@@ -971,7 +971,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
-		bool lazyfree = false;
+		bool lazyfree;
 		int ret = SWAP_SUCCESS;
 
 		cond_resched();
@@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		sc->nr_scanned++;
 
+		lazyfree = page_is_lazyfree(page);
+
 		if (unlikely(!page_evictable(page)))
 			goto cull_mlocked;
 
@@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep_locked;
 
 		/* Double the slab pressure for mapped and swapcache pages */
-		if (page_mapped(page) || PageSwapCache(page))
+		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
 			sc->nr_scanned++;
 
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
@@ -1119,13 +1121,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
+		 * Lazyfree page could be freed directly
 		 */
-		if (PageAnon(page) && !PageSwapCache(page)) {
+		if (PageAnon(page) && !PageSwapCache(page) && !lazyfree) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
-			lazyfree = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && mapping) {
+		if (page_mapped(page) && (mapping || lazyfree)) {
 			switch (ret = try_to_unmap(page, lazyfree ?
 				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
 				(ttu_flags | TTU_BATCH_FLUSH))) {
@@ -1154,7 +1156,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			case SWAP_MLOCK:
 				goto cull_mlocked;
 			case SWAP_LZFREE:
-				goto lazyfree;
+				/* follow __remove_mapping for reference */
+				if (page_ref_freeze(page, 1)) {
+					if (!PageDirty(page))
+						goto lazyfree;
+					else
+						page_ref_unfreeze(page, 1);
+				}
+				goto keep_locked;
 			case SWAP_SUCCESS:
 				; /* try to free the page below */
 			}
@@ -1266,10 +1275,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-lazyfree:
 		if (!mapping || !__remove_mapping(mapping, page, true))
 			goto keep_locked;
-
+lazyfree:
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
@@ -1294,6 +1302,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 cull_mlocked:
 		if (PageSwapCache(page))
 			try_to_free_swap(page);
+		if (lazyfree)
+			clear_page_lazyfree(page);
 		unlock_page(page);
 		list_add(&page->lru, &ret_pages);
 		continue;
@@ -1303,6 +1313,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
 			try_to_free_swap(page);
 		VM_BUG_ON_PAGE(PageActive(page), page);
+		if (lazyfree)
+			clear_page_lazyfree(page);
 		SetPageActive(page);
 		pgactivate++;
 keep_locked:
-- 
2.9.3

* [PATCH V3 4/7] mm: enable MADV_FREE for swapless system
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
                   ` (2 preceding siblings ...)
  2017-02-14 19:36 ` [PATCH V3 3/7] mm: reclaim MADV_FREE pages Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  2017-02-17 16:16   ` Johannes Weiner
  2017-02-14 19:36 ` [PATCH V3 5/7] mm: add vmstat account for MADV_FREE pages Shaohua Li
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Now MADV_FREE pages can be easily reclaimed even on a swapless system, so we
can safely enable MADV_FREE for all systems.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/madvise.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 2faed38..851fabb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -611,13 +611,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_FREE:
-		/*
-		 * XXX: In this implementation, MADV_FREE works like
-		 * MADV_DONTNEED on swapless system or full swap.
-		 */
-		if (get_nr_swap_pages() > 0)
-			return madvise_free(vma, prev, start, end);
-		/* passthrough */
+		return madvise_free(vma, prev, start, end);
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
-- 
2.9.3

* [PATCH V3 5/7] mm: add vmstat account for MADV_FREE pages
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
                   ` (3 preceding siblings ...)
  2017-02-14 19:36 ` [PATCH V3 4/7] mm: enable MADV_FREE for swapless system Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 6/7] proc: show MADV_FREE pages info in smaps Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 7/7] mm: add a separate RSS for MADV_FREE pages Shaohua Li
  6 siblings, 0 replies; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Show MADV_FREE page info in the proc/sysfs files. Like other vm stat info
the kernel exports, the MADV_FREE info helps us know how much memory in a
node/zone consists of MADV_FREE pages. This is useful for diagnosis and
monitoring in userspace.
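
For example, a userspace monitor could pick up the new counters like this (a
minimal sketch; the key names "nr_lazyfree" and "pglazyfree" come from the
vmstat_text additions in this patch and patch 2, and /proc/meminfo grows a
"LazyFree:" line in the same style):

  /* sketch: these counters exist only with this series applied */
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          char key[64];
          unsigned long val;
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
                  return 1;
          while (fscanf(f, "%63s %lu", key, &val) == 2) {
                  /* nr_lazyfree: pages currently lazyfree;
                     pglazyfree: cumulative pages marked lazyfree */
                  if (!strcmp(key, "nr_lazyfree") ||
                      !strcmp(key, "pglazyfree"))
                          printf("%s: %lu\n", key, val);
          }
          fclose(f);
          return 0;
  }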

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/base/node.c       |  2 ++
 fs/proc/meminfo.c         |  1 +
 include/linux/mm_inline.h |  9 +++++++++
 include/linux/mmzone.h    |  2 ++
 mm/page_alloc.c           | 13 ++++++++++---
 mm/vmstat.c               |  2 ++
 6 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..9138db8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -71,6 +71,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       "Node %d Active(file):   %8lu kB\n"
 		       "Node %d Inactive(file): %8lu kB\n"
 		       "Node %d Unevictable:    %8lu kB\n"
+		       "Node %d LazyFree:       %8lu kB\n"
 		       "Node %d Mlocked:        %8lu kB\n",
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
@@ -84,6 +85,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(node_page_state(pgdat, NR_ACTIVE_FILE)),
 		       nid, K(node_page_state(pgdat, NR_INACTIVE_FILE)),
 		       nid, K(node_page_state(pgdat, NR_UNEVICTABLE)),
+		       nid, K(node_page_state(pgdat, NR_LAZYFREE)),
 		       nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8a42849..b2e7b31 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -80,6 +80,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "Active(file):   ", pages[LRU_ACTIVE_FILE]);
 	show_val_kb(m, "Inactive(file): ", pages[LRU_INACTIVE_FILE]);
 	show_val_kb(m, "Unevictable:    ", pages[LRU_UNEVICTABLE]);
+	show_val_kb(m, "LazyFree:       ", global_node_page_state(NR_LAZYFREE));
 	show_val_kb(m, "Mlocked:        ", global_page_state(NR_MLOCK));
 
 #ifdef CONFIG_HIGHMEM
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index e6e3af1..0de5cb6 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -126,6 +126,13 @@ static __always_inline enum lru_list page_lru(struct page *page)
 
 #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
 
+static inline void __update_lazyfree_page_stat(struct page *page,
+						int nr_pages)
+{
+	mod_node_page_state(page_pgdat(page), NR_LAZYFREE, nr_pages);
+	mod_zone_page_state(page_zone(page), NR_ZONE_LAZYFREE, nr_pages);
+}
+
 /*
  * lazyfree pages are clean anonymous pages. They have the SwapBacked flag
  * cleared to distinguish them from normal anonymous pages.
@@ -134,12 +141,14 @@ static inline void set_page_lazyfree(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageAnon(page) || !PageSwapBacked(page), page);
 	ClearPageSwapBacked(page);
+	__update_lazyfree_page_stat(page, hpage_nr_pages(page));
 }
 
 static inline void clear_page_lazyfree(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageAnon(page) || PageSwapBacked(page), page);
 	SetPageSwapBacked(page);
+	__update_lazyfree_page_stat(page, -hpage_nr_pages(page));
 }
 
 static inline bool page_is_lazyfree(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 338a786a..78985f1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -118,6 +118,7 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_LAZYFREE,
 	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
@@ -147,6 +148,7 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
+	NR_LAZYFREE,		/*  "     "     "   "       "         */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 19f438a..aa04d5c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1023,8 +1023,12 @@ static __always_inline bool free_pages_prepare(struct page *page,
 			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
 	}
-	if (PageMappingFlags(page))
+	if (PageMappingFlags(page)) {
+		if (page_is_lazyfree(page))
+			__update_lazyfree_page_stat(page,
+						-hpage_nr_pages(page));
 		page->mapping = NULL;
+	}
 	if (memcg_kmem_enabled() && PageKmemcg(page))
 		memcg_kmem_uncharge(page, order);
 	if (check_free)
@@ -4459,7 +4463,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
-		" free:%lu free_pcp:%lu free_cma:%lu\n",
+		" free:%lu free_pcp:%lu free_cma:%lu lazy_free:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
 		global_node_page_state(NR_ISOLATED_ANON),
@@ -4478,7 +4482,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_page_state(NR_BOUNCE),
 		global_page_state(NR_FREE_PAGES),
 		free_pcp,
-		global_page_state(NR_FREE_CMA_PAGES));
+		global_page_state(NR_FREE_CMA_PAGES),
+		global_node_page_state(NR_LAZYFREE));
 
 	for_each_online_pgdat(pgdat) {
 		if (show_mem_node_skip(filter, pgdat->node_id, nodemask))
@@ -4490,6 +4495,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
+			" lazy_free:%lukB"
 			" isolated(anon):%lukB"
 			" isolated(file):%lukB"
 			" mapped:%lukB"
@@ -4512,6 +4518,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
+			K(node_page_state(pgdat, NR_LAZYFREE)),
 			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
 			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7774196..a70b52d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,7 @@ const char * const vmstat_text[] = {
 	"nr_zone_inactive_file",
 	"nr_zone_active_file",
 	"nr_zone_unevictable",
+	"nr_zone_lazyfree",
 	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
@@ -952,6 +953,7 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
+	"nr_lazyfree",
 	"nr_isolated_anon",
 	"nr_isolated_file",
 	"nr_pages_scanned",
-- 
2.9.3

* [PATCH V3 6/7] proc: show MADV_FREE pages info in smaps
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
                   ` (4 preceding siblings ...)
  2017-02-14 19:36 ` [PATCH V3 5/7] mm: add vmstat account for MADV_FREE pages Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  2017-02-14 19:36 ` [PATCH V3 7/7] mm: add a separate RSS for MADV_FREE pages Shaohua Li
  6 siblings, 0 replies; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Show MADV_FREE page info for each vma in smaps. This is mainly for
diagnostic purposes; userspace can use it to understand what is happening
in the vma.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/proc/task_mmu.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb2..8f2423f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -440,6 +440,7 @@ struct mem_size_stats {
 	unsigned long private_dirty;
 	unsigned long referenced;
 	unsigned long anonymous;
+	unsigned long lazyfree;
 	unsigned long anonymous_thp;
 	unsigned long shmem_thp;
 	unsigned long swap;
@@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	int i, nr = compound ? 1 << compound_order(page) : 1;
 	unsigned long size = nr * PAGE_SIZE;
 
-	if (PageAnon(page))
+	if (PageAnon(page)) {
 		mss->anonymous += size;
+		if (!PageSwapBacked(page))
+			mss->lazyfree += size;
+	}
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
@@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   "Private_Dirty:  %8lu kB\n"
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
+		   "LazyFree:       %8lu kB\n"
 		   "AnonHugePages:  %8lu kB\n"
 		   "ShmemPmdMapped: %8lu kB\n"
 		   "Shared_Hugetlb: %8lu kB\n"
@@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   mss.private_dirty >> 10,
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
+		   mss.lazyfree >> 10,
 		   mss.anonymous_thp >> 10,
 		   mss.shmem_thp >> 10,
 		   mss.shared_hugetlb >> 10,
-- 
2.9.3

* [PATCH V3 7/7] mm: add a separate RSS for MADV_FREE pages
  2017-02-14 19:36 [PATCH V3 0/7] mm: fix some MADV_FREE issues Shaohua Li
                   ` (5 preceding siblings ...)
  2017-02-14 19:36 ` [PATCH V3 6/7] proc: show MADV_FREE pages info in smaps Shaohua Li
@ 2017-02-14 19:36 ` Shaohua Li
  6 siblings, 0 replies; 23+ messages in thread
From: Shaohua Li @ 2017-02-14 19:36 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Right now MADV_FREE pages are accounted as normal anon pages and reclaimed
lazily, so an application's RSS becomes bigger. This confuses our workloads.
We have a monitoring daemon running, and if it finds an application's RSS
has become abnormal, the daemon kills the application even though the kernel
could reclaim the memory easily. This is a big obstacle to deploying
MADV_FREE in our workloads.

Applications could use /proc/pid/smaps to query MADV_FREE pages, but that is
too slow. Instead, applications usually use /proc/pid/statm to query RSS. So
add a separate RSS counter for MADV_FREE pages. The pages are charged into
MM_ANONPAGES (because they are mapped anon pages) and also into
MM_LAZYFREEPAGES. /proc/pid/statm gets an extra field to display this count,
which userspace can use to determine the RSS excluding MADV_FREE pages.

The basic idea is to increment the RSS in madvise and decrement it in unmap
or page reclaim. There is one limitation: if a page is shared by two
processes, then since madvise only has the mm context of the current
process, it isn't convenient to charge the RSS for both processes. So we
don't charge the RSS if the mapcount isn't 1. On the other hand, fork can
make a MADV_FREE page shared by two processes. To keep things consistent, we
uncharge the RSS from the source mm in fork.

A new page flag is added to indicate whether a page is accounted in this
RSS. We can't use the SwapBacked flag for the determination, because we
can't guarantee the page has the SwapBacked flag cleared in madvise. We
reuse the mappedtodisk flag, which should never be set for anon pages.

There are a couple of other places where we need to uncharge the RSS:
activate_page and mark_page_accessed. activate_page is used by swap, where
MADV_FREE pages have already left the lazyfree state before going into swap.
mark_page_accessed is mainly used for file pages, but there are several
places where it's used for anonymous pages. I fixed gup, but not some gpu
drivers and kvm. If those drivers use MADV_FREE, we might have imprecise
RSS accounting.

Please note, the accounting is never going to be precise. A MADV_FREE page
can be written by userspace without notifying the kernel. Such a page can't
be reclaimed like other clean lazyfree pages; it isn't really a lazyfree
page any more. But since the kernel isn't aware of this, the page is still
accounted as lazyfree, so the accounting can be incorrect.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 fs/proc/array.c            |  9 ++++++---
 fs/proc/internal.h         |  3 ++-
 fs/proc/task_mmu.c         |  9 +++++++--
 fs/proc/task_nommu.c       |  4 +++-
 include/linux/mm_types.h   |  1 +
 include/linux/page-flags.h |  6 ++++++
 mm/gup.c                   |  2 ++
 mm/huge_memory.c           |  8 ++++++++
 mm/khugepaged.c            |  2 ++
 mm/madvise.c               |  5 +++++
 mm/memory.c                | 13 +++++++++++--
 mm/migrate.c               |  2 ++
 mm/oom_kill.c              | 10 ++++++----
 mm/rmap.c                  | 22 ++++++++++++++--------
 14 files changed, 75 insertions(+), 21 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 51a4213..c2281f4 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -583,17 +583,19 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 			struct pid *pid, struct task_struct *task)
 {
 	unsigned long size = 0, resident = 0, shared = 0, text = 0, data = 0;
+	unsigned long lazyfree = 0;
 	struct mm_struct *mm = get_task_mm(task);
 
 	if (mm) {
-		size = task_statm(mm, &shared, &text, &data, &resident);
+		size = task_statm(mm, &shared, &text, &data, &resident,
+				  &lazyfree);
 		mmput(mm);
 	}
 	/*
 	 * For quick read, open code by putting numbers directly
 	 * expected format is
-	 * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0\n",
-	 *               size, resident, shared, text, data);
+	 * seq_printf(m, "%lu %lu %lu %lu 0 %lu 0 %lu\n",
+	 *               size, resident, shared, text, data, lazyfree);
 	 */
 	seq_put_decimal_ull(m, "", size);
 	seq_put_decimal_ull(m, " ", resident);
@@ -602,6 +604,7 @@ int proc_pid_statm(struct seq_file *m, struct pid_namespace *ns,
 	seq_put_decimal_ull(m, " ", 0);
 	seq_put_decimal_ull(m, " ", data);
 	seq_put_decimal_ull(m, " ", 0);
+	seq_put_decimal_ull(m, " ", lazyfree);
 	seq_putc(m, '\n');
 
 	return 0;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index e2c3c46..6587b9c 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -290,5 +290,6 @@ extern const struct file_operations proc_pagemap_operations;
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
 				unsigned long *, unsigned long *,
-				unsigned long *, unsigned long *);
+				unsigned long *, unsigned long *,
+				unsigned long *);
 extern void task_mem(struct seq_file *, struct mm_struct *);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8f2423f..f18b568 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -23,9 +23,10 @@
 
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
-	unsigned long text, lib, swap, ptes, pmds, anon, file, shmem;
+	unsigned long text, lib, swap, ptes, pmds, anon, file, shmem, lazyfree;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
+	lazyfree = get_mm_counter(mm, MM_LAZYFREEPAGES);
 	anon = get_mm_counter(mm, MM_ANONPAGES);
 	file = get_mm_counter(mm, MM_FILEPAGES);
 	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
@@ -59,6 +60,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"RssAnon:\t%8lu kB\n"
 		"RssFile:\t%8lu kB\n"
 		"RssShmem:\t%8lu kB\n"
+		"RssLazyfree:\t%8lu kB\n"
 		"VmData:\t%8lu kB\n"
 		"VmStk:\t%8lu kB\n"
 		"VmExe:\t%8lu kB\n"
@@ -75,6 +77,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		anon << (PAGE_SHIFT-10),
 		file << (PAGE_SHIFT-10),
 		shmem << (PAGE_SHIFT-10),
+		lazyfree << (PAGE_SHIFT-10),
 		mm->data_vm << (PAGE_SHIFT-10),
 		mm->stack_vm << (PAGE_SHIFT-10), text, lib,
 		ptes >> 10,
@@ -90,7 +93,8 @@ unsigned long task_vsize(struct mm_struct *mm)
 
 unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
-			 unsigned long *data, unsigned long *resident)
+			 unsigned long *data, unsigned long *resident,
+			 unsigned long *lazyfree)
 {
 	*shared = get_mm_counter(mm, MM_FILEPAGES) +
 			get_mm_counter(mm, MM_SHMEMPAGES);
@@ -98,6 +102,7 @@ unsigned long task_statm(struct mm_struct *mm,
 								>> PAGE_SHIFT;
 	*data = mm->data_vm + mm->stack_vm;
 	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+	*lazyfree = get_mm_counter(mm, MM_LAZYFREEPAGES);
 	return mm->total_vm;
 }
 
diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index 1ef97cf..50426de 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -94,7 +94,8 @@ unsigned long task_vsize(struct mm_struct *mm)
 
 unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
-			 unsigned long *data, unsigned long *resident)
+			 unsigned long *data, unsigned long *resident,
+			 unsigned long *lazyfree)
 {
 	struct vm_area_struct *vma;
 	struct vm_region *region;
@@ -120,6 +121,7 @@ unsigned long task_statm(struct mm_struct *mm,
 	size >>= PAGE_SHIFT;
 	size += *text + *data;
 	*resident = size;
+	*lazyfree = 0;
 	return size;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f6d440..b6a1428 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -376,6 +376,7 @@ enum {
 	MM_ANONPAGES,	/* Resident anonymous pages */
 	MM_SWAPENTS,	/* Anonymous swap entries */
 	MM_SHMEMPAGES,	/* Resident shared memory pages */
+	MM_LAZYFREEPAGES, /* Lazyfree pages, also charged into MM_ANONPAGES */
 	NR_MM_COUNTERS
 };
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6b5818d..67c732b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,6 +107,8 @@ enum pageflags {
 #endif
 	__NR_PAGEFLAGS,
 
+	PG_lazyfreeaccounted = PG_mappedtodisk, /* only for anon MADV_FREE pages */
+
 	/* Filesystems */
 	PG_checked = PG_owner_priv_1,
 
@@ -428,6 +430,10 @@ TESTPAGEFLAG_FALSE(Ksm)
 
 u64 stable_page_flags(struct page *page);
 
+PAGEFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
+	TESTSETFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
+	TESTCLEARFLAG(LazyFreeAccounted, lazyfreeaccounted, PF_ANY)
+
 static inline int PageUptodate(struct page *page)
 {
 	int ret;
diff --git a/mm/gup.c b/mm/gup.c
index 1e67461..8cec0b5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -171,6 +171,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		 * mark_page_accessed().
 		 */
 		mark_page_accessed(page);
+		if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
+			dec_mm_counter(mm, MM_LAZYFREEPAGES);
 	}
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/* Do not mlock pte-mapped THP */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3bb5ad5..abb679d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -925,6 +925,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 	get_page(src_page);
 	page_dup_rmap(src_page, true);
+	if (PageAnon(src_page) && TestClearPageLazyFreeAccounted(src_page))
+		add_mm_counter(src_mm, MM_LAZYFREEPAGES, -HPAGE_PMD_NR);
 	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 	atomic_long_inc(&dst_mm->nr_ptes);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -1572,6 +1574,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	}
 
+	if (page_mapcount(page) == 1 && !TestSetPageLazyFreeAccounted(page))
+		add_mm_counter(mm, MM_LAZYFREEPAGES, HPAGE_PMD_NR);
 	mark_page_lazyfree(page);
 	ret = true;
 out:
@@ -1629,6 +1633,9 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			pte_free(tlb->mm, pgtable);
 			atomic_long_dec(&tlb->mm->nr_ptes);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+			if (TestClearPageLazyFreeAccounted(page))
+				add_mm_counter(tlb->mm, MM_LAZYFREEPAGES,
+						-HPAGE_PMD_NR);
 		} else {
 			if (arch_needs_pgtable_deposit())
 				zap_deposited_table(tlb->mm, pmd);
@@ -2160,6 +2167,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_swapbacked) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
+			 (1L << PG_lazyfreeaccounted) |
 			 (1L << PG_active) |
 			 (1L << PG_locked) |
 			 (1L << PG_unevictable) |
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4b499f..e4668db 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -577,6 +577,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		}
 		inc_node_page_state(page,
 				NR_ISOLATED_ANON + page_is_file_cache(page));
+		if (TestClearPageLazyFreeAccounted(page))
+			dec_mm_counter(vma->vm_mm, MM_LAZYFREEPAGES);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 851fabb..b051832 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -308,6 +308,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	int nr_swap = 0;
 	unsigned long next;
+	int nr_lazyfree_accounted = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -412,9 +413,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			set_pte_at(mm, addr, pte, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
+		if (page_mapcount(page) == 1 &&
+		    !TestSetPageLazyFreeAccounted(page))
+			nr_lazyfree_accounted++;
 		mark_page_lazyfree(page);
 	}
 out:
+	add_mm_counter(mm, MM_LAZYFREEPAGES, nr_lazyfree_accounted);
 	if (nr_swap) {
 		if (current->mm == mm)
 			sync_mm_rss(mm);
diff --git a/mm/memory.c b/mm/memory.c
index 0c759ba..7441430 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -849,7 +849,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 static inline unsigned long
 copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		unsigned long addr, int *rss)
+		unsigned long addr, int *rss, int *rss_src_lazyfree)
 {
 	unsigned long vm_flags = vma->vm_flags;
 	pte_t pte = *src_pte;
@@ -914,6 +914,9 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page, false);
+		if (PageAnon(page) &&
+		    TestClearPageLazyFreeAccounted(page))
+			(*rss_src_lazyfree)++;
 		rss[mm_counter(page)]++;
 	}
 
@@ -931,10 +934,12 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	spinlock_t *src_ptl, *dst_ptl;
 	int progress = 0;
 	int rss[NR_MM_COUNTERS];
+	int rss_src_lazyfree;
 	swp_entry_t entry = (swp_entry_t){0};
 
 again:
 	init_rss_vec(rss);
+	rss_src_lazyfree = 0;
 
 	dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
 	if (!dst_pte)
@@ -962,13 +967,14 @@ static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 			continue;
 		}
 		entry.val = copy_one_pte(dst_mm, src_mm, dst_pte, src_pte,
-							vma, addr, rss);
+					vma, addr, rss, &rss_src_lazyfree);
 		if (entry.val)
 			break;
 		progress += 8;
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
+	add_mm_counter(src_mm, MM_LAZYFREEPAGES, -rss_src_lazyfree);
 	spin_unlock(src_ptl);
 	pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
@@ -1173,6 +1179,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 					mark_page_accessed(page);
 			}
 			rss[mm_counter(page)]--;
+			if (PageAnon(page) &&
+			    TestClearPageLazyFreeAccounted(page))
+				rss[MM_LAZYFREEPAGES]--;
 			page_remove_rmap(page, false);
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 7c8df1f..16781ed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -622,6 +622,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
 		SetPageChecked(newpage);
 	if (PageMappedToDisk(page))
 		SetPageMappedToDisk(newpage);
+	if (PageLazyFreeAccounted(page))
+		SetPageLazyFreeAccounted(newpage);
 
 	/* Move dirty on pages not done by migrate_page_move_mapping() */
 	if (PageDirty(page))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 51c0918..54e0604 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -528,11 +528,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 					 NULL);
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
-	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, lazyfree-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
-			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+			K(get_mm_counter(mm, MM_SHMEMPAGES)),
+			K(get_mm_counter(mm, MM_LAZYFREEPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
@@ -878,11 +879,12 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	mark_oom_victim(victim);
-	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, lazyfree-rss:%lukB\n",
 		task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+		K(get_mm_counter(victim->mm, MM_LAZYFREEPAGES)));
 	task_unlock(victim);
 
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 2cbdada..96a2db8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1336,8 +1336,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					mlock_vma_page(page);
 				}
 				ret = SWAP_MLOCK;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+				goto err_out;
 			}
 			if (flags & TTU_MUNLOCK)
 				continue;
@@ -1347,8 +1346,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (ptep_clear_flush_young_notify(vma, address,
 						pvmw.pte)) {
 				ret = SWAP_FAIL;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+				goto err_out;
 			}
 		}
 
@@ -1428,16 +1426,14 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				} else {
 					set_pte_at(mm, address, pvmw.pte, pteval);
 					ret = SWAP_FAIL;
-					page_vma_mapped_walk_done(&pvmw);
-					break;
+					goto err_out;
 				}
 			}
 
 			if (swap_duplicate(entry) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				ret = SWAP_FAIL;
-				page_vma_mapped_walk_done(&pvmw);
-				break;
+				goto err_out;
 			}
 			if (list_empty(&mm->mmlist)) {
 				spin_lock(&mmlist_lock);
@@ -1456,9 +1452,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 discard:
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
+		/* regardless of success or failure, the page isn't lazyfree */
+		if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
+			add_mm_counter(mm, MM_LAZYFREEPAGES,
+						-hpage_nr_pages(page));
 		mmu_notifier_invalidate_page(mm, address);
 	}
 	return ret;
+err_out:
+	/* regardless of success or failure, the page isn't lazyfree */
+	if (PageAnon(page) && TestClearPageLazyFreeAccounted(page))
+		add_mm_counter(mm, MM_LAZYFREEPAGES, -hpage_nr_pages(page));
+	page_vma_mapped_walk_done(&pvmw);
+	return ret;
 }
 
 bool is_vma_temporary_stack(struct vm_area_struct *vma)
-- 
2.9.3


* Re: [PATCH V3 1/7] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-14 19:36 ` [PATCH V3 1/7] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
@ 2017-02-16 17:39   ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-16 17:39 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Tue, Feb 14, 2017 at 11:36:07AM -0800, Shaohua Li wrote:
> There are a few places the code assumes anonymous pages should have
> SwapBacked flag set. MADV_FREE pages are anonymous pages but we are
> going to add them to LRU_INACTIVE_FILE list and clear SwapBacked flag
> for them. The assumption doesn't hold any more, so fix them.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-14 19:36 ` [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
@ 2017-02-16 17:52   ` Johannes Weiner
  2017-02-17  0:35     ` Shaohua Li
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2017-02-16 17:52 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Tue, Feb 14, 2017 at 11:36:08AM -0800, Shaohua Li wrote:
> @@ -126,4 +126,24 @@ static __always_inline enum lru_list page_lru(struct page *page)
>  
>  #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
>  
> +/*
> + * lazyfree pages are clean anonymous pages. They have SwapBacked flag cleared
> > + * to distinguish them from normal anonymous pages.
> + */
> +static inline void set_page_lazyfree(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(!PageAnon(page) || !PageSwapBacked(page), page);
> +	ClearPageSwapBacked(page);
> +}
> +
> +static inline void clear_page_lazyfree(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(!PageAnon(page) || PageSwapBacked(page), page);
> +	SetPageSwapBacked(page);
> +}
> +
> +static inline bool page_is_lazyfree(struct page *page)
> +{
> +	return PageAnon(page) && !PageSwapBacked(page);
> +}

Sorry for not getting to v2 in time, but I have to say I strongly
agree with your first iterations and would much prefer this to be
open-coded.

IMO this needlessly introduces a new state opaquely called "lazyfree",
when really that's just anonymous pages that don't need to be swapped
before reclaim - PageAnon && !PageSwapBacked. Very simple MM concept.

That especially shows when we later combine it with page_is_file_cache
checks like the next patch does.
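
For illustration only, the open-coded call sites would look roughly like
this (a sketch, not code from the patch):

	/* madvise(MADV_FREE): declare the anon page's contents invalid */
	ClearPageSwapBacked(page);

	/* reuse/rescue: the page holds valid data again */
	SetPageSwapBacked(page);

	/* the test, wherever the state matters */
	bool can_discard = PageAnon(page) && !PageSwapBacked(page);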

The rest of the patch looks good to me.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-14 19:36 ` [PATCH V3 3/7] mm: reclaim MADV_FREE pages Shaohua Li
@ 2017-02-16 18:40   ` Johannes Weiner
  2017-02-17  0:27     ` Shaohua Li
  2017-02-17  5:41     ` Minchan Kim
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-16 18:40 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> @@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
>  				page);
>  
> -			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
> -				/* It's a freeable page by MADV_FREE */
> -				dec_mm_counter(mm, MM_ANONPAGES);
> -				rp->lazyfreed++;
> -				goto discard;
> +			if (flags & TTU_LZFREE) {
> +				if (!PageDirty(page)) {
> +					/* It's a freeable page by MADV_FREE */
> +					dec_mm_counter(mm, MM_ANONPAGES);
> +					rp->lazyfreed++;
> +					goto discard;
> +				} else {
> +					set_pte_at(mm, address, pvmw.pte, pteval);
> +					ret = SWAP_FAIL;
> +					page_vma_mapped_walk_done(&pvmw);
> +					break;
> +				}

I don't understand why we need the TTU_LZFREE bit in general. More on
that below at the callsite.

> @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
>  	 * Anonymous pages are not handled by flushers and must be written
>  	 * from reclaim context. Do not stall reclaim based on them
>  	 */
> -	if (!page_is_file_cache(page)) {
> +	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {

Do we need this? MADV_FREE clears the dirty bit off the page; we could
just let them go through with the function without any special-casing.

> @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  		sc->nr_scanned++;
>  
> +		lazyfree = page_is_lazyfree(page);
> +
>  		if (unlikely(!page_evictable(page)))
>  			goto cull_mlocked;
>  
> @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			goto keep_locked;
>  
>  		/* Double the slab pressure for mapped and swapcache pages */
> -		if (page_mapped(page) || PageSwapCache(page))
> +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
>  			sc->nr_scanned++;
>  
>  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> @@ -1119,13 +1121,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
> +		 * Lazyfree page could be freed directly
>  		 */
> -		if (PageAnon(page) && !PageSwapCache(page)) {
> +		if (PageAnon(page) && !PageSwapCache(page) && !lazyfree) {

lazyfree duplicates the anon check. As per the previous email, IMO it
would be much preferable to get rid of that obscuring "lazyfree" here.

This would simply be:

		if (PageAnon(page) && PageSwapBacked && !PageSwapCache)

> @@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 * The page is mapped into the page tables of one or more
>  		 * processes. Try to unmap it here.
>  		 */
> -		if (page_mapped(page) && mapping) {
> +		if (page_mapped(page) && (mapping || lazyfree)) {

Do we actually need to filter for mapping || lazyfree? If we fail to
allocate swap, we don't reach here. If the page is a truncated file
page, ttu returns pretty much instantly with SWAP_AGAIN. We should be
able to just check for page_mapped() alone, no?

>  			switch (ret = try_to_unmap(page, lazyfree ?
>  				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
>  				(ttu_flags | TTU_BATCH_FLUSH))) {

That bit I don't understand. Why do we need to pass TTU_LZFREE? What
information does that carry that cannot be gathered from inside ttu?

I.e. when ttu runs into PageAnon, can it simply check !PageSwapBacked?
And if it's still clean, it can lazyfreed++; goto discard.

Am I overlooking something?
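
Something like this, as a sketch of the flag-less branch (adapted from the
hunk above, not tested):

	/* in try_to_unmap_one(), no TTU_LZFREE needed */
	if (PageAnon(page) && !PageSwapBacked(page)) {
		if (!PageDirty(page)) {
			/* clean MADV_FREE page, discard without swap */
			dec_mm_counter(mm, MM_ANONPAGES);
			rp->lazyfreed++;
			goto discard;
		}
		/* redirtied since MADV_FREE, restore the pte and fail */
		set_pte_at(mm, address, pvmw.pte, pteval);
		ret = SWAP_FAIL;
		page_vma_mapped_walk_done(&pvmw);
		break;
	}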

> @@ -1154,7 +1156,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			case SWAP_MLOCK:
>  				goto cull_mlocked;
>  			case SWAP_LZFREE:
> -				goto lazyfree;
> +				/* follow __remove_mapping for reference */
> +				if (page_ref_freeze(page, 1)) {
> +					if (!PageDirty(page))
> +						goto lazyfree;
> +					else
> +						page_ref_unfreeze(page, 1);
> +				}
> +				goto keep_locked;
>  			case SWAP_SUCCESS:
>  				; /* try to free the page below */

This is a similar situation.

Can we let the page go through the regular __remove_mapping() process
and simply have that function check for PageAnon && !PageSwapBacked?

> @@ -1266,10 +1275,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -lazyfree:
>  		if (!mapping || !__remove_mapping(mapping, page, true))
>  			goto keep_locked;
> -
> +lazyfree:

... eliminating this special casing.

> @@ -1294,6 +1302,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  cull_mlocked:
>  		if (PageSwapCache(page))
>  			try_to_free_swap(page);
> +		if (lazyfree)
> +			clear_page_lazyfree(page);

Why cancel the MADV_FREE state? The combination seems non-sensical,
but we can simply retain the invalidated state while the page goes to
the unevictable list; munlock should move it back to inactive_file.

>  		unlock_page(page);
>  		list_add(&page->lru, &ret_pages);
>  		continue;
> @@ -1303,6 +1313,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
>  			try_to_free_swap(page);
>  		VM_BUG_ON_PAGE(PageActive(page), page);
> +		if (lazyfree)
> +			clear_page_lazyfree(page);

This is similar too.

Can we simply leave the page alone here? The only way we get to
this point is if somebody is reading the invalidated page. It's weird
for a lazyfreed page to become active, but it doesn't seem to warrant
active intervention here.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-16 18:40   ` Johannes Weiner
@ 2017-02-17  0:27     ` Shaohua Li
  2017-02-17  5:45       ` Minchan Kim
  2017-02-17 16:01       ` Johannes Weiner
  2017-02-17  5:41     ` Minchan Kim
  1 sibling, 2 replies; 23+ messages in thread
From: Shaohua Li @ 2017-02-17  0:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > @@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> >  				page);
> >  
> > -			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
> > -				/* It's a freeable page by MADV_FREE */
> > -				dec_mm_counter(mm, MM_ANONPAGES);
> > -				rp->lazyfreed++;
> > -				goto discard;
> > +			if (flags & TTU_LZFREE) {
> > +				if (!PageDirty(page)) {
> > +					/* It's a freeable page by MADV_FREE */
> > +					dec_mm_counter(mm, MM_ANONPAGES);
> > +					rp->lazyfreed++;
> > +					goto discard;
> > +				} else {
> > +					set_pte_at(mm, address, pvmw.pte, pteval);
> > +					ret = SWAP_FAIL;
> > +					page_vma_mapped_walk_done(&pvmw);
> > +					break;
> > +				}
> 
> I don't understand why we need the TTU_LZFREE bit in general. More on
> that below at the callsite.

Sounds like a useless flag; I don't see any reason we shouldn't free the
MADV_FREE page in places other than reclaim. Looks like TTU_UNMAP is useless
too.

> > @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
> >  	 * Anonymous pages are not handled by flushers and must be written
> >  	 * from reclaim context. Do not stall reclaim based on them
> >  	 */
> > -	if (!page_is_file_cache(page)) {
> > +	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
> 
> Do we need this? MADV_FREE clears the dirty bit off the page; we could
> just let them go through with the function without any special-casing.

this is just to zero the dirty and writeback flags
> 
> > @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  
> >  		sc->nr_scanned++;
> >  
> > +		lazyfree = page_is_lazyfree(page);
> > +
> >  		if (unlikely(!page_evictable(page)))
> >  			goto cull_mlocked;
> >  
> > @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			goto keep_locked;
> >  
> >  		/* Double the slab pressure for mapped and swapcache pages */
> > -		if (page_mapped(page) || PageSwapCache(page))
> > +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
> >  			sc->nr_scanned++;
> >  
> >  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > @@ -1119,13 +1121,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		/*
> >  		 * Anonymous process memory has backing store?
> >  		 * Try to allocate it some swap space here.
> > +		 * Lazyfree page could be freed directly
> >  		 */
> > -		if (PageAnon(page) && !PageSwapCache(page)) {
> > +		if (PageAnon(page) && !PageSwapCache(page) && !lazyfree) {
> 
> lazyfree duplicates the anon check. As per the previous email, IMO it
> would be much preferable to get rid of that obscuring "lazyfree" here.
> 
> This would simply be:
> 
> 		if (PageAnon(page) && PageSwapBacked && !PageSwapCache)

I'd agree if we just didn't need the lazyfree variable, but I think we still
need to check it in other places. More below.

> > @@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		 * The page is mapped into the page tables of one or more
> >  		 * processes. Try to unmap it here.
> >  		 */
> > -		if (page_mapped(page) && mapping) {
> > +		if (page_mapped(page) && (mapping || lazyfree)) {
> 
> Do we actually need to filter for mapping || lazyfree? If we fail to
> allocate swap, we don't reach here. If the page is a truncated file
> page, ttu returns pretty much instantly with SWAP_AGAIN. We should be
> able to just check for page_mapped() alone, no?

checking the mapping is faster than running into try_to_unmap, right?

> >  			switch (ret = try_to_unmap(page, lazyfree ?
> >  				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
> >  				(ttu_flags | TTU_BATCH_FLUSH))) {
> 
> That bit I don't understand. Why do we need to pass TTU_LZFREE? What
> information does that carry that cannot be gathered from inside ttu?
> 
> I.e. when ttu runs into PageAnon, can it simply check !PageSwapBacked?
> And if it's still clean, it can lazyfreed++; goto discard.
> 
> Am I overlooking something?
> 
> > @@ -1154,7 +1156,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			case SWAP_MLOCK:
> >  				goto cull_mlocked;
> >  			case SWAP_LZFREE:
> > -				goto lazyfree;
> > +				/* follow __remove_mapping for reference */
> > +				if (page_ref_freeze(page, 1)) {
> > +					if (!PageDirty(page))
> > +						goto lazyfree;
> > +					else
> > +						page_ref_unfreeze(page, 1);
> > +				}
> > +				goto keep_locked;
> >  			case SWAP_SUCCESS:
> >  				; /* try to free the page below */
> 
> This is a similar situation.
> 
> Can we let the page go through the regular __remove_mapping() process
> and simply have that function check for PageAnon && !PageSwapBacked?

That will make the code more complicated. We don't call __remove_mapping if
!mapping. And we would need to add bypasses in __remove_mapping, for example,
to avoid taking mapping->lock.
 
> > @@ -1266,10 +1275,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			}
> >  		}
> >  
> > -lazyfree:
> >  		if (!mapping || !__remove_mapping(mapping, page, true))
> >  			goto keep_locked;
> > -
> > +lazyfree:
> 
> ... eliminating this special casing.
> 
> > @@ -1294,6 +1302,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  cull_mlocked:
> >  		if (PageSwapCache(page))
> >  			try_to_free_swap(page);
> > +		if (lazyfree)
> > +			clear_page_lazyfree(page);
> 
> Why cancel the MADV_FREE state? The combination seems non-sensical,
> but we can simply retain the invalidated state while the page goes to
> the unevictable list; munlock should move it back to inactive_file.

This depends on the policy. If the user locks the page, I think it's
reasonable to assume the page is hot, so it doesn't make sense to treat the
page as lazyfree.
 
> >  		unlock_page(page);
> >  		list_add(&page->lru, &ret_pages);
> >  		continue;
> > @@ -1303,6 +1313,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
> >  			try_to_free_swap(page);
> >  		VM_BUG_ON_PAGE(PageActive(page), page);
> > +		if (lazyfree)
> > +			clear_page_lazyfree(page);
> 
> This is similar too.
> 
> Can we simply leave the page alone here? The only way we get to
> this point is if somebody is reading the invalidated page. It's weird
> for a lazyfreed page to become active, but it doesn't seem to warrant
> active intervention here.

So the unmap fails here probably because the page is dirty, which means the
page was written recently. It makes sense to assume the page is hot.

Thanks,
Shaohua


* Re: [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-16 17:52   ` Johannes Weiner
@ 2017-02-17  0:35     ` Shaohua Li
  2017-02-17 16:22       ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2017-02-17  0:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Thu, Feb 16, 2017 at 12:52:53PM -0500, Johannes Weiner wrote:
> On Tue, Feb 14, 2017 at 11:36:08AM -0800, Shaohua Li wrote:
> > @@ -126,4 +126,24 @@ static __always_inline enum lru_list page_lru(struct page *page)
> >  
> >  #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
> >  
> > +/*
> > + * lazyfree pages are clean anonymous pages. They have SwapBacked flag cleared
> > > + * to distinguish them from normal anonymous pages.
> > + */
> > +static inline void set_page_lazyfree(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(!PageAnon(page) || !PageSwapBacked(page), page);
> > +	ClearPageSwapBacked(page);
> > +}
> > +
> > +static inline void clear_page_lazyfree(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(!PageAnon(page) || PageSwapBacked(page), page);
> > +	SetPageSwapBacked(page);
> > +}
> > +
> > +static inline bool page_is_lazyfree(struct page *page)
> > +{
> > +	return PageAnon(page) && !PageSwapBacked(page);
> > +}
> 
> Sorry for not getting to v2 in time, but I have to say I strongly
> agree with your first iterations and would much prefer this to be
> open-coded.
> 
> IMO this needlessly introduces a new state opaquely called "lazyfree",
> when really that's just anonymous pages that don't need to be swapped
> before reclaim - PageAnon && !PageSwapBacked. Very simple MM concept.
> 
> That especially shows when we later combine it with page_is_file_cache
> checks like the next patch does.
> 
> The rest of the patch looks good to me.

Thanks! I do agree checking PageSwapBacked is clearer, but Minchan convinced me
because of the accounting issue. Where do you suggest we should put the
accounting?

Thanks,
Shaohua


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-16 18:40   ` Johannes Weiner
  2017-02-17  0:27     ` Shaohua Li
@ 2017-02-17  5:41     ` Minchan Kim
  2017-02-17  9:27       ` Minchan Kim
  2017-02-17 16:15       ` Johannes Weiner
  1 sibling, 2 replies; 23+ messages in thread
From: Minchan Kim @ 2017-02-17  5:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, mhocko, hughd,
	riel, mgorman, akpm

Hi Johannes,

On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > @@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> >  				page);
> >  
> > -			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
> > -				/* It's a freeable page by MADV_FREE */
> > -				dec_mm_counter(mm, MM_ANONPAGES);
> > -				rp->lazyfreed++;
> > -				goto discard;
> > +			if (flags & TTU_LZFREE) {
> > +				if (!PageDirty(page)) {
> > +					/* It's a freeable page by MADV_FREE */
> > +					dec_mm_counter(mm, MM_ANONPAGES);
> > +					rp->lazyfreed++;
> > +					goto discard;
> > +				} else {
> > +					set_pte_at(mm, address, pvmw.pte, pteval);
> > +					ret = SWAP_FAIL;
> > +					page_vma_mapped_walk_done(&pvmw);
> > +					break;
> > +				}
> 
> I don't understand why we need the TTU_LZFREE bit in general. More on
> that below at the callsite.

The reason I introduced it was that ttu is used for the migration/THP split
paths as well as reclaim. It's clearly right to discard them in the reclaim
path, because that surely means memory pressure, but it's not so clear for
the other paths.

If you guys think it's always a win to discard them in try_to_unmap
unconditionally, I think that would be better as a separate patch.

> 
> > @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
> >  	 * Anonymous pages are not handled by flushers and must be written
> >  	 * from reclaim context. Do not stall reclaim based on them
> >  	 */
> > -	if (!page_is_file_cache(page)) {
> > +	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
> 
> Do we need this? MADV_FREE clears the dirty bit off the page; we could
> just let them go through with the function without any special-casing.

I thought some driver could potentially do GUP with FOLL_TOUCH, so that a
lazyfree page can have PG_dirty with !PG_swapbacked. In that case, couldn't
the throttling logic of shrink_page_list get confused?

> 
> > @@ -986,6 +986,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  
> >  		sc->nr_scanned++;
> >  
> > +		lazyfree = page_is_lazyfree(page);
> > +
> >  		if (unlikely(!page_evictable(page)))
> >  			goto cull_mlocked;
> >  
> > @@ -993,7 +995,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			goto keep_locked;
> >  
> >  		/* Double the slab pressure for mapped and swapcache pages */
> > -		if (page_mapped(page) || PageSwapCache(page))
> > +		if ((page_mapped(page) || PageSwapCache(page)) && !lazyfree)
> >  			sc->nr_scanned++;
> >  
> >  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> > @@ -1119,13 +1121,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		/*
> >  		 * Anonymous process memory has backing store?
> >  		 * Try to allocate it some swap space here.
> > +		 * Lazyfree page could be freed directly
> >  		 */
> > -		if (PageAnon(page) && !PageSwapCache(page)) {
> > +		if (PageAnon(page) && !PageSwapCache(page) && !lazyfree) {
> 
> lazyfree duplicates the anon check. As per the previous email, IMO it
> would be much preferable to get rid of that obscuring "lazyfree" here.
> 
> This would simply be:
> 
> 		if (PageAnon(page) && PageSwapBacked && !PageSwapCache)

Agree.

> 
> > @@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		 * The page is mapped into the page tables of one or more
> >  		 * processes. Try to unmap it here.
> >  		 */
> > -		if (page_mapped(page) && mapping) {
> > +		if (page_mapped(page) && (mapping || lazyfree)) {
> 
> Do we actually need to filter for mapping || lazyfree? If we fail to
> allocate swap, we don't reach here. If the page is a truncated file
> page, ttu returns pretty much instantly with SWAP_AGAIN. We should be
> able to just check for page_mapped() alone, no?

try_to_unmap_one assumes every anonymous page it reaches will have a
swp_entry, so it would have to be changed to check PageSwapCache if we go
that way.

> 
> >  			switch (ret = try_to_unmap(page, lazyfree ?
> >  				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
> >  				(ttu_flags | TTU_BATCH_FLUSH))) {
> 
> That bit I don't understand. Why do we need to pass TTU_LZFREE? What
> information does that carry that cannot be gathered from inside ttu?
> 
> I.e. when ttu runs into PageAnon, can it simply check !PageSwapBacked?
> And if it's still clean, it can lazyfreed++; goto discard.
> 
> Am I overlooking something?

As I said above, TTU_LZFREE signals when we should discard the page, and in
my implementation I thought that should only be shrink_page_list, which is
the path driven by memory pressure.

Thanks.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17  0:27     ` Shaohua Li
@ 2017-02-17  5:45       ` Minchan Kim
  2017-02-17 16:11         ` Johannes Weiner
  2017-02-17 16:01       ` Johannes Weiner
  1 sibling, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2017-02-17  5:45 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Johannes Weiner, linux-mm, linux-kernel, Kernel-team, mhocko,
	hughd, riel, mgorman, akpm

Hi Shaohua,

On Thu, Feb 16, 2017 at 04:27:18PM -0800, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > @@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> > >  				page);
> > >  
> > > -			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
> > > -				/* It's a freeable page by MADV_FREE */
> > > -				dec_mm_counter(mm, MM_ANONPAGES);
> > > -				rp->lazyfreed++;
> > > -				goto discard;
> > > +			if (flags & TTU_LZFREE) {
> > > +				if (!PageDirty(page)) {
> > > +					/* It's a freeable page by MADV_FREE */
> > > +					dec_mm_counter(mm, MM_ANONPAGES);
> > > +					rp->lazyfreed++;
> > > +					goto discard;
> > > +				} else {
> > > +					set_pte_at(mm, address, pvmw.pte, pteval);
> > > +					ret = SWAP_FAIL;
> > > +					page_vma_mapped_walk_done(&pvmw);
> > > +					break;
> > > +				}
> > 
> > I don't understand why we need the TTU_LZFREE bit in general. More on
> > that below at the callsite.
> 
> Sounds like a useless flag; I don't see any reason we shouldn't free the
> MADV_FREE page in places other than reclaim. Looks like TTU_UNMAP is useless
> too.

Agree on TTU_UNMAP, but THP split, for example, doesn't mean freeing
lazyfree pages, I think.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17  5:41     ` Minchan Kim
@ 2017-02-17  9:27       ` Minchan Kim
  2017-02-17 16:15       ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2017-02-17  9:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, mhocko, hughd,
	riel, mgorman, akpm

On Fri, Feb 17, 2017 at 02:41:08PM +0900, Minchan Kim wrote:
> Hi Johannes,
> 
> On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > @@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> > >  				page);
> > >  
> > > -			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
> > > -				/* It's a freeable page by MADV_FREE */
> > > -				dec_mm_counter(mm, MM_ANONPAGES);
> > > -				rp->lazyfreed++;
> > > -				goto discard;
> > > +			if (flags & TTU_LZFREE) {
> > > +				if (!PageDirty(page)) {
> > > +					/* It's a freeable page by MADV_FREE */
> > > +					dec_mm_counter(mm, MM_ANONPAGES);
> > > +					rp->lazyfreed++;
> > > +					goto discard;
> > > +				} else {
> > > +					set_pte_at(mm, address, pvmw.pte, pteval);
> > > +					ret = SWAP_FAIL;
> > > +					page_vma_mapped_walk_done(&pvmw);
> > > +					break;
> > > +				}
> > 
> > I don't understand why we need the TTU_LZFREE bit in general. More on
> > that below at the callsite.
> 
> The reason I introduced it was that ttu is used for the migration/THP split
> paths as well as reclaim. It's clearly right to discard them in the reclaim
> path, because that surely means memory pressure, but it's not so clear for
> the other paths.
> 
> If you guys think it's always a win to discard them in try_to_unmap
> unconditionally, I think that would be better as a separate patch.

I was totally wrong.

An anon page under THP split/migration/HWPoison will not reach the discard
path in try_to_unmap_one, so Johannes is right. We don't need TTU_LZFREE.

Sorry for the noise.

Thanks.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17  0:27     ` Shaohua Li
  2017-02-17  5:45       ` Minchan Kim
@ 2017-02-17 16:01       ` Johannes Weiner
  2017-02-17 18:43         ` Shaohua Li
  1 sibling, 1 reply; 23+ messages in thread
From: Johannes Weiner @ 2017-02-17 16:01 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Thu, Feb 16, 2017 at 04:27:18PM -0800, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
> > >  	 * Anonymous pages are not handled by flushers and must be written
> > >  	 * from reclaim context. Do not stall reclaim based on them
> > >  	 */
> > > -	if (!page_is_file_cache(page)) {
> > > +	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
> > 
> > Do we need this? MADV_FREE clears the dirty bit off the page; we could
> > just let them go through with the function without any special-casing.
> 
> this is just to zero the dirty and writeback flags

Okay, I assumed that the page would always be !dirty && !writeback
here anyway, so we might as well fall through and check those bits.

But a previously failed TTU might have moved a pte dirty bit to the
page, so yes, we do need to filter for anon && !swapbacked here.
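
Concretely, page_check_dirty_writeback() could then read (sketch):

	/*
	 * Anon pages, including clean MADV_FREE ones, are not handled
	 * by the flusher threads, so report them clean here.
	 */
	if (!page_is_file_cache(page) ||
	    (PageAnon(page) && !PageSwapBacked(page))) {
		*dirty = false;
		*writeback = false;
		return;
	}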

> > > @@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  		 * The page is mapped into the page tables of one or more
> > >  		 * processes. Try to unmap it here.
> > >  		 */
> > > -		if (page_mapped(page) && mapping) {
> > > +		if (page_mapped(page) && (mapping || lazyfree)) {
> > 
> > Do we actually need to filter for mapping || lazyfree? If we fail to
> > allocate swap, we don't reach here. If the page is a truncated file
> > page, ttu returns pretty much instantly with SWAP_AGAIN. We should be
> > able to just check for page_mapped() alone, no?
> 
> checking the mapping is faster than running into try_to_unmap, right?

!mapping should be a rare case. In reclaim code, I think it's better
to keep it simple than to optimize away the rare function call.

> > > @@ -1154,7 +1156,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  			case SWAP_MLOCK:
> > >  				goto cull_mlocked;
> > >  			case SWAP_LZFREE:
> > > -				goto lazyfree;
> > > +				/* follow __remove_mapping for reference */
> > > +				if (page_ref_freeze(page, 1)) {
> > > +					if (!PageDirty(page))
> > > +						goto lazyfree;
> > > +					else
> > > +						page_ref_unfreeze(page, 1);
> > > +				}
> > > +				goto keep_locked;
> > >  			case SWAP_SUCCESS:
> > >  				; /* try to free the page below */
> > 
> > This is a similar situation.
> > 
> > Can we let the page go through the regular __remove_mapping() process
> > and simply have that function check for PageAnon && !PageSwapBacked?
> 
> That will make the code more complicated. We don't call __remove_mapping if
> !mapping. And we would need to add bypasses in __remove_mapping, for example,
> to avoid taking mapping->lock.

True, we won't get around a separate freeing path as long as the
refcount handling is intertwined with the mapping removal like that :/

What we should be able to do, however, is remove at least SWAP_LZFREE
and stick with SWAP_SUCCESS. On success, we can fall through up until
we do the __remove_mapping call. The page isn't dirty, so we skip that
PageDirty block; the page doesn't have private data, so we skip that
block too. And then we can branch on PageAnon && !PageSwapBacked that
does our alternate freeing path or __remove_mapping for others.
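
As a sketch of that flow, with SWAP_LZFREE gone and success falling through
to the freeing code:

	if (PageAnon(page) && !PageSwapBacked(page)) {
		/* lazyfree page: no mapping, handle the refcount ourselves */
		if (!page_ref_freeze(page, 1))
			goto keep_locked;
		if (PageDirty(page)) {
			page_ref_unfreeze(page, 1);
			goto keep_locked;
		}
	} else if (!mapping || !__remove_mapping(mapping, page, true))
		goto keep_locked;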

> > > @@ -1294,6 +1302,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  cull_mlocked:
> > >  		if (PageSwapCache(page))
> > >  			try_to_free_swap(page);
> > > +		if (lazyfree)
> > > +			clear_page_lazyfree(page);
> > 
> > Why cancel the MADV_FREE state? The combination seems non-sensical,
> > but we can simply retain the invalidated state while the page goes to
> > the unevictable list; munlock should move it back to inactive_file.
> 
> This depends on the policy. If the user locks the page, I think it's
> reasonable to assume the page is hot, so it doesn't make sense to treat the
> page as lazyfree.

I think the key issue is whether the page contains valid data, not
whether it is hot. When we clear the dirty bits along with
PageSwapBacked, we're declaring the data in the page invalid. There is
no practical usecase to mlock a page with invalid data, sure, but the
act of mlocking a page doesn't make its contents suddenly valid again.

I.e. I'd stick with the pure data integrity perspective here. That's
clearer and less error prone than intermingling it with eviction
policy, to avoid accidents where we lose valid data.

> > >  		unlock_page(page);
> > >  		list_add(&page->lru, &ret_pages);
> > >  		continue;
> > > @@ -1303,6 +1313,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
> > >  			try_to_free_swap(page);
> > >  		VM_BUG_ON_PAGE(PageActive(page), page);
> > > +		if (lazyfree)
> > > +			clear_page_lazyfree(page);
> > 
> > This is similar too.
> > 
> > Can we simply leave the page alone here? The only way we get to
> > this point is if somebody is reading the invalidated page. It's weird
> > for a lazyfreed page to become active, but it doesn't seem to warrant
> > active intervention here.
> 
> So the unmap fails here probably because the page is dirty, which means the
> page was written recently. It makes sense to assume the page is hot.

Ah, good point.

But can we handle that explicitly please? Like above, I don't want to
undo the data invalidation just because somebody read the invalid data
a bunch of times and it has the access bits set. We should only re-set
the PageSwapBacked based on whether the page is actually dirty.

Maybe along the lines of SWAP_MLOCK we could add SWAP_DIRTY when TTU
fails because the page is dirty, and then have a cull_dirty: label in
shrink_page_list handle the lazy rescue of a reused MADV_FREE page?

This should work well with removing the mapping || lazyfree check when
calling TTU. Then TTU can fail on dirty && !mapping, which is a much
more obvious way of expressing it IMO - "This page contains valid data
but there is no mapping that backs it once we unmap it. Abort."
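
A sketch of that check inside try_to_unmap_one() (placement hypothetical):

	/*
	 * The page holds valid data, but nothing backs it once we
	 * unmap it: a reused MADV_FREE page without swap. Abort.
	 */
	if (PageDirty(page) && !page_mapping(page)) {
		ret = SWAP_DIRTY;
		page_vma_mapped_walk_done(&pvmw);
		break;
	}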

That's mostly why I'm in favor of removing the idea of a "lazyfree"
page as much as possible. IMO this whole thing becomes much more
understandable - and less bolted on to the side of the VM - when we
express it in existing concepts the VM uses for data integrity.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17  5:45       ` Minchan Kim
@ 2017-02-17 16:11         ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-17 16:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, mhocko, hughd,
	riel, mgorman, akpm

Hi Minchan,

On Fri, Feb 17, 2017 at 02:45:55PM +0900, Minchan Kim wrote:
> On Thu, Feb 16, 2017 at 04:27:18PM -0800, Shaohua Li wrote:
> > On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > > @@ -1419,11 +1419,18 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> > > >  				page);
> > > >  
> > > > -			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
> > > > -				/* It's a freeable page by MADV_FREE */
> > > > -				dec_mm_counter(mm, MM_ANONPAGES);
> > > > -				rp->lazyfreed++;
> > > > -				goto discard;
> > > > +			if (flags & TTU_LZFREE) {
> > > > +				if (!PageDirty(page)) {
> > > > +					/* It's a freeable page by MADV_FREE */
> > > > +					dec_mm_counter(mm, MM_ANONPAGES);
> > > > +					rp->lazyfreed++;
> > > > +					goto discard;
> > > > +				} else {
> > > > +					set_pte_at(mm, address, pvmw.pte, pteval);
> > > > +					ret = SWAP_FAIL;
> > > > +					page_vma_mapped_walk_done(&pvmw);
> > > > +					break;
> > > > +				}
> > > 
> > > I don't understand why we need the TTU_LZFREE bit in general. More on
> > > that below at the callsite.
> > 
> > Sounds like a useless flag; I don't see any reason we shouldn't free the
> > MADV_FREE page in places other than reclaim. Looks like TTU_UNMAP is useless
> > too.
> 
> Agree on TTU_UNMAP, but THP split, for example, doesn't mean freeing
> lazyfree pages, I think.

Anon THP splitting uses the migration branch, so we should be fine.


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17  5:41     ` Minchan Kim
  2017-02-17  9:27       ` Minchan Kim
@ 2017-02-17 16:15       ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-17 16:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, mhocko, hughd,
	riel, mgorman, akpm

On Fri, Feb 17, 2017 at 02:41:08PM +0900, Minchan Kim wrote:
> Hi Johannes,
> 
> On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
> > >  	 * Anonymous pages are not handled by flushers and must be written
> > >  	 * from reclaim context. Do not stall reclaim based on them
> > >  	 */
> > > -	if (!page_is_file_cache(page)) {
> > > +	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
> > 
> > Do we need this? MADV_FREE clears the dirty bit off the page; we could
> > just let them go through with the function without any special-casing.
> 
> I thought some driver could potentially do GUP with FOLL_TOUCH, so that a
> lazyfree page can have PG_dirty with !PG_swapbacked. In that case, couldn't
> the throttling logic of shrink_page_list get confused?

Yep, agreed. We should filter these pages here.

> > > @@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  		 * The page is mapped into the page tables of one or more
> > >  		 * processes. Try to unmap it here.
> > >  		 */
> > > -		if (page_mapped(page) && mapping) {
> > > +		if (page_mapped(page) && (mapping || lazyfree)) {
> > 
> > Do we actually need to filter for mapping || lazyfree? If we fail to
> > allocate swap, we don't reach here. If the page is a truncated file
> > page, ttu returns pretty much instantly with SWAP_AGAIN. We should be
> > able to just check for page_mapped() alone, no?
> 
> try_to_unmap_one assumes every anonymous page it reaches will have a
> swp_entry, so it would have to be changed to check PageSwapCache if we go
> that way.

Yep, I think it should check page_mapping(). To me that would make the
most sense; see the other email: "Don't unmap a ram page with valid data
when there is no secondary storage mapping to maintain integrity."


* Re: [PATCH V3 4/7] mm: enable MADV_FREE for swapless system
  2017-02-14 19:36 ` [PATCH V3 4/7] mm: enable MADV_FREE for swapless system Shaohua Li
@ 2017-02-17 16:16   ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-17 16:16 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Tue, Feb 14, 2017 at 11:36:10AM -0800, Shaohua Li wrote:
> Now MADV_FREE pages can be easily reclaimed even on a swapless system. We
> can safely enable MADV_FREE for all systems.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH V3 2/7] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-17  0:35     ` Shaohua Li
@ 2017-02-17 16:22       ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-17 16:22 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Thu, Feb 16, 2017 at 04:35:25PM -0800, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 12:52:53PM -0500, Johannes Weiner wrote:
> > On Tue, Feb 14, 2017 at 11:36:08AM -0800, Shaohua Li wrote:
> > > @@ -126,4 +126,24 @@ static __always_inline enum lru_list page_lru(struct page *page)
> > >  
> > >  #define lru_to_page(head) (list_entry((head)->prev, struct page, lru))
> > >  
> > > +/*
> > > + * lazyfree pages are clean anonymous pages. They have SwapBacked flag cleared
> > > > + * to distinguish them from normal anonymous pages.
> > > + */
> > > +static inline void set_page_lazyfree(struct page *page)
> > > +{
> > > +	VM_BUG_ON_PAGE(!PageAnon(page) || !PageSwapBacked(page), page);
> > > +	ClearPageSwapBacked(page);
> > > +}
> > > +
> > > +static inline void clear_page_lazyfree(struct page *page)
> > > +{
> > > +	VM_BUG_ON_PAGE(!PageAnon(page) || PageSwapBacked(page), page);
> > > +	SetPageSwapBacked(page);
> > > +}
> > > +
> > > +static inline bool page_is_lazyfree(struct page *page)
> > > +{
> > > +	return PageAnon(page) && !PageSwapBacked(page);
> > > +}
> > 
> > Sorry for not getting to v2 in time, but I have to say I strongly
> > agree with your first iterations and would much prefer this to be
> > open-coded.
> > 
> > IMO this needlessly introduces a new state opaquely called "lazyfree",
> > when really that's just anonymous pages that don't need to be swapped
> > before reclaim - PageAnon && !PageSwapBacked. Very simple MM concept.
> > 
> > That especially shows when we later combine it with page_is_file_cache
> > checks like the next patch does.
> > 
> > The rest of the patch looks good to me.
> 
> Thanks! I do agree checking PageSwapBacked is clearer, but Minchan convinced me
> because of the accounting issue. Where do you suggest we should put the
> accounting?

I now proposed quite a few changes to the setting and clearing sites,
so it's harder to judge, but AFAICT once those sites are consolidated,
open-coding the stat updates as well shouldn't be too bad, right?

One site to clear during MADV_FREE, one site to set during reclaim.
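
Something like this, with the counter name only a placeholder for whatever
the vmstat patch ends up calling it:

	/* madvise(MADV_FREE) path, one site: */
	ClearPageSwapBacked(page);
	inc_node_page_state(page, NR_LAZYFREE);	/* placeholder name */

	/* reclaim rescue path, the other site: */
	SetPageSwapBacked(page);
	dec_node_page_state(page, NR_LAZYFREE);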


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17 16:01       ` Johannes Weiner
@ 2017-02-17 18:43         ` Shaohua Li
  2017-02-17 20:03           ` Johannes Weiner
  0 siblings, 1 reply; 23+ messages in thread
From: Shaohua Li @ 2017-02-17 18:43 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Fri, Feb 17, 2017 at 11:01:54AM -0500, Johannes Weiner wrote:
> On Thu, Feb 16, 2017 at 04:27:18PM -0800, Shaohua Li wrote:
> > On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > > @@ -911,7 +911,7 @@ static void page_check_dirty_writeback(struct page *page,
> > > >  	 * Anonymous pages are not handled by flushers and must be written
> > > >  	 * from reclaim context. Do not stall reclaim based on them
> > > >  	 */
> > > > -	if (!page_is_file_cache(page)) {
> > > > +	if (!page_is_file_cache(page) || page_is_lazyfree(page)) {
> > > 
> > > Do we need this? MADV_FREE clears the dirty bit off the page; we could
> > > just let them go through with the function without any special-casing.
> > 
> > this is just to zero the dirty and writeback flags
> 
> Okay, I assumed that the page would always be !dirty && !writeback
> here anyway, so we might as well fall through and check those bits.
> 
> But a previously failed TTU might have moved a pte dirty bit to the
> page, so yes, we do need to filter for anon && !swapbacked here.
> 
> > > > @@ -1142,7 +1144,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  		 * The page is mapped into the page tables of one or more
> > > >  		 * processes. Try to unmap it here.
> > > >  		 */
> > > > -		if (page_mapped(page) && mapping) {
> > > > +		if (page_mapped(page) && (mapping || lazyfree)) {
> > > 
> > > Do we actually need to filter for mapping || lazyfree? If we fail to
> > > allocate swap, we don't reach here. If the page is a truncated file
> > > page, ttu returns pretty much instantly with SWAP_AGAIN. We should be
> > > able to just check for page_mapped() alone, no?
> > 
> > checking the mapping is faster than running into try_to_unmap, right?
> 
> !mapping should be a rare case. In reclaim code, I think it's better
> to keep it simple than to optimize away the rare function call.

ok 
> > > > @@ -1154,7 +1156,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  			case SWAP_MLOCK:
> > > >  				goto cull_mlocked;
> > > >  			case SWAP_LZFREE:
> > > > -				goto lazyfree;
> > > > +				/* follow __remove_mapping for reference */
> > > > +				if (page_ref_freeze(page, 1)) {
> > > > +					if (!PageDirty(page))
> > > > +						goto lazyfree;
> > > > +					else
> > > > +						page_ref_unfreeze(page, 1);
> > > > +				}
> > > > +				goto keep_locked;
> > > >  			case SWAP_SUCCESS:
> > > >  				; /* try to free the page below */
> > > 
> > > This is a similar situation.
> > > 
> > > Can we let the page go through the regular __remove_mapping() process
> > > and simply have that function check for PageAnon && !PageSwapBacked?
> > 
> > That will make the code more complicated. We don't call __remove_mapping if
> > !mapping. And we would need to add bypasses in __remove_mapping, for example,
> > to avoid taking mapping->lock.
> 
> True, we won't get around a separate freeing path as long as the
> refcount handling is intertwined with the mapping removal like that :/
> 
> What we should be able to do, however, is remove at least SWAP_LZFREE
> and stick with SWAP_SUCCESS. On success, we can fall through up until
> we do the __remove_mapping call. The page isn't dirty, so we skip that
> PageDirty block; the page doesn't have private data, so we skip that
> block too. And then we can branch on PageAnon && !PageSwapBacked that
> does our alternate freeing path or __remove_mapping for others.

Sounds good 
> > > > @@ -1294,6 +1302,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  cull_mlocked:
> > > >  		if (PageSwapCache(page))
> > > >  			try_to_free_swap(page);
> > > > +		if (lazyfree)
> > > > +			clear_page_lazyfree(page);
> > > 
> > > Why cancel the MADV_FREE state? The combination seems non-sensical,
> > > but we can simply retain the invalidated state while the page goes to
> > > the unevictable list; munlock should move it back to inactive_file.
> > 
> > This depends on the policy. If the user locks the page, I think it's
> > reasonable to assume the page is hot, so it doesn't make sense to treat the
> > page as lazyfree.
> 
> I think the key issue is whether the page contains valid data, not
> whether it is hot. When we clear the dirty bits along with
> PageSwapBacked, we're declaring the data in the page invalid. There is
> no practical usecase to mlock a page with invalid data, sure, but the
> act of mlocking a page doesn't make its contents suddenly valid again.
> 
> I.e. I'd stick with the pure data integrity perspective here. That's
> clearer and less error prone than intermingling it with eviction
> policy, to avoid accidents where we lose valid data.

The problem is that if the user mlocks the page, it's very likely the user
will dirty the page soon. After the page is munlocked, putting it back as
lazyfree means page reclaim will waste time trying to reclaim the page
because it's dirty. That said, I won't argue about this. It's a rare case,
so either way is OK with me.

> > > >  		unlock_page(page);
> > > >  		list_add(&page->lru, &ret_pages);
> > > >  		continue;
> > > > @@ -1303,6 +1313,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
> > > >  			try_to_free_swap(page);
> > > >  		VM_BUG_ON_PAGE(PageActive(page), page);
> > > > +		if (lazyfree)
> > > > +			clear_page_lazyfree(page);
> > > 
> > > This is similar too.
> > > 
> > > Can we simply leave the page alone here? The only way we get to
> > > this point is if somebody is reading the invalidated page. It's weird
> > > for a lazyfreed page to become active, but it doesn't seem to warrant
> > > active intervention here.
> > 
> > So the unmap fails here probably because the page is dirty, which means the
> > page was written recently. It makes sense to assume the page is hot.
> 
> Ah, good point.
> 
> But can we handle that explicitly please? Like above, I don't want to
> undo the data invalidation just because somebody read the invalid data
> a bunch of times and it has the access bits set. We should only re-set
> the PageSwapBacked based on whether the page is actually dirty.
> 
> Maybe along the lines of SWAP_MLOCK we could add SWAP_DIRTY when TTU
> fails because the page is dirty, and then have a cull_dirty: label in
> shrink_page_list handle the lazy rescue of a reused MADV_FREE page?
> 
> This should work well with removing the mapping || lazyfree check when
> calling TTU. Then TTU can fail on dirty && !mapping, which is a much
> more obvious way of expressing it IMO - "This page contains valid data
> but there is no mapping that backs it once we unmap it. Abort."
> 
> That's mostly why I'm in favor of removing the idea of a "lazyfree"
> page as much as possible. IMO this whole thing becomes much more
> understandable - and less bolted on to the side of the VM - when we
> express it in existing concepts the VM uses for data integrity.

Ok, it makes sense to only reset the PageSwapBacked bit for a dirty page. In
this way, we reset PageSwapBacked and jump to activate_locked for SWAP_DIRTY
|| (SWAP_FAIL && pagelazyfree), and just jump to activate_locked for
SWAP_FAIL && !pagelazyfree. Is this what you want to do? This will add extra
checks for SWAP_FAIL. I'm not sure if this is really worth it, because it's
rare that a MADV_FREE page is read.

Thanks,
Shaohua


* Re: [PATCH V3 3/7] mm: reclaim MADV_FREE pages
  2017-02-17 18:43         ` Shaohua Li
@ 2017-02-17 20:03           ` Johannes Weiner
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Weiner @ 2017-02-17 20:03 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Fri, Feb 17, 2017 at 10:43:41AM -0800, Shaohua Li wrote:
> On Fri, Feb 17, 2017 at 11:01:54AM -0500, Johannes Weiner wrote:
> > On Thu, Feb 16, 2017 at 04:27:18PM -0800, Shaohua Li wrote:
> > > On Thu, Feb 16, 2017 at 01:40:18PM -0500, Johannes Weiner wrote:
> > > > On Tue, Feb 14, 2017 at 11:36:09AM -0800, Shaohua Li wrote:
> > > > >  		unlock_page(page);
> > > > >  		list_add(&page->lru, &ret_pages);
> > > > >  		continue;
> > > > > @@ -1303,6 +1313,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > > >  		if (PageSwapCache(page) && mem_cgroup_swap_full(page))
> > > > >  			try_to_free_swap(page);
> > > > >  		VM_BUG_ON_PAGE(PageActive(page), page);
> > > > > +		if (lazyfree)
> > > > > +			clear_page_lazyfree(page);
> > > > 
> > > > Can we simply leave the page alone here? The only way we get to
> > > > this point is if somebody is reading the invalidated page. It's weird
> > > > for a lazyfreed page to become active, but it doesn't seem to warrant
> > > > active intervention here.
> > > 
> > > So the unmap fails here probably because the page is dirty, which means the
> > > page is written recently. It makes sense to assume the page is hot.
> > 
> > Ah, good point.
> > 
> > But can we handle that explicitly please? Like above, I don't want to
> > undo the data invalidation just because somebody read the invalid data
> > a bunch of times and it has the access bits set. We should only re-set
> > the PageSwapBacked based on whether the page is actually dirty.
> > 
> > Maybe along the lines of SWAP_MLOCK we could add SWAP_DIRTY when TTU
> > fails because the page is dirty, and then have a cull_dirty: label in
> > shrink_page_list handle the lazy rescue of a reused MADV_FREE page?
> > 
> > This should work well with removing the mapping || lazyfree check when
> > calling TTU. Then TTU can fail on dirty && !mapping, which is a much
> > more obvious way of expressing it IMO - "This page contains valid data
> > but there is no mapping that backs it once we unmap it. Abort."
> > 
> > That's mostly why I'm in favor of removing the idea of a "lazyfree"
> > page as much as possible. IMO this whole thing becomes much more
> > understandable - and less bolted on to the side of the VM - when we
> > express it in existing concepts the VM uses for data integrity.
> 
> Ok, it makes sense to only reset the PageSwapBacked bit for a dirty page. In
> this way, we reset PageSwapBacked and jump to activate_locked for SWAP_DIRTY
> || (SWAP_FAIL && pagelazyfree), and just jump to activate_locked for
> SWAP_FAIL && !pagelazyfree. Is this what you want to do? This will add extra
> checks for SWAP_FAIL. I'm not sure if this is really worth it, because it's
> rare that a MADV_FREE page is read.

Yes, for SWAP_DIRTY jump to activate_locked or have its own label that
sets PG_swapbacked again and moves the page back to the proper LRU.

SWAP_FAIL of an anon && !swapbacked && !dirty && referenced page can
be ignored IMO. This happens only when the user is reading invalid
data over and over; I see no reason to optimize for that. We activate
a MADV_FREE page, which is weird, but not a correctness issue, right?

Just to clarify, right now we have this:

---

SWAP_FAIL (failure on pte, swap, lazyfree):
  if pagelazyfree:
    clear pagelazyfree
  activate

SWAP_SUCCESS:
  regular reclaim

SWAP_LZFREE (success on lazyfree when page and ptes are all clean):
  free page

---

What I'm proposing is to separate lazyfree failure out from SWAP_FAIL
into its own branch. Then merge lazyfree success into SWAP_SUCCESS:

---

SWAP_FAIL (failure on pte, swap):
  activate

SWAP_SUCCESS:
  if anon && !swapbacked:
    free manually
  else:
    __remove_mapping()

SWAP_DIRTY (anon && !swapbacked && dirty):
  set swapbacked
  putback/activate

---

This way we have a mostly unified success path (we might later be able
to refactor __remove_mapping to split refcounting from mapping stuff
to remove the last trace of difference), and SWAP_DIRTY follows the
same type of delayed LRU fixup as we do for SWAP_MLOCK right now.
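
In shrink_page_list() terms, the switch might then look like this (sketch
only, reusing the existing labels):

	switch (ret = try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH)) {
	case SWAP_FAIL:
		goto activate_locked;
	case SWAP_AGAIN:
		goto keep_locked;
	case SWAP_MLOCK:
		goto cull_mlocked;
	case SWAP_DIRTY:
		/* reused MADV_FREE page: make it swap-backed again */
		SetPageSwapBacked(page);
		goto activate_locked;
	case SWAP_SUCCESS:
		; /* try to free the page below */
	}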
