* [PATCH V5 0/6] mm: fix some MADV_FREE issues
@ 2017-02-24 21:31 Shaohua Li
  2017-02-24 21:31 ` [PATCH V5 1/6] mm: delete unnecessary TTU_* flags Shaohua Li
                   ` (5 more replies)
  0 siblings, 6 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Hi,

We are trying to use MADV_FREE in jemalloc, and several issues were found.
Without solving these issues, jemalloc can't use the MADV_FREE feature
(the basic usage pattern is sketched after this list).
- It doesn't support systems without swap enabled, because if swap is off
  we can't (or can't efficiently) age anonymous pages. And since MADV_FREE
  pages are mixed with other anonymous pages, we can't reclaim MADV_FREE
  pages alone. In the current implementation, MADV_FREE falls back to
  MADV_DONTNEED when swap is not enabled. But in our environment a lot of
  machines don't enable swap, which prevents our setup from using
  MADV_FREE.
- It increases memory pressure. Page reclaim is biased toward reclaiming
  file pages rather than anonymous pages. That bias doesn't make sense for
  MADV_FREE pages, because those pages can be freed easily and refilled
  with a very slight penalty. Even if page reclaim didn't favor file
  pages, there would still be an issue, because MADV_FREE pages and other
  anonymous pages are mixed together: to reclaim a MADV_FREE page, we
  probably must scan a lot of other anonymous pages, which is inefficient.
  In our tests, we usually see OOM with MADV_FREE enabled and nothing
  without it.
- Accounting. There are two accounting problems. We don't have global
  accounting, so if the system behaves abnormally we don't know whether
  the problem comes from the MADV_FREE side. The other problem is RSS
  accounting: MADV_FREE pages are accounted as normal anon pages and
  reclaimed lazily, so an application's RSS becomes bigger. This confuses
  our workloads. We have a monitoring daemon running, and if it finds an
  application's RSS has become abnormal, the daemon will kill the
  application even though the kernel could reclaim the memory easily.
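
For reference, the allocator-side usage pattern mentioned above is just a
plain madvise() call (a minimal userspace sketch; the actual jemalloc
integration is not shown):

	#include <stddef.h>
	#include <stdio.h>
	#include <sys/mman.h>

	/*
	 * Hint that [addr, addr + len) holds data the allocator no longer
	 * needs. The kernel may reclaim the pages under memory pressure;
	 * until it does, reads still return the old data, and a write
	 * cancels the hint for the pages touched.
	 */
	static void free_lazily(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_FREE) != 0)
			perror("madvise(MADV_FREE)");
	}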

To address the first two issues, we can either put MADV_FREE pages into a
separate LRU list (Minchan's previous patches and the V1 patches), or put
them into the LRU_INACTIVE_FILE list (suggested by Johannes). This
patchset uses the second idea, because the LRU_INACTIVE_FILE list is tiny
nowadays and should be full of used-once file pages, so we can still
efficiently reclaim MADV_FREE pages there without interfering with other
anonymous and active file pages. Putting the pages into the inactive file
list also has the advantage of letting page reclaim prioritize MADV_FREE
pages together with used-once file pages. MADV_FREE pages are put into
that LRU list with the SwapBacked flag cleared, so PageAnon(page) &&
!PageSwapBacked(page) indicates a MADV_FREE page. Such pages are freed
directly without pageout if they are clean; otherwise the normal swap
path reclaims them.
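
With that invariant, a lazyfree page can be identified from flag state
alone. A minimal sketch (the helper name is hypothetical; the series
open-codes this test at each call site):

	#include <linux/mm.h>

	/* True for anonymous pages whose contents the application has
	 * marked disposable via MADV_FREE. */
	static inline bool page_is_lazyfree(struct page *page)
	{
		return PageAnon(page) && !PageSwapBacked(page);
	}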

For the third issue, the previous post added global accounting and a
separate RSS counter for MADV_FREE pages. The problem is that we can
never get accurate accounting for MADV_FREE pages: the pages are mapped
to userspace and can be dirtied without the kernel noticing. To get
accurate accounting we could write-protect the pages, but then there is
extra page fault overhead, which people don't want to pay. The jemalloc
developers have concerns about the inaccurate accounting, so this post
drops the accounting patches temporarily. The info exported to
/proc/pid/smaps for MADV_FREE pages is kept, which is the only place we
can get accurate accounting right now.

Thanks,
Shaohua

V4->V5:
- Fix some minor issues pointed out by Johannes
- Integrate Johannes's cleanup patch
- Fix a bug so a swapin page is not discarded silently

V3->V4:
- rebase to latest -mm tree
- Address several issues pointed out by Johannes and Minchan
- Dropped vmstat and RSS accounting
http://marc.info/?l=linux-mm&m=148778961127710&w=2

V2->V3:
- rebase to latest -mm tree
- Address several issues pointed out by Minchan
- Add more descriptions
http://marc.info/?l=linux-mm&m=148710098701674&w=2

V1->V2:
- Put MADV_FREE pages into LRU_INACTIVE_FILE list instead of adding a new lru
  list, suggested by Johannes
- Add RSS support
http://marc.info/?l=linux-mm&m=148616481928054&w=2

Minchan's previous patches:
http://marc.info/?l=linux-mm&m=144800657002763&w=2

----------------------
Shaohua Li (6):
  mm: delete unnecessary TTU_* flags
  mm: don't assume anonymous pages have SwapBacked flag
  mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  mm: reclaim MADV_FREE pages
  mm: enable MADV_FREE for swapless system
  proc: show MADV_FREE pages info in smaps

 Documentation/filesystems/proc.txt |  4 +++
 fs/proc/task_mmu.c                 |  8 +++++-
 include/linux/rmap.h               | 24 ++++++++----------
 include/linux/swap.h               |  2 +-
 include/linux/vm_event_item.h      |  2 +-
 mm/huge_memory.c                   |  6 ++---
 mm/khugepaged.c                    |  8 +++---
 mm/madvise.c                       | 11 ++-------
 mm/memory-failure.c                |  2 +-
 mm/migrate.c                       |  3 ++-
 mm/rmap.c                          | 43 +++++++++++++++-----------------
 mm/swap.c                          | 50 +++++++++++++++++++++-----------------
 mm/vmscan.c                        | 45 +++++++++++++++++++---------------
 mm/vmstat.c                        |  1 +
 14 files changed, 107 insertions(+), 102 deletions(-)

-- 
2.9.3


* [PATCH V5 1/6] mm: delete unnecessary TTU_* flags
  2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
@ 2017-02-24 21:31 ` Shaohua Li
  2017-02-27 13:48   ` Michal Hocko
  2017-02-24 21:31 ` [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Johannes pointed out that TTU_LZFREE is unnecessary. That's true because
the flag is always set whenever we want to do an unmap; in the cases
where we don't unmap, the TTU_LZFREE part of the code should never run.

TTU_UNMAP is also unnecessary: if no other mode flag is set (for example,
TTU_MIGRATION), an unmap is implied.

The patch includes Johannes's cleanup and removes the dead TTU_ACTION
macro.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/rmap.h | 22 +++++++++-------------
 mm/memory-failure.c  |  2 +-
 mm/rmap.c            |  2 +-
 mm/vmscan.c          | 11 ++++-------
 4 files changed, 15 insertions(+), 22 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8c89e90..7a39414 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -83,19 +83,17 @@ struct anon_vma_chain {
 };
 
 enum ttu_flags {
-	TTU_UNMAP = 1,			/* unmap mode */
-	TTU_MIGRATION = 2,		/* migration mode */
-	TTU_MUNLOCK = 4,		/* munlock mode */
-	TTU_LZFREE = 8,			/* lazy free mode */
-	TTU_SPLIT_HUGE_PMD = 16,	/* split huge PMD if any */
-
-	TTU_IGNORE_MLOCK = (1 << 8),	/* ignore mlock */
-	TTU_IGNORE_ACCESS = (1 << 9),	/* don't age */
-	TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
-	TTU_BATCH_FLUSH = (1 << 11),	/* Batch TLB flushes where possible
+	TTU_MIGRATION		= 0x1,	/* migration mode */
+	TTU_MUNLOCK		= 0x2,	/* munlock mode */
+
+	TTU_SPLIT_HUGE_PMD	= 0x4,	/* split huge PMD if any */
+	TTU_IGNORE_MLOCK	= 0x8,	/* ignore mlock */
+	TTU_IGNORE_ACCESS	= 0x10,	/* don't age */
+	TTU_IGNORE_HWPOISON	= 0x20,	/* corrupted page is recoverable */
+	TTU_BATCH_FLUSH		= 0x40,	/* Batch TLB flushes where possible
 					 * and caller guarantees they will
 					 * do a final flush if necessary */
-	TTU_RMAP_LOCKED = (1 << 12)	/* do not grab rmap lock:
+	TTU_RMAP_LOCKED		= 0x80	/* do not grab rmap lock:
 					 * caller holds it */
 };
 
@@ -193,8 +191,6 @@ static inline void page_dup_rmap(struct page *page, bool compound)
 int page_referenced(struct page *, int is_locked,
 			struct mem_cgroup *memcg, unsigned long *vm_flags);
 
-#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
-
 int try_to_unmap(struct page *, enum ttu_flags flags);
 
 /* Avoid racy checks */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3d0f2fd..b78d080 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -906,7 +906,7 @@ EXPORT_SYMBOL_GPL(get_hwpoison_page);
 static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 				  int trapno, int flags, struct page **hpagep)
 {
-	enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
+	enum ttu_flags ttu = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	int ret;
diff --git a/mm/rmap.c b/mm/rmap.c
index 8774791..96eb85c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1418,7 +1418,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 */
 			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 
-			if (!PageDirty(page) && (flags & TTU_LZFREE)) {
+			if (!PageDirty(page)) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				rp->lazyfreed++;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b40..68ea50d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -971,7 +971,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
 		bool dirty, writeback;
-		bool lazyfree = false;
 		int ret = SWAP_SUCCESS;
 
 		cond_resched();
@@ -1125,7 +1124,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
-			lazyfree = true;
 			may_enter_fs = 1;
 
 			/* Adding to swap updated mapping */
@@ -1143,9 +1141,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page) && mapping) {
-			switch (ret = try_to_unmap(page, lazyfree ?
-				(ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) :
-				(ttu_flags | TTU_BATCH_FLUSH))) {
+			switch (ret = try_to_unmap(page,
+				ttu_flags | TTU_BATCH_FLUSH)) {
 			case SWAP_FAIL:
 				nr_unmap_fail++;
 				goto activate_locked;
@@ -1353,7 +1350,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	}
 
 	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
-			TTU_UNMAP|TTU_IGNORE_ACCESS, NULL, true);
+			TTU_IGNORE_ACCESS, NULL, true);
 	list_splice(&clean_pages, page_list);
 	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
@@ -1760,7 +1757,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, TTU_UNMAP,
+	nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
 				&stat, false);
 
 	spin_lock_irq(&pgdat->lru_lock);
-- 
2.9.3


* [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
  2017-02-24 21:31 ` [PATCH V5 1/6] mm: delete unnecessary TTU_* flags Shaohua Li
@ 2017-02-24 21:31 ` Shaohua Li
  2017-02-27  6:48   ` Hillf Danton
  2017-02-27 14:35   ` Michal Hocko
  2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

There are a few places where the code assumes anonymous pages have the
SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked
flag for them. The assumption doesn't hold any more, so fix those places.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/huge_memory.c | 1 -
 mm/khugepaged.c  | 8 +++-----
 mm/migrate.c     | 3 ++-
 mm/rmap.c        | 3 ++-
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7dda8d6..cf9fb46 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2361,7 +2361,6 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 
 	VM_BUG_ON_PAGE(is_huge_zero_page(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 
 	if (PageAnon(head)) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c..a4b499f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -481,8 +481,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 static void release_pte_page(struct page *page)
 {
-	/* 0 stands for page_is_file_cache(page) == false */
-	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -530,7 +529,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		VM_BUG_ON_PAGE(!PageAnon(page), page);
-		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
 		/*
 		 * We can do it before isolate_lru_page because the
@@ -577,8 +575,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
 		}
-		/* 0 stands for page_is_file_cache(page) == false */
-		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_node_page_state(page,
+				NR_ISOLATED_ANON + page_is_file_cache(page));
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 2c63ac0..7c8df1f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1943,7 +1943,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 	/* Prepare a page as a migration target */
 	__SetPageLocked(new_page);
-	__SetPageSwapBacked(new_page);
+	if (PageSwapBacked(page))
+		__SetPageSwapBacked(new_page);
 
 	/* anon mapping, we can simply copy page->mapping to the new page: */
 	new_page->mapping = page->mapping;
diff --git a/mm/rmap.c b/mm/rmap.c
index 96eb85c..c621088 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1416,7 +1416,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * Store the swap location in the pte.
 			 * See handle_pte_fault() ...
 			 */
-			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
+			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
+				page);
 
 			if (!PageDirty(page)) {
 				/* It's a freeable page by MADV_FREE */
-- 
2.9.3


* [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
  2017-02-24 21:31 ` [PATCH V5 1/6] mm: delete unnecessary TTU_* flags Shaohua Li
  2017-02-24 21:31 ` [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
@ 2017-02-24 21:31 ` Shaohua Li
  2017-02-27  6:28   ` Minchan Kim
                     ` (3 more replies)
  2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

madvise(MADV_FREE) indicates that pages are 'lazyfree'. They are still
anonymous pages, but they can be freed without pageout. To distinguish
them from normal anonymous pages, we clear their SwapBacked flag.

MADV_FREE pages can be freed without pageout, so they behave pretty much
like used-once file pages. For such pages, we'd like to reclaim them once
there is memory pressure. It might be unfair to always reclaim MADV_FREE
pages before used-once file pages, but we definitely want to reclaim them
before other anonymous and file pages.

To speed up MADV_FREE page reclaim, we put the pages into the
LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list
is tiny nowadays and should be full of used-once file pages, so
reclaiming MADV_FREE pages will not interfere much with anonymous and
active file pages. And since the inactive file pages and MADV_FREE pages
are reclaimed according to their age, we don't reclaim too many MADV_FREE
pages either. Putting the MADV_FREE pages into the LRU_INACTIVE_FILE list
also means we can reclaim the pages without swap support. This idea was
suggested by Johannes.

This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
yet, to avoid a bisect failure; the next patch will do it.

The patch is based on Minchan's original patch.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/swap.h          |  2 +-
 include/linux/vm_event_item.h |  2 +-
 mm/huge_memory.c              |  3 ---
 mm/madvise.c                  |  2 --
 mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
 mm/vmstat.c                   |  1 +
 6 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 45e91dd..486494e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 6aa1b6c..94e58da 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		FOR_ALL_ZONES(PGALLOC),
 		FOR_ALL_ZONES(ALLOCSTALL),
 		FOR_ALL_ZONES(PGSCAN_SKIP),
-		PGFREE, PGACTIVATE, PGDEACTIVATE,
+		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
 		PGFAULT, PGMAJFAULT,
 		PGLAZYFREED,
 		PGREFILL,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf9fb46..3b7ee0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1562,9 +1562,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		ClearPageDirty(page);
 	unlock_page(page);
 
-	if (PageActive(page))
-		deactivate_page(page);
-
 	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
 		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
 			tlb->fullmm);
diff --git a/mm/madvise.c b/mm/madvise.c
index dc5927c..61e10b1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -411,8 +411,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			ptent = pte_mkold(ptent);
 			ptent = pte_mkclean(ptent);
 			set_pte_at(mm, addr, pte, ptent);
-			if (PageActive(page))
-				deactivate_page(page);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 	}
diff --git a/mm/swap.c b/mm/swap.c
index c4910f1..c4fb4b9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,7 +46,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
 #endif
@@ -561,20 +561,26 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 }
 
 
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
 			    void *arg)
 {
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		int file = page_is_file_cache(page);
-		int lru = page_lru_base_type(page);
+	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	    !PageUnevictable(page)) {
+		bool active = PageActive(page);
 
-		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
 		ClearPageActive(page);
 		ClearPageReferenced(page);
-		add_page_to_lru_list(page, lruvec, lru);
+		/*
+		 * lazyfree pages are clean anonymous pages. They have
+		 * SwapBacked flag cleared to distinguish normal anonymous
+		 * pages
+		 */
+		ClearPageSwapBacked(page);
+		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
 
-		__count_vm_event(PGDEACTIVATE);
-		update_page_reclaim_stat(lruvec, file, 0);
+		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
+		update_page_reclaim_stat(lruvec, 1, 0);
 	}
 }
 
@@ -604,9 +610,9 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
-	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
 
 	activate_page_drain(cpu);
 }
@@ -638,22 +644,22 @@ void deactivate_file_page(struct page *page)
 }
 
 /**
- * deactivate_page - deactivate a page
+ * mark_page_lazyfree - make an anon page lazyfree
  * @page: page to deactivate
  *
- * deactivate_page() moves @page to the inactive list if @page was on the active
- * list and was not an unevictable page.  This is done to accelerate the reclaim
- * of @page.
+ * mark_page_lazyfree() moves @page to the inactive file list.
+ * This is done to accelerate the reclaim of @page.
  */
-void deactivate_page(struct page *page)
-{
-	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
-		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+void mark_page_lazyfree(struct page *page)
+ {
+	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
+	    !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
 
 		get_page(page);
 		if (!pagevec_add(pvec, page) || PageCompound(page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
-		put_cpu_var(lru_deactivate_pvecs);
+			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+		put_cpu_var(lru_lazyfree_pvecs);
 	}
 }
 
@@ -704,7 +710,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			queue_work_on(cpu, lru_add_drain_wq, work);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 69f9aff..7774196 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
 	"pgfree",
 	"pgactivate",
 	"pgdeactivate",
+	"pglazyfree",
 
 	"pgfault",
 	"pgmajfault",
-- 
2.9.3


* [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
                   ` (2 preceding siblings ...)
  2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
@ 2017-02-24 21:31 ` Shaohua Li
  2017-02-27  6:33   ` Minchan Kim
                     ` (3 more replies)
  2017-02-24 21:31 ` [PATCH V5 5/6] mm: enable MADV_FREE for swapless system Shaohua Li
  2017-02-24 21:31 ` [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps Shaohua Li
  5 siblings, 4 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

When memory pressure is high, we free MADV_FREE pages. If the pages are
not dirty in the pte, they can be freed immediately; otherwise we can't
reclaim them that way. In that case we put the pages back on the
anonymous LRU list (by setting the SwapBacked flag) and the pages will be
reclaimed via the normal swapout path.

We use the normal page reclaim policy. Since MADV_FREE pages are put into
the inactive file list, such pages and inactive file pages are reclaimed
according to their age. This is expected, because we don't want to
reclaim too many MADV_FREE pages before used-once file pages. A condensed
sketch of the resulting policy follows.
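
As a condensed, non-authoritative sketch of the policy (simplified from
the try_to_unmap_one() and shrink_page_list() hunks below; the helper
name is made up for illustration):

	#include <linux/mm.h>

	static bool lazyfree_can_discard(struct page *page)
	{
		/*
		 * A clean lazyfree page that is not in the swap cache can
		 * be discarded outright; nothing references its contents.
		 */
		if (PageAnon(page) && !PageSwapBacked(page) &&
		    !PageDirty(page) && !PageSwapCache(page))
			return true;
		/*
		 * A redirtied lazyfree page cannot be discarded:
		 * try_to_unmap() returns SWAP_DIRTY and the caller does
		 * SetPageSwapBacked() to move the page back to the
		 * anonymous LRU, so normal swapout reclaims it later.
		 */
		return false;
	}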

Based on Minchan's original patch.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 include/linux/rmap.h |  2 +-
 mm/huge_memory.c     |  2 ++
 mm/madvise.c         |  1 +
 mm/rmap.c            | 40 +++++++++++++++++-----------------------
 mm/vmscan.c          | 34 ++++++++++++++++++++++------------
 5 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 7a39414..fee10d7 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
 #define SWAP_AGAIN	1
 #define SWAP_FAIL	2
 #define SWAP_MLOCK	3
-#define SWAP_LZFREE	4
+#define SWAP_DIRTY	4
 
 #endif	/* _LINUX_RMAP_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3b7ee0c..4c7454b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1571,6 +1571,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		set_pmd_at(mm, addr, pmd, orig_pmd);
 		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
 	}
+
+	mark_page_lazyfree(page);
 	ret = true;
 out:
 	spin_unlock(ptl);
diff --git a/mm/madvise.c b/mm/madvise.c
index 61e10b1..225af7d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -413,6 +413,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			set_pte_at(mm, addr, pte, ptent);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
+		mark_page_lazyfree(page);
 	}
 out:
 	if (nr_swap) {
diff --git a/mm/rmap.c b/mm/rmap.c
index c621088..bb45712 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1281,11 +1281,6 @@ void page_remove_rmap(struct page *page, bool compound)
 	 */
 }
 
-struct rmap_private {
-	enum ttu_flags flags;
-	int lazyfreed;
-};
-
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1301,8 +1296,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	struct page *subpage;
 	int ret = SWAP_AGAIN;
-	struct rmap_private *rp = arg;
-	enum ttu_flags flags = rp->flags;
+	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
@@ -1419,11 +1413,21 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
 				page);
 
-			if (!PageDirty(page)) {
+			/*
+			 * swapin page could be clean, it has data stored in
+			 * swap. We can't silently discard it without setting
+			 * swap entry in the page table.
+			 */
+			if (!PageDirty(page) && !PageSwapCache(page)) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
-				rp->lazyfreed++;
 				goto discard;
+			} else if (!PageSwapBacked(page)) {
+				/* dirty MADV_FREE page */
+				set_pte_at(mm, address, pvmw.pte, pteval);
+				ret = SWAP_DIRTY;
+				page_vma_mapped_walk_done(&pvmw);
+				break;
 			}
 
 			if (swap_duplicate(entry) < 0) {
@@ -1491,18 +1495,15 @@ static int page_mapcount_is_zero(struct page *page)
  * SWAP_AGAIN	- we missed a mapping, try again later
  * SWAP_FAIL	- the page is unswappable
  * SWAP_MLOCK	- page is mlocked.
+ * SWAP_DIRTY	- page is dirty MADV_FREE page
  */
 int try_to_unmap(struct page *page, enum ttu_flags flags)
 {
 	int ret;
-	struct rmap_private rp = {
-		.flags = flags,
-		.lazyfreed = 0,
-	};
 
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
-		.arg = &rp,
+		.arg = (void *)flags,
 		.done = page_mapcount_is_zero,
 		.anon_lock = page_lock_anon_vma_read,
 	};
@@ -1523,11 +1524,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
 	else
 		ret = rmap_walk(page, &rwc);
 
-	if (ret != SWAP_MLOCK && !page_mapcount(page)) {
+	if (ret != SWAP_MLOCK && !page_mapcount(page))
 		ret = SWAP_SUCCESS;
-		if (rp.lazyfreed && !PageDirty(page))
-			ret = SWAP_LZFREE;
-	}
 	return ret;
 }
 
@@ -1554,14 +1552,10 @@ static int page_not_mapped(struct page *page)
 int try_to_munlock(struct page *page)
 {
 	int ret;
-	struct rmap_private rp = {
-		.flags = TTU_MUNLOCK,
-		.lazyfreed = 0,
-	};
 
 	struct rmap_walk_control rwc = {
 		.rmap_one = try_to_unmap_one,
-		.arg = &rp,
+		.arg = (void *)TTU_MUNLOCK,
 		.done = page_not_mapped,
 		.anon_lock = page_lock_anon_vma_read,
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 68ea50d..16ad821 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -911,7 +911,8 @@ static void page_check_dirty_writeback(struct page *page,
 	 * Anonymous pages are not handled by flushers and must be written
 	 * from reclaim context. Do not stall reclaim based on them
 	 */
-	if (!page_is_file_cache(page)) {
+	if (!page_is_file_cache(page) ||
+	    (PageAnon(page) && !PageSwapBacked(page))) {
 		*dirty = false;
 		*writeback = false;
 		return;
@@ -992,7 +993,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep_locked;
 
 		/* Double the slab pressure for mapped and swapcache pages */
-		if (page_mapped(page) || PageSwapCache(page))
+		if ((page_mapped(page) || PageSwapCache(page)) &&
+		    !(PageAnon(page) && !PageSwapBacked(page)))
 			sc->nr_scanned++;
 
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
@@ -1118,8 +1120,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
+		 * Lazyfree page could be freed directly
 		 */
-		if (PageAnon(page) && !PageSwapCache(page)) {
+		if (PageAnon(page) && PageSwapBacked(page) &&
+		    !PageSwapCache(page)) {
 			if (!(sc->gfp_mask & __GFP_IO))
 				goto keep_locked;
 			if (!add_to_swap(page, page_list))
@@ -1140,9 +1144,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * The page is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
 		 */
-		if (page_mapped(page) && mapping) {
+		if (page_mapped(page)) {
 			switch (ret = try_to_unmap(page,
 				ttu_flags | TTU_BATCH_FLUSH)) {
+			case SWAP_DIRTY:
+				SetPageSwapBacked(page);
+				/* fall through */
 			case SWAP_FAIL:
 				nr_unmap_fail++;
 				goto activate_locked;
@@ -1150,8 +1157,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 			case SWAP_MLOCK:
 				goto cull_mlocked;
-			case SWAP_LZFREE:
-				goto lazyfree;
 			case SWAP_SUCCESS:
 				; /* try to free the page below */
 			}
@@ -1263,10 +1268,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-lazyfree:
-		if (!mapping || !__remove_mapping(mapping, page, true))
-			goto keep_locked;
+		if (PageAnon(page) && !PageSwapBacked(page)) {
+			/* follow __remove_mapping for reference */
+			if (!page_ref_freeze(page, 1))
+				goto keep_locked;
+			if (PageDirty(page)) {
+				page_ref_unfreeze(page, 1);
+				goto keep_locked;
+			}
 
+			count_vm_event(PGLAZYFREED);
+		} else if (!mapping || !__remove_mapping(mapping, page, true))
+			goto keep_locked;
 		/*
 		 * At this point, we have no other references and there is
 		 * no way to pick any more up (removed from LRU, removed
@@ -1276,9 +1289,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 */
 		__ClearPageLocked(page);
 free_it:
-		if (ret == SWAP_LZFREE)
-			count_vm_event(PGLAZYFREED);
-
 		nr_reclaimed++;
 
 		/*
-- 
2.9.3


* [PATCH V5 5/6] mm: enable MADV_FREE for swapless system
  2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
                   ` (3 preceding siblings ...)
  2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
@ 2017-02-24 21:31 ` Shaohua Li
  2017-02-27 15:06   ` Michal Hocko
                     ` (2 more replies)
  2017-02-24 21:31 ` [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps Shaohua Li
  5 siblings, 3 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Now that MADV_FREE pages can be easily reclaimed even on a swapless
system, we can safely enable MADV_FREE for all systems.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 mm/madvise.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 225af7d..5ab4b7b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -612,13 +612,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_WILLNEED:
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_FREE:
-		/*
-		 * XXX: In this implementation, MADV_FREE works like
-		 * MADV_DONTNEED on swapless system or full swap.
-		 */
-		if (get_nr_swap_pages() > 0)
-			return madvise_free(vma, prev, start, end);
-		/* passthrough */
+		return madvise_free(vma, prev, start, end);
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
 	default:
-- 
2.9.3


* [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
                   ` (4 preceding siblings ...)
  2017-02-24 21:31 ` [PATCH V5 5/6] mm: enable MADV_FREE for swapless system Shaohua Li
@ 2017-02-24 21:31 ` Shaohua Li
  2017-02-27 15:06   ` Michal Hocko
                     ` (2 more replies)
  5 siblings, 3 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-24 21:31 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

Show MADV_FREE page info for each VMA in smaps. The interface is for
diagnostic or monitoring purposes; userspace can use it to understand
what happens in an application. Since userspace can dirty MADV_FREE
pages without the kernel noticing, this interface is the only place we
can get accurate accounting info about MADV_FREE pages.
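
As a usage sketch (the helper below is illustrative, not part of this
patch), a monitoring daemon could sum the new field across a process's
mappings:

	#include <stdio.h>

	/* Sum all "LazyFree:" fields in /proc/<pid>/smaps, in kB. */
	static unsigned long lazyfree_kb(const char *pid)
	{
		char path[64], line[256];
		unsigned long kb, total = 0;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/smaps", pid);
		f = fopen(path, "r");
		if (!f)
			return 0;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "LazyFree: %lu kB", &kb) == 1)
				total += kb;
		fclose(f);
		return total;
	}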

Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 Documentation/filesystems/proc.txt | 4 ++++
 fs/proc/task_mmu.c                 | 8 +++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index c94b467..45853e1 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -412,6 +412,7 @@ Private_Clean:         0 kB
 Private_Dirty:         0 kB
 Referenced:          892 kB
 Anonymous:             0 kB
+LazyFree:              0 kB
 AnonHugePages:         0 kB
 ShmemPmdMapped:        0 kB
 Shared_Hugetlb:        0 kB
@@ -441,6 +442,9 @@ accessed.
 "Anonymous" shows the amount of memory that does not belong to any file.  Even
 a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
+"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
+The memory isn't freed immediately with madvise(). It's freed in memory
+pressure if the memory is clean.
 "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
 "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
 huge pages.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb2..8a5ec00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -440,6 +440,7 @@ struct mem_size_stats {
 	unsigned long private_dirty;
 	unsigned long referenced;
 	unsigned long anonymous;
+	unsigned long lazyfree;
 	unsigned long anonymous_thp;
 	unsigned long shmem_thp;
 	unsigned long swap;
@@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	int i, nr = compound ? 1 << compound_order(page) : 1;
 	unsigned long size = nr * PAGE_SIZE;
 
-	if (PageAnon(page))
+	if (PageAnon(page)) {
 		mss->anonymous += size;
+		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
+			mss->lazyfree += size;
+	}
 
 	mss->resident += size;
 	/* Accumulate the size in pages that have been accessed. */
@@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   "Private_Dirty:  %8lu kB\n"
 		   "Referenced:     %8lu kB\n"
 		   "Anonymous:      %8lu kB\n"
+		   "LazyFree:       %8lu kB\n"
 		   "AnonHugePages:  %8lu kB\n"
 		   "ShmemPmdMapped: %8lu kB\n"
 		   "Shared_Hugetlb: %8lu kB\n"
@@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
 		   mss.private_dirty >> 10,
 		   mss.referenced >> 10,
 		   mss.anonymous >> 10,
+		   mss.lazyfree >> 10,
 		   mss.anonymous_thp >> 10,
 		   mss.shmem_thp >> 10,
 		   mss.shared_hugetlb >> 10,
-- 
2.9.3


* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
@ 2017-02-27  6:28   ` Minchan Kim
  2017-02-27 16:13     ` Shaohua Li
  2017-02-27 14:53   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Minchan Kim @ 2017-02-27  6:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

Hello Shaohua,

On Fri, Feb 24, 2017 at 01:31:46PM -0800, Shaohua Li wrote:
> madvise(MADV_FREE) indicates that pages are 'lazyfree'. They are still
> anonymous pages, but they can be freed without pageout. To distinguish
> them from normal anonymous pages, we clear their SwapBacked flag.
> 
> [...]

This patch doesn't address what I pointed out in v4.

https://marc.info/?i=20170224233752.GB4635%40bbox

Let's discuss it if you still are against.


* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
@ 2017-02-27  6:33   ` Minchan Kim
  2017-02-27 16:19     ` Shaohua Li
  2017-02-27 15:05   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 44+ messages in thread
From: Minchan Kim @ 2017-02-27  6:33 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

Hi Shaohua,

On Fri, Feb 24, 2017 at 01:31:47PM -0800, Shaohua Li wrote:
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in the pte, they can be freed immediately; otherwise we can't
> reclaim them that way. In that case we put the pages back on the
> anonymous LRU list (by setting the SwapBacked flag) and the pages will
> be reclaimed via the normal swapout path.
> 
> We use the normal page reclaim policy. Since MADV_FREE pages are put
> into the inactive file list, such pages and inactive file pages are
> reclaimed according to their age. This is expected, because we don't
> want to reclaim too many MADV_FREE pages before used-once file pages.
> 
> Based on Minchan's original patch.
> 
> [...]
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7a39414..fee10d7 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
>  #define SWAP_AGAIN	1
>  #define SWAP_FAIL	2
>  #define SWAP_MLOCK	3
> -#define SWAP_LZFREE	4
> +#define SWAP_DIRTY	4

I'm still not convinced that we should introduce SWAP_DIRTY in try_to_unmap.
https://marc.info/?l=linux-mm&m=148797879123238&w=2

We have been doing SetPageMlocked in there, so why can't we do
SetPageSwapBacked in there as well? It's not about changing the LRU type;
it's just an indication that we found, late, that the page's status had
changed.

> [...]


* Re: [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-24 21:31 ` [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
@ 2017-02-27  6:48   ` Hillf Danton
  2017-02-27 14:35   ` Michal Hocko
  1 sibling, 0 replies; 44+ messages in thread
From: Hillf Danton @ 2017-02-27  6:48 UTC (permalink / raw)
  To: 'Shaohua Li', linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm


On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> There are a few places where the code assumes anonymous pages have the
> SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
> going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked
> flag for them. The assumption doesn't hold any more, so fix those places.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>


* Re: [PATCH V5 1/6] mm: delete unnecessary TTU_* flags
  2017-02-24 21:31 ` [PATCH V5 1/6] mm: delete unnecessary TTU_* flags Shaohua Li
@ 2017-02-27 13:48   ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 13:48 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:44, Shaohua Li wrote:
> Johannes pointed out that TTU_LZFREE is unnecessary. That's true because
> the flag is always set whenever we want to do an unmap; in the cases
> where we don't unmap, the TTU_LZFREE part of the code should never run.
> 
> TTU_UNMAP is also unnecessary: if no other mode flag is set (for
> example, TTU_MIGRATION), an unmap is implied.
> 
> The patch includes Johannes's cleanup and removes the dead TTU_ACTION
> macro.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> [...]

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
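
To make the flag scheme above concrete, here is a minimal, self-contained
userspace sketch of the same pattern: after this cleanup there is no
TTU_UNMAP bit, a plain unmap is the implied default, and the enum only
carries modes and modifiers. The names mirror the kernel's enum, but the
program itself is purely illustrative, not kernel code.

#include <stdio.h>

enum ttu_flags {
	TTU_MIGRATION		= 0x1,	/* migration mode */
	TTU_MUNLOCK		= 0x2,	/* munlock mode */
	TTU_SPLIT_HUGE_PMD	= 0x4,
	TTU_IGNORE_MLOCK	= 0x8,
	TTU_IGNORE_ACCESS	= 0x10,
};

#define TTU_MODE_MASK	(TTU_MIGRATION | TTU_MUNLOCK)

/* Absence of a mode bit implies a plain unmap */
static void try_to_unmap_demo(enum ttu_flags flags)
{
	if (flags & TTU_MODE_MASK)
		printf("special mode 0x%x\n", flags & TTU_MODE_MASK);
	else
		printf("plain unmap, modifiers 0x%x\n", flags);
}

int main(void)
{
	try_to_unmap_demo(TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
	try_to_unmap_demo(TTU_MIGRATION);
	return 0;
}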

* Re: [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-24 21:31 ` [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
  2017-02-27  6:48   ` Hillf Danton
@ 2017-02-27 14:35   ` Michal Hocko
  2017-02-27 16:10     ` Shaohua Li
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 14:35 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:45, Shaohua Li wrote:
> There are a few places where the code assumes anonymous pages should
> have the SwapBacked flag set. MADV_FREE pages are anonymous pages, but
> we are going to add them to the LRU_INACTIVE_FILE list and clear the
> SwapBacked flag for them. The assumption doesn't hold any more, so fix
> them.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Looks good to me.
[...]
> index 96eb85c..c621088 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1416,7 +1416,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			 * Store the swap location in the pte.
>  			 * See handle_pte_fault() ...
>  			 */
> -			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> +			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> +				page);

just this part makes me scratch my head. I really do not understand what
kind of problem it tries to prevent, maybe I am missing something
obvious...

>  
>  			if (!PageDirty(page)) {
>  				/* It's a freeable page by MADV_FREE */
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
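
For readers puzzled by the same line, here is a tiny userspace sketch of
the invariant the reworked VM_BUG_ON encodes (plain C with a mock page
struct, purely illustrative): an anonymous page reaching the swap-entry
branch must be in the swap cache unless it is a lazyfree page, i.e.
unless SwapBacked has been cleared.

#include <assert.h>
#include <stdbool.h>

struct mock_page {
	bool swapcache;		/* PageSwapCache() */
	bool swapbacked;	/* PageSwapBacked() */
};

static void check_swap_branch(const struct mock_page *page)
{
	/* Old check: assert(page->swapcache); trips on lazyfree pages */
	assert(page->swapcache || !page->swapbacked);
}

int main(void)
{
	struct mock_page lazyfree = { .swapcache = false, .swapbacked = false };
	struct mock_page swapped  = { .swapcache = true,  .swapbacked = true  };

	check_swap_branch(&lazyfree);	/* OK: lazyfree, no swap entry needed */
	check_swap_branch(&swapped);	/* OK: normal anon page in swap cache */
	return 0;
}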

* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
  2017-02-27  6:28   ` Minchan Kim
@ 2017-02-27 14:53   ` Michal Hocko
  2017-02-27 17:15   ` Johannes Weiner
  2017-02-28  3:19   ` Hillf Danton
  3 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 14:53 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:46, Shaohua Li wrote:
> madvise MADV_FREE indicates pages are 'lazyfree'. They are still
> anonymous pages, but they can be freed without pageout. To distinguish
> them from normal anonymous pages, we clear their SwapBacked flag.
> 
> MADV_FREE pages can be freed without pageout, so they are pretty much
> like used-once file pages. For such pages, we'd like to reclaim them
> once there is memory pressure. Also, it might be unfair to always
> reclaim MADV_FREE pages before used-once file pages, and we definitely
> want to reclaim the pages before other anonymous and file pages.
> 
> To speed up MADV_FREE page reclaim, we put the pages into the
> LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
> list is tiny nowadays and should be full of used-once file pages.
> Reclaiming MADV_FREE pages will not interfere much with anonymous and
> active file pages. And since the inactive file pages and MADV_FREE
> pages are reclaimed according to their age, we don't reclaim too many
> MADV_FREE pages either. Putting the MADV_FREE pages into the
> LRU_INACTIVE_FILE list also means we can reclaim the pages without
> swap support. This idea is suggested by Johannes.
> 
> This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
> yet, to avoid bisect failure; the next patch will do it.

This patch also changes the behavior of madv_freed pages on the active
list because they are not moved to the inactive list, but considering
how anon pages are reclaimed these days I do not really think this will
be noticeable.

> The patch is based on Minchan's original patch.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/swap.h          |  2 +-
>  include/linux/vm_event_item.h |  2 +-
>  mm/huge_memory.c              |  3 ---
>  mm/madvise.c                  |  2 --
>  mm/swap.c                     | 50 ++++++++++++++++++++++++-------------------
>  mm/vmstat.c                   |  1 +
>  6 files changed, 31 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 45e91dd..486494e 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -279,7 +279,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> -extern void deactivate_page(struct page *page);
> +extern void mark_page_lazyfree(struct page *page);
>  extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 6aa1b6c..94e58da 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -25,7 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		FOR_ALL_ZONES(PGALLOC),
>  		FOR_ALL_ZONES(ALLOCSTALL),
>  		FOR_ALL_ZONES(PGSCAN_SKIP),
> -		PGFREE, PGACTIVATE, PGDEACTIVATE,
> +		PGFREE, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE,
>  		PGFAULT, PGMAJFAULT,
>  		PGLAZYFREED,
>  		PGREFILL,
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index cf9fb46..3b7ee0c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1562,9 +1562,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		ClearPageDirty(page);
>  	unlock_page(page);
>  
> -	if (PageActive(page))
> -		deactivate_page(page);
> -
>  	if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
>  		orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
>  			tlb->fullmm);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c..61e10b1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -411,8 +411,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			ptent = pte_mkold(ptent);
>  			ptent = pte_mkclean(ptent);
>  			set_pte_at(mm, addr, pte, ptent);
> -			if (PageActive(page))
> -				deactivate_page(page);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
>  		}
>  	}
> diff --git a/mm/swap.c b/mm/swap.c
> index c4910f1..c4fb4b9 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -46,7 +46,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> -static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
>  #endif
> @@ -561,20 +561,26 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  }
>  
>  
> -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
>  			    void *arg)
>  {
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		int file = page_is_file_cache(page);
> -		int lru = page_lru_base_type(page);
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		bool active = PageActive(page);
>  
> -		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		del_page_from_lru_list(page, lruvec, LRU_INACTIVE_ANON + active);
>  		ClearPageActive(page);
>  		ClearPageReferenced(page);
> -		add_page_to_lru_list(page, lruvec, lru);
> +		/*
> +		 * lazyfree pages are clean anonymous pages. They have
> +		 * SwapBacked flag cleared to distinguish normal anonymous
> +		 * pages
> +		 */
> +		ClearPageSwapBacked(page);
> +		add_page_to_lru_list(page, lruvec, LRU_INACTIVE_FILE);
>  
> -		__count_vm_event(PGDEACTIVATE);
> -		update_page_reclaim_stat(lruvec, file, 0);
> +		__count_vm_events(PGLAZYFREE, hpage_nr_pages(page));
> +		update_page_reclaim_stat(lruvec, 1, 0);
>  	}
>  }
>  
> @@ -604,9 +610,9 @@ void lru_add_drain_cpu(int cpu)
>  	if (pagevec_count(pvec))
>  		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
>  
> -	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +	pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
>  	if (pagevec_count(pvec))
> -		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
>  
>  	activate_page_drain(cpu);
>  }
> @@ -638,22 +644,22 @@ void deactivate_file_page(struct page *page)
>  }
>  
>  /**
> - * deactivate_page - deactivate a page
> + * mark_page_lazyfree - make an anon page lazyfree
>   * @page: page to deactivate
>   *
> - * deactivate_page() moves @page to the inactive list if @page was on the active
> - * list and was not an unevictable page.  This is done to accelerate the reclaim
> - * of @page.
> + * mark_page_lazyfree() moves @page to the inactive file list.
> + * This is done to accelerate the reclaim of @page.
>   */
> -void deactivate_page(struct page *page)
> -{
> -	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> -		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +void mark_page_lazyfree(struct page *page)
> +{
> +	if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
> +	    !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
>  
>  		get_page(page);
>  		if (!pagevec_add(pvec, page) || PageCompound(page))
> -			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> -		put_cpu_var(lru_deactivate_pvecs);
> +			pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
> +		put_cpu_var(lru_lazyfree_pvecs);
>  	}
>  }
>  
> @@ -704,7 +710,7 @@ void lru_add_drain_all(void)
>  		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
> -		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
> +		    pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
>  		    need_activate_page_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
>  			queue_work_on(cpu, lru_add_drain_wq, work);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff..7774196 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -992,6 +992,7 @@ const char * const vmstat_text[] = {
>  	"pgfree",
>  	"pgactivate",
>  	"pgdeactivate",
> +	"pglazyfree",
>  
>  	"pgfault",
>  	"pgmajfault",
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
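
From the userspace side, the life cycle described in this patch boils
down to the sequence below; a minimal runnable sketch, assuming a kernel
and libc that provide MADV_FREE (Linux 4.5+). The kernel keeps the pages
mapped until memory pressure, and dirtying a page again turns it back
into a normal anonymous page.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB anonymous mapping */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);		/* fault the pages in */

	/* Mark the range lazyfree: freeable without pageout */
	if (madvise(p, len, MADV_FREE))
		perror("madvise(MADV_FREE)");

	p[0] = 2;	/* dirtying a page cancels lazyfree for it */

	munmap(p, len);
	return 0;
}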

* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
  2017-02-27  6:33   ` Minchan Kim
@ 2017-02-27 15:05   ` Michal Hocko
  2017-02-27 17:21   ` Johannes Weiner
  2017-02-28  3:21   ` Hillf Danton
  3 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 15:05 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:47, Shaohua Li wrote:
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in the pte, the pages can be freed immediately. Otherwise we
> can't reclaim them. We put the pages back on the anonymous LRU list (by
> setting the SwapBacked flag) and the pages will be reclaimed in the
> normal swapout way.
> 
> We use the normal page reclaim policy. Since MADV_FREE pages are put
> into the inactive file list, such pages and inactive file pages are
> reclaimed according to their age. This is expected, because we don't
> want to reclaim too many MADV_FREE pages before used-once pages.
> 
> Based on Minchan's original patch

OK, this looks much cleaner and easier to follow than the original
version I have seen.
 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/rmap.h |  2 +-
>  mm/huge_memory.c     |  2 ++
>  mm/madvise.c         |  1 +
>  mm/rmap.c            | 40 +++++++++++++++++-----------------------
>  mm/vmscan.c          | 34 ++++++++++++++++++++++------------
>  5 files changed, 43 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 7a39414..fee10d7 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
>  #define SWAP_AGAIN	1
>  #define SWAP_FAIL	2
>  #define SWAP_MLOCK	3
> -#define SWAP_LZFREE	4
> +#define SWAP_DIRTY	4
>  
>  #endif	/* _LINUX_RMAP_H */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3b7ee0c..4c7454b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1571,6 +1571,8 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		set_pmd_at(mm, addr, pmd, orig_pmd);
>  		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
>  	}
> +
> +	mark_page_lazyfree(page);
>  	ret = true;
>  out:
>  	spin_unlock(ptl);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 61e10b1..225af7d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -413,6 +413,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			set_pte_at(mm, addr, pte, ptent);
>  			tlb_remove_tlb_entry(tlb, pte, addr);
>  		}
> +		mark_page_lazyfree(page);
>  	}
>  out:
>  	if (nr_swap) {
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c621088..bb45712 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1281,11 +1281,6 @@ void page_remove_rmap(struct page *page, bool compound)
>  	 */
>  }
>  
> -struct rmap_private {
> -	enum ttu_flags flags;
> -	int lazyfreed;
> -};
> -
>  /*
>   * @arg: enum ttu_flags will be passed to this argument
>   */
> @@ -1301,8 +1296,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	struct page *subpage;
>  	int ret = SWAP_AGAIN;
> -	struct rmap_private *rp = arg;
> -	enum ttu_flags flags = rp->flags;
> +	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
>  	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> @@ -1419,11 +1413,21 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
>  				page);
>  
> -			if (!PageDirty(page)) {
> +			/*
> +			 * swapin page could be clean, it has data stored in
> +			 * swap. We can't silently discard it without setting
> +			 * swap entry in the page table.
> +			 */
> +			if (!PageDirty(page) && !PageSwapCache(page)) {
>  				/* It's a freeable page by MADV_FREE */
>  				dec_mm_counter(mm, MM_ANONPAGES);
> -				rp->lazyfreed++;
>  				goto discard;
> +			} else if (!PageSwapBacked(page)) {
> +				/* dirty MADV_FREE page */
> +				set_pte_at(mm, address, pvmw.pte, pteval);
> +				ret = SWAP_DIRTY;
> +				page_vma_mapped_walk_done(&pvmw);
> +				break;
>  			}
>  
>  			if (swap_duplicate(entry) < 0) {
> @@ -1491,18 +1495,15 @@ static int page_mapcount_is_zero(struct page *page)
>   * SWAP_AGAIN	- we missed a mapping, try again later
>   * SWAP_FAIL	- the page is unswappable
>   * SWAP_MLOCK	- page is mlocked.
> + * SWAP_DIRTY	- page is dirty MADV_FREE page
>   */
>  int try_to_unmap(struct page *page, enum ttu_flags flags)
>  {
>  	int ret;
> -	struct rmap_private rp = {
> -		.flags = flags,
> -		.lazyfreed = 0,
> -	};
>  
>  	struct rmap_walk_control rwc = {
>  		.rmap_one = try_to_unmap_one,
> -		.arg = &rp,
> +		.arg = (void *)flags,
>  		.done = page_mapcount_is_zero,
>  		.anon_lock = page_lock_anon_vma_read,
>  	};
> @@ -1523,11 +1524,8 @@ int try_to_unmap(struct page *page, enum ttu_flags flags)
>  	else
>  		ret = rmap_walk(page, &rwc);
>  
> -	if (ret != SWAP_MLOCK && !page_mapcount(page)) {
> +	if (ret != SWAP_MLOCK && !page_mapcount(page))
>  		ret = SWAP_SUCCESS;
> -		if (rp.lazyfreed && !PageDirty(page))
> -			ret = SWAP_LZFREE;
> -	}
>  	return ret;
>  }
>  
> @@ -1554,14 +1552,10 @@ static int page_not_mapped(struct page *page)
>  int try_to_munlock(struct page *page)
>  {
>  	int ret;
> -	struct rmap_private rp = {
> -		.flags = TTU_MUNLOCK,
> -		.lazyfreed = 0,
> -	};
>  
>  	struct rmap_walk_control rwc = {
>  		.rmap_one = try_to_unmap_one,
> -		.arg = &rp,
> +		.arg = (void *)TTU_MUNLOCK,
>  		.done = page_not_mapped,
>  		.anon_lock = page_lock_anon_vma_read,
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 68ea50d..16ad821 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -911,7 +911,8 @@ static void page_check_dirty_writeback(struct page *page,
>  	 * Anonymous pages are not handled by flushers and must be written
>  	 * from reclaim context. Do not stall reclaim based on them
>  	 */
> -	if (!page_is_file_cache(page)) {
> +	if (!page_is_file_cache(page) ||
> +	    (PageAnon(page) && !PageSwapBacked(page))) {
>  		*dirty = false;
>  		*writeback = false;
>  		return;
> @@ -992,7 +993,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			goto keep_locked;
>  
>  		/* Double the slab pressure for mapped and swapcache pages */
> -		if (page_mapped(page) || PageSwapCache(page))
> +		if ((page_mapped(page) || PageSwapCache(page)) &&
> +		    !(PageAnon(page) && !PageSwapBacked(page)))
>  			sc->nr_scanned++;
>  
>  		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
> @@ -1118,8 +1120,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		/*
>  		 * Anonymous process memory has backing store?
>  		 * Try to allocate it some swap space here.
> +		 * Lazyfree page could be freed directly
>  		 */
> -		if (PageAnon(page) && !PageSwapCache(page)) {
> +		if (PageAnon(page) && PageSwapBacked(page) &&
> +		    !PageSwapCache(page)) {
>  			if (!(sc->gfp_mask & __GFP_IO))
>  				goto keep_locked;
>  			if (!add_to_swap(page, page_list))
> @@ -1140,9 +1144,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 * The page is mapped into the page tables of one or more
>  		 * processes. Try to unmap it here.
>  		 */
> -		if (page_mapped(page) && mapping) {
> +		if (page_mapped(page)) {
>  			switch (ret = try_to_unmap(page,
>  				ttu_flags | TTU_BATCH_FLUSH)) {
> +			case SWAP_DIRTY:
> +				SetPageSwapBacked(page);
> +				/* fall through */
>  			case SWAP_FAIL:
>  				nr_unmap_fail++;
>  				goto activate_locked;
> @@ -1150,8 +1157,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				goto keep_locked;
>  			case SWAP_MLOCK:
>  				goto cull_mlocked;
> -			case SWAP_LZFREE:
> -				goto lazyfree;
>  			case SWAP_SUCCESS:
>  				; /* try to free the page below */
>  			}
> @@ -1263,10 +1268,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			}
>  		}
>  
> -lazyfree:
> -		if (!mapping || !__remove_mapping(mapping, page, true))
> -			goto keep_locked;
> +		if (PageAnon(page) && !PageSwapBacked(page)) {
> +			/* follow __remove_mapping for reference */
> +			if (!page_ref_freeze(page, 1))
> +				goto keep_locked;
> +			if (PageDirty(page)) {
> +				page_ref_unfreeze(page, 1);
> +				goto keep_locked;
> +			}
>  
> +			count_vm_event(PGLAZYFREED);
> +		} else if (!mapping || !__remove_mapping(mapping, page, true))
> +			goto keep_locked;
>  		/*
>  		 * At this point, we have no other references and there is
>  		 * no way to pick any more up (removed from LRU, removed
> @@ -1276,9 +1289,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 */
>  		__ClearPageLocked(page);
>  free_it:
> -		if (ret == SWAP_LZFREE)
> -			count_vm_event(PGLAZYFREED);
> -
>  		nr_reclaimed++;
>  
>  		/*
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
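
With patches 3 and 4 applied, the new pglazyfree/pglazyfreed events show
how many pages were marked lazyfree and how many were actually reclaimed
for free. A small sketch that pulls those lines out of /proc/vmstat (the
counter names are taken from the vmstat_text[] additions in this series;
assumes a kernel with the series applied):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;

	/* "pglazyfree" also matches "pglazyfreed" */
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "pglazyfree", 10))
			fputs(line, stdout);

	fclose(f);
	return 0;
}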

* Re: [PATCH V5 5/6] mm: enable MADV_FREE for swapless system
  2017-02-24 21:31 ` [PATCH V5 5/6] mm: enable MADV_FREE for swapless system Shaohua Li
@ 2017-02-27 15:06   ` Michal Hocko
  2017-02-28  3:22   ` Hillf Danton
  2017-02-28  5:02   ` Minchan Kim
  2 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 15:06 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:48, Shaohua Li wrote:
> Now MADV_FREE pages can be easily reclaimed even on a swapless system.
> We can safely enable MADV_FREE for all systems.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/madvise.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 225af7d..5ab4b7b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -612,13 +612,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  	case MADV_WILLNEED:
>  		return madvise_willneed(vma, prev, start, end);
>  	case MADV_FREE:
> -		/*
> -		 * XXX: In this implementation, MADV_FREE works like
> -		 * MADV_DONTNEED on swapless system or full swap.
> -		 */
> -		if (get_nr_swap_pages() > 0)
> -			return madvise_free(vma, prev, start, end);
> -		/* passthrough */
> +		return madvise_free(vma, prev, start, end);
>  	case MADV_DONTNEED:
>  		return madvise_dontneed(vma, prev, start, end);
>  	default:
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
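
The user-visible effect of this patch is easy to miss: before it,
madvise(MADV_FREE) silently degraded to MADV_DONTNEED when no swap was
configured, so allocators had to care about swap state; after it, the
lazyfree semantics apply either way. A hedged sketch of the kind of
check that becomes unnecessary:

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;

	if (sysinfo(&si))
		return 1;

	/* Pre-patch, this distinction mattered to MADV_FREE users */
	printf("total swap: %lu bytes -> %s\n",
	       si.totalswap * si.mem_unit,
	       si.totalswap ? "MADV_FREE was lazy" :
			      "MADV_FREE fell back to MADV_DONTNEED");
	return 0;
}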

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-02-24 21:31 ` [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps Shaohua Li
@ 2017-02-27 15:06   ` Michal Hocko
  2017-02-28  3:23   ` Hillf Danton
  2017-03-01 13:36   ` Michal Hocko
  2 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 15:06 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> Show MADV_FREE pages info for each vma in smaps. The interface is for
> diagnosis or monitoring purposes; userspace can use it to understand
> what happens in the application. Since userspace can dirty MADV_FREE
> pages without notice from the kernel, this interface is the only place
> we can get accurate accounting info about MADV_FREE pages.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/filesystems/proc.txt | 4 ++++
>  fs/proc/task_mmu.c                 | 8 +++++++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> index c94b467..45853e1 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -412,6 +412,7 @@ Private_Clean:         0 kB
>  Private_Dirty:         0 kB
>  Referenced:          892 kB
>  Anonymous:             0 kB
> +LazyFree:              0 kB
>  AnonHugePages:         0 kB
>  ShmemPmdMapped:        0 kB
>  Shared_Hugetlb:        0 kB
> @@ -441,6 +442,9 @@ accessed.
>  "Anonymous" shows the amount of memory that does not belong to any file.  Even
>  a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
>  and a page is modified, the file page is replaced by a private anonymous copy.
> +"LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
> +The memory isn't freed immediately with madvise(). It's freed in memory
> +pressure if the memory is clean.
>  "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
>  "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
>  huge pages.
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index ee3efb2..8a5ec00 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -440,6 +440,7 @@ struct mem_size_stats {
>  	unsigned long private_dirty;
>  	unsigned long referenced;
>  	unsigned long anonymous;
> +	unsigned long lazyfree;
>  	unsigned long anonymous_thp;
>  	unsigned long shmem_thp;
>  	unsigned long swap;
> @@ -456,8 +457,11 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
>  	int i, nr = compound ? 1 << compound_order(page) : 1;
>  	unsigned long size = nr * PAGE_SIZE;
>  
> -	if (PageAnon(page))
> +	if (PageAnon(page)) {
>  		mss->anonymous += size;
> +		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
> +			mss->lazyfree += size;
> +	}
>  
>  	mss->resident += size;
>  	/* Accumulate the size in pages that have been accessed. */
> @@ -770,6 +774,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   "Private_Dirty:  %8lu kB\n"
>  		   "Referenced:     %8lu kB\n"
>  		   "Anonymous:      %8lu kB\n"
> +		   "LazyFree:       %8lu kB\n"
>  		   "AnonHugePages:  %8lu kB\n"
>  		   "ShmemPmdMapped: %8lu kB\n"
>  		   "Shared_Hugetlb: %8lu kB\n"
> @@ -788,6 +793,7 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
>  		   mss.private_dirty >> 10,
>  		   mss.referenced >> 10,
>  		   mss.anonymous >> 10,
> +		   mss.lazyfree >> 10,
>  		   mss.anonymous_thp >> 10,
>  		   mss.shmem_thp >> 10,
>  		   mss.shared_hugetlb >> 10,
> -- 
> 2.9.3
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
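
To see the new field in action, the sketch below marks a mapping
lazyfree and then prints the nonzero "LazyFree:" lines from
/proc/self/smaps (assumes MADV_FREE support plus this patch; whether a
value shows up immediately can depend on the per-cpu pagevec drain, so
this is a diagnostic aid rather than an exact API):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 16UL << 20;
	char line[256];
	FILE *f;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);
	madvise(p, len, MADV_FREE);

	f = fopen("/proc/self/smaps", "r");
	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "LazyFree:", 9) && atol(line + 9) > 0)
			fputs(line, stdout);	/* e.g. "LazyFree: 16384 kB" */

	fclose(f);
	return 0;
}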

* Re: [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-27 14:35   ` Michal Hocko
@ 2017-02-27 16:10     ` Shaohua Li
  2017-02-27 16:28       ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Shaohua Li @ 2017-02-27 16:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Mon, Feb 27, 2017 at 03:35:34PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:45, Shaohua Li wrote:
> > There are a few places where the code assumes anonymous pages should
> > have the SwapBacked flag set. MADV_FREE pages are anonymous pages,
> > but we are going to add them to the LRU_INACTIVE_FILE list and clear
> > the SwapBacked flag for them. The assumption doesn't hold any more,
> > so fix them.
> > 
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> 
> Looks good to me.
> [...]
> > index 96eb85c..c621088 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1416,7 +1416,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >  			 * Store the swap location in the pte.
> >  			 * See handle_pte_fault() ...
> >  			 */
> > -			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> > +			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> > +				page);
> 
> just this part makes me scratch my head. I really do not understand what
> kind of problem it tries to prevent, maybe I am missing something
> obvious...

It just checks for a page which isn't lazyfree but wrongly enters here
without a swap entry. Or maybe you suggest we delete this statement?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-27  6:28   ` Minchan Kim
@ 2017-02-27 16:13     ` Shaohua Li
  2017-02-27 16:30       ` Michal Hocko
  2017-02-28  2:53       ` Minchan Kim
  0 siblings, 2 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-27 16:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

On Mon, Feb 27, 2017 at 03:28:01PM +0900, Minchan Kim wrote:
> Hello Shaohua,
> 
> On Fri, Feb 24, 2017 at 01:31:46PM -0800, Shaohua Li wrote:
> > madvise MADV_FREE indicates pages are 'lazyfree'. They are still
> > anonymous pages, but they can be freed without pageout. To
> > distinguish them from normal anonymous pages, we clear their
> > SwapBacked flag.
> > 
> > MADV_FREE pages can be freed without pageout, so they are pretty much
> > like used-once file pages. For such pages, we'd like to reclaim them
> > once there is memory pressure. Also, it might be unfair to always
> > reclaim MADV_FREE pages before used-once file pages, and we
> > definitely want to reclaim the pages before other anonymous and file
> > pages.
> > 
> > To speed up MADV_FREE page reclaim, we put the pages into the
> > LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
> > list is tiny nowadays and should be full of used-once file pages.
> > Reclaiming MADV_FREE pages will not interfere much with anonymous and
> > active file pages. And since the inactive file pages and MADV_FREE
> > pages are reclaimed according to their age, we don't reclaim too many
> > MADV_FREE pages either. Putting the MADV_FREE pages into the
> > LRU_INACTIVE_FILE list also means we can reclaim the pages without
> > swap support. This idea is suggested by Johannes.
> > 
> > This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
> > yet, to avoid bisect failure; the next patch will do it.
> > 
> > The patch is based on Minchan's original patch.
> > 
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> 
> This patch doesn't address what I pointed out in v4.
> 
> https://marc.info/?i=20170224233752.GB4635%40bbox
> 
> Let's discuss it if you still are against.

I really think a separate patch makes the code clearer. There are a lot
of places where we introduce a function but don't use it immediately, if
that way makes the code clearer. But anyway, I'll let Andrew decide if
the two patches should be merged.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-27  6:33   ` Minchan Kim
@ 2017-02-27 16:19     ` Shaohua Li
  2017-02-27 16:32       ` Michal Hocko
  2017-02-28  5:02       ` Minchan Kim
  0 siblings, 2 replies; 44+ messages in thread
From: Shaohua Li @ 2017-02-27 16:19 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

On Mon, Feb 27, 2017 at 03:33:15PM +0900, Minchan Kim wrote:
> Hi Shaohua,
> 
> On Fri, Feb 24, 2017 at 01:31:47PM -0800, Shaohua Li wrote:
> > When memory pressure is high, we free MADV_FREE pages. If the pages
> > are not dirty in the pte, the pages can be freed immediately.
> > Otherwise we can't reclaim them. We put the pages back on the
> > anonymous LRU list (by setting the SwapBacked flag) and the pages
> > will be reclaimed in the normal swapout way.
> > 
> > We use the normal page reclaim policy. Since MADV_FREE pages are put
> > into the inactive file list, such pages and inactive file pages are
> > reclaimed according to their age. This is expected, because we don't
> > want to reclaim too many MADV_FREE pages before used-once pages.
> > 
> > Based on Minchan's original patch
> > 
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Minchan Kim <minchan@kernel.org>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Rik van Riel <riel@redhat.com>
> > Cc: Mel Gorman <mgorman@techsingularity.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  include/linux/rmap.h |  2 +-
> >  mm/huge_memory.c     |  2 ++
> >  mm/madvise.c         |  1 +
> >  mm/rmap.c            | 40 +++++++++++++++++-----------------------
> >  mm/vmscan.c          | 34 ++++++++++++++++++++++------------
> >  5 files changed, 43 insertions(+), 36 deletions(-)
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 7a39414..fee10d7 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
> >  #define SWAP_AGAIN	1
> >  #define SWAP_FAIL	2
> >  #define SWAP_MLOCK	3
> > -#define SWAP_LZFREE	4
> > +#define SWAP_DIRTY	4
> 
> I'm still not convinced that we should introduce SWAP_DIRTY in try_to_unmap.
> https://marc.info/?l=linux-mm&m=148797879123238&w=2
> 
> We have been doing SetPageMlocked in there, so why can't we
> SetPageSwapBacked in there? It's not a matter of changing the LRU type;
> it's just an indication that we found the page's status changed late.

On this one I don't have a strong preference. Personally I agree with
Johannes that handling the failure in vmscan sounds better. But since
the failure handling is just one statement, this probably doesn't make
too much difference. If you and Johannes reach an agreement, I'll
follow.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag
  2017-02-27 16:10     ` Shaohua Li
@ 2017-02-27 16:28       ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 16:28 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Mon 27-02-17 08:10:24, Shaohua Li wrote:
> On Mon, Feb 27, 2017 at 03:35:34PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:45, Shaohua Li wrote:
> > > There are a few places where the code assumes anonymous pages
> > > should have the SwapBacked flag set. MADV_FREE pages are anonymous
> > > pages, but we are going to add them to the LRU_INACTIVE_FILE list
> > > and clear the SwapBacked flag for them. The assumption doesn't hold
> > > any more, so fix them.
> > > 
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Shaohua Li <shli@fb.com>

Anyway, feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> > 
> > Looks good to me.
> > [...]
> > > index 96eb85c..c621088 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1416,7 +1416,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >  			 * Store the swap location in the pte.
> > >  			 * See handle_pte_fault() ...
> > >  			 */
> > > -			VM_BUG_ON_PAGE(!PageSwapCache(page), page);
> > > +			VM_BUG_ON_PAGE(!PageSwapCache(page) && PageSwapBacked(page),
> > > +				page);
> > 
> > just this part makes me scratch my head. I really do not understand
> > what kind of problem it tries to prevent, maybe I am missing
> > something obvious...
> 
> It just checks for a page which isn't lazyfree but wrongly enters here
> without a swap entry. Or maybe you suggest we delete this statement?

Ohh, I figured it out when seeing a later patch in the series; I then
wanted to get back to this one but forgot... This on its own didn't
really tell me much. Maybe a comment would be helpful, or we could even
drop the VM_BUG_ON altogether.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-27 16:13     ` Shaohua Li
@ 2017-02-27 16:30       ` Michal Hocko
  2017-02-28  2:53       ` Minchan Kim
  1 sibling, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 16:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Minchan Kim, linux-mm, linux-kernel, Kernel-team, hughd, hannes,
	riel, mgorman, akpm

On Mon 27-02-17 08:13:10, Shaohua Li wrote:
> On Mon, Feb 27, 2017 at 03:28:01PM +0900, Minchan Kim wrote:
[...]
> > This patch doesn't address what I pointed out in v4.
> > 
> > https://marc.info/?i=20170224233752.GB4635%40bbox
> > 
> > Let's discuss it if you still are against.
> 
> I really think a separate patch makes the code clearer. There are a lot
> of places where we introduce a function but don't use it immediately, if
> that way makes the code clearer. But anyway, I'll let Andrew decide if
> the two patches should be merged.

I agree that it is almost always _preferable_ to add new functions along
with their callers. In this particular case I would lean towards keeping
the separation the way Shaohua did it because it really makes the code
cleaner, IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-27 16:19     ` Shaohua Li
@ 2017-02-27 16:32       ` Michal Hocko
  2017-02-28  5:02       ` Minchan Kim
  1 sibling, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2017-02-27 16:32 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Minchan Kim, linux-mm, linux-kernel, Kernel-team, hughd, hannes,
	riel, mgorman, akpm

On Mon 27-02-17 08:19:08, Shaohua Li wrote:
> On Mon, Feb 27, 2017 at 03:33:15PM +0900, Minchan Kim wrote:
[...]
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
> > >  #define SWAP_AGAIN	1
> > >  #define SWAP_FAIL	2
> > >  #define SWAP_MLOCK	3
> > > -#define SWAP_LZFREE	4
> > > +#define SWAP_DIRTY	4
> > 
> > I'm still not convinced that we should introduce SWAP_DIRTY in try_to_unmap.
> > https://marc.info/?l=linux-mm&m=148797879123238&w=2
> > 
> > We have been doing SetPageMlocked in there, so why can't we
> > SetPageSwapBacked in there? It's not a matter of changing the LRU
> > type; it's just an indication that we found the page's status changed
> > late.
> 
> On this one I don't have a strong preference. Personally I agree with
> Johannes that handling the failure in vmscan sounds better. But since
> the failure handling is just one statement, this probably doesn't make
> too much difference. If you and Johannes reach an agreement, I'll
> follow.

FWIW I like your current SWAP_DIRTY and the later handling at the vmscan
level more.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread
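
As a side note for readers following the thread, the two positions
differ only in where SetPageSwapBacked happens; the vmscan-side variant
that this series uses looks like the mock below (userspace C, not kernel
code; the SWAP_* values and labels mirror the hunk in patch 4/6):

#include <stdio.h>

#define SWAP_SUCCESS	0
#define SWAP_AGAIN	1
#define SWAP_FAIL	2
#define SWAP_MLOCK	3
#define SWAP_DIRTY	4

static const char *handle_unmap_result(int ret)
{
	switch (ret) {
	case SWAP_DIRTY:
		/* dirty lazyfree page: back to the anonymous LRU ... */
		puts("SetPageSwapBacked(page)");
		/* fall through: ... then activate it like SWAP_FAIL */
	case SWAP_FAIL:
		return "activate_locked";
	case SWAP_AGAIN:
		return "keep_locked";
	case SWAP_MLOCK:
		return "cull_mlocked";
	default:
		return "try to free the page";
	}
}

int main(void)
{
	printf("SWAP_DIRTY -> %s\n", handle_unmap_result(SWAP_DIRTY));
	return 0;
}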

* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
  2017-02-27  6:28   ` Minchan Kim
  2017-02-27 14:53   ` Michal Hocko
@ 2017-02-27 17:15   ` Johannes Weiner
  2017-02-28  3:19   ` Hillf Danton
  3 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2017-02-27 17:15 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Fri, Feb 24, 2017 at 01:31:46PM -0800, Shaohua Li wrote:
> madvise MADV_FREE indicates pages are 'lazyfree'. They are still
> anonymous pages, but they can be freed without pageout. To distinguish
> them from normal anonymous pages, we clear their SwapBacked flag.
> 
> MADV_FREE pages can be freed without pageout, so they are pretty much
> like used-once file pages. For such pages, we'd like to reclaim them
> once there is memory pressure. Also, it might be unfair to always
> reclaim MADV_FREE pages before used-once file pages, and we definitely
> want to reclaim the pages before other anonymous and file pages.
> 
> To speed up MADV_FREE page reclaim, we put the pages into the
> LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
> list is tiny nowadays and should be full of used-once file pages.
> Reclaiming MADV_FREE pages will not interfere much with anonymous and
> active file pages. And since the inactive file pages and MADV_FREE
> pages are reclaimed according to their age, we don't reclaim too many
> MADV_FREE pages either. Putting the MADV_FREE pages into the
> LRU_INACTIVE_FILE list also means we can reclaim the pages without
> swap support. This idea is suggested by Johannes.
> 
> This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
> yet, to avoid bisect failure; the next patch will do it.
> 
> The patch is based on Minchan's original patch.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
  2017-02-27  6:33   ` Minchan Kim
  2017-02-27 15:05   ` Michal Hocko
@ 2017-02-27 17:21   ` Johannes Weiner
  2017-02-28  3:21   ` Hillf Danton
  3 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2017-02-27 17:21 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, minchan, hughd,
	riel, mgorman, akpm

On Fri, Feb 24, 2017 at 01:31:47PM -0800, Shaohua Li wrote:
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in the pte, the pages can be freed immediately. Otherwise we
> can't reclaim them. We put the pages back on the anonymous LRU list (by
> setting the SwapBacked flag) and the pages will be reclaimed in the
> normal swapout way.
> 
> We use the normal page reclaim policy. Since MADV_FREE pages are put
> into the inactive file list, such pages and inactive file pages are
> reclaimed according to their age. This is expected, because we don't
> want to reclaim too many MADV_FREE pages before used-once pages.
> 
> Based on Minchan's original patch
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

FWIW, I agree with Minchan that this could be folded into the previous
patch and would be a little neater. But I don't feel strongly in this
case since I didn't have any trouble reviewing the patches like this -
void mark_page_lazyfree(struct page *) is an easy API to remember.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-27 16:13     ` Shaohua Li
  2017-02-27 16:30       ` Michal Hocko
@ 2017-02-28  2:53       ` Minchan Kim
  1 sibling, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2017-02-28  2:53 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

Hi,

On Mon, Feb 27, 2017 at 08:13:10AM -0800, Shaohua Li wrote:
> On Mon, Feb 27, 2017 at 03:28:01PM +0900, Minchan Kim wrote:
> > Hello Shaohua,
> > 
> > On Fri, Feb 24, 2017 at 01:31:46PM -0800, Shaohua Li wrote:
> > > madvise MADV_FREE indicates pages are 'lazyfree'. They are still
> > > anonymous pages, but they can be freed without pageout. To
> > > distinguish them from normal anonymous pages, we clear their
> > > SwapBacked flag.
> > > 
> > > MADV_FREE pages can be freed without pageout, so they are pretty
> > > much like used-once file pages. For such pages, we'd like to
> > > reclaim them once there is memory pressure. Also, it might be
> > > unfair to always reclaim MADV_FREE pages before used-once file
> > > pages, and we definitely want to reclaim the pages before other
> > > anonymous and file pages.
> > > 
> > > To speed up MADV_FREE page reclaim, we put the pages into the
> > > LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
> > > list is tiny nowadays and should be full of used-once file pages.
> > > Reclaiming MADV_FREE pages will not interfere much with anonymous
> > > and active file pages. And since the inactive file pages and
> > > MADV_FREE pages are reclaimed according to their age, we don't
> > > reclaim too many MADV_FREE pages either. Putting the MADV_FREE
> > > pages into the LRU_INACTIVE_FILE list also means we can reclaim the
> > > pages without swap support. This idea is suggested by Johannes.
> > > 
> > > This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE
> > > list yet, to avoid bisect failure; the next patch will do it.
> > > 
> > > The patch is based on Minchan's original patch.
> > > 
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Shaohua Li <shli@fb.com>
> > 
> > This patch doesn't address what I pointed out in v4.
> > 
> > https://marc.info/?i=20170224233752.GB4635%40bbox
> > 
> > Let's discuss it if you still are against.
> 
> I really think a separate patch makes the code clearer. There are a lot
> of places where we introduce a function but don't use it immediately, if
> that way makes the code clearer. But anyway, I'll let Andrew decide if
> the two patches should be merged.

Acked-by: Minchan Kim <minchan@kernel.org>

Okay, I won't insist on it any more if others are happy, but please keep
in mind that it's not a good habit, IMHO, because:

First of all, it makes review hard.

You introduce PGLAZYFREE in the patch but a reviewer cannot find where
it is used, so cannot verify that the accounting is right.

You introduce mark_page_lazyfree in the patch but there is no call site
that uses it. How can a reviewer review it properly? We cannot see which
checks are missing, which checks are redundant, and what kind of locking
we need. Is it a hot path or a slow path? Depending on that, we need to
think about the approach.

There are many questions in there. It means we cannot review it without
relying on upcoming patches, which is really not helpful for the review.

As well, it adds an unnecessary bisect point, which is not good either.

I really want to merge the two patches (the introduce part and the
use-it part) unless that makes review really hard or they need
per-subsystem application.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list
  2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
                     ` (2 preceding siblings ...)
  2017-02-27 17:15   ` Johannes Weiner
@ 2017-02-28  3:19   ` Hillf Danton
  3 siblings, 0 replies; 44+ messages in thread
From: Hillf Danton @ 2017-02-28  3:19 UTC (permalink / raw)
  To: 'Shaohua Li', linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm

On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> madvise MADV_FREE indicates pages are 'lazyfree'. They are still
> anonymous pages, but they can be freed without pageout. To distinguish
> them from normal anonymous pages, we clear their SwapBacked flag.
> 
> MADV_FREE pages can be freed without pageout, so they are pretty much
> like used-once file pages. For such pages, we'd like to reclaim them
> once there is memory pressure. Also, it might be unfair to always
> reclaim MADV_FREE pages before used-once file pages, and we definitely
> want to reclaim the pages before other anonymous and file pages.
> 
> To speed up MADV_FREE page reclaim, we put the pages into the
> LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE
> list is tiny nowadays and should be full of used-once file pages.
> Reclaiming MADV_FREE pages will not interfere much with anonymous and
> active file pages. And since the inactive file pages and MADV_FREE
> pages are reclaimed according to their age, we don't reclaim too many
> MADV_FREE pages either. Putting the MADV_FREE pages into the
> LRU_INACTIVE_FILE list also means we can reclaim the pages without
> swap support. This idea is suggested by Johannes.
> 
> This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
> yet, to avoid bisect failure; the next patch will do it.
> 
> The patch is based on Minchan's original patch.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
                     ` (2 preceding siblings ...)
  2017-02-27 17:21   ` Johannes Weiner
@ 2017-02-28  3:21   ` Hillf Danton
  3 siblings, 0 replies; 44+ messages in thread
From: Hillf Danton @ 2017-02-28  3:21 UTC (permalink / raw)
  To: 'Shaohua Li', linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm


On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> When memory pressure is high, we free MADV_FREE pages. If the pages are
> not dirty in the pte, the pages can be freed immediately. Otherwise we
> can't reclaim them. We put the pages back on the anonymous LRU list (by
> setting the SwapBacked flag) and the pages will be reclaimed in the
> normal swapout way.
> 
> We use the normal page reclaim policy. Since MADV_FREE pages are put
> into the inactive file list, such pages and inactive file pages are
> reclaimed according to their age. This is expected, because we don't
> want to reclaim too many MADV_FREE pages before used-once pages.
> 
> Based on Minchan's original patch
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 5/6] mm: enable MADV_FREE for swapless system
  2017-02-24 21:31 ` [PATCH V5 5/6] mm: enable MADV_FREE for swapless system Shaohua Li
  2017-02-27 15:06   ` Michal Hocko
@ 2017-02-28  3:22   ` Hillf Danton
  2017-02-28  5:02   ` Minchan Kim
  2 siblings, 0 replies; 44+ messages in thread
From: Hillf Danton @ 2017-02-28  3:22 UTC (permalink / raw)
  To: 'Shaohua Li', linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm


On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> Now MADV_FREE pages can be easily reclaimed even on a swapless system.
> We can safely enable MADV_FREE for all systems.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-02-24 21:31 ` [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps Shaohua Li
  2017-02-27 15:06   ` Michal Hocko
@ 2017-02-28  3:23   ` Hillf Danton
  2017-03-01 13:36   ` Michal Hocko
  2 siblings, 0 replies; 44+ messages in thread
From: Hillf Danton @ 2017-02-28  3:23 UTC (permalink / raw)
  To: 'Shaohua Li', linux-mm, linux-kernel
  Cc: Kernel-team, mhocko, minchan, hughd, hannes, riel, mgorman, akpm


On February 25, 2017 5:32 AM Shaohua Li wrote: 
> 
> Show MADV_FREE pages info for each vma in smaps. The interface is for
> diagnosis or monitoring purposes; userspace can use it to understand
> what happens in the application. Since userspace can dirty MADV_FREE
> pages without notice from the kernel, this interface is the only place
> we can get accurate accounting info about MADV_FREE pages.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 4/6] mm: reclaim MADV_FREE pages
  2017-02-27 16:19     ` Shaohua Li
  2017-02-27 16:32       ` Michal Hocko
@ 2017-02-28  5:02       ` Minchan Kim
  1 sibling, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2017-02-28  5:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

On Mon, Feb 27, 2017 at 08:19:08AM -0800, Shaohua Li wrote:
> On Mon, Feb 27, 2017 at 03:33:15PM +0900, Minchan Kim wrote:
> > Hi Shaohua,
> > 
> > On Fri, Feb 24, 2017 at 01:31:47PM -0800, Shaohua Li wrote:
> > > When memory pressure is high, we free MADV_FREE pages. If the pages
> > > are not dirty in the pte, the pages can be freed immediately.
> > > Otherwise we can't reclaim them. We put the pages back on the
> > > anonymous LRU list (by setting the SwapBacked flag) and the pages
> > > will be reclaimed in the normal swapout way.
> > > 
> > > We use the normal page reclaim policy. Since MADV_FREE pages are
> > > put into the inactive file list, such pages and inactive file pages
> > > are reclaimed according to their age. This is expected, because we
> > > don't want to reclaim too many MADV_FREE pages before used-once
> > > pages.
> > > 
> > > Based on Minchan's original patch
> > > 
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Minchan Kim <minchan@kernel.org>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Rik van Riel <riel@redhat.com>
> > > Cc: Mel Gorman <mgorman@techsingularity.net>
> > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > Signed-off-by: Shaohua Li <shli@fb.com>
> > > ---
> > >  include/linux/rmap.h |  2 +-
> > >  mm/huge_memory.c     |  2 ++
> > >  mm/madvise.c         |  1 +
> > >  mm/rmap.c            | 40 +++++++++++++++++-----------------------
> > >  mm/vmscan.c          | 34 ++++++++++++++++++++++------------
> > >  5 files changed, 43 insertions(+), 36 deletions(-)
> > > 
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index 7a39414..fee10d7 100644
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -298,6 +298,6 @@ static inline int page_mkclean(struct page *page)
> > >  #define SWAP_AGAIN	1
> > >  #define SWAP_FAIL	2
> > >  #define SWAP_MLOCK	3
> > > -#define SWAP_LZFREE	4
> > > +#define SWAP_DIRTY	4
> > 
> > I'm still not convinced that we should introduce SWAP_DIRTY in try_to_unmap.
> > https://marc.info/?l=linux-mm&m=148797879123238&w=2
> > 
> > We already do SetPageMlocked in there, so why can't we do SetPageSwapBacked
> > in there as well? It's not about changing the LRU type; it's just an
> > indication that we found out late that the page's status had changed.
> 
> I don't have a strong preference on this one. Personally I agree with
> Johannes that handling the failure in vmscan sounds better. But since the
> failure handling is just one statement, it probably doesn't make much
> difference. If you and Johannes reach an agreement, I'll follow.

I don't want to add an unnecessary new return value (i.e., SWAP_DIRTY).
If the VM finds a lazyfree page dirty in try_to_unmap_one, that means it is a
"non-swappable page", so it's natural to SetPageSwapBacked there and return
plain SWAP_FAIL to activate it in vmscan.c. SWAP_FAIL means the page is
non-swappable, so it should be activated. I don't see any problem with that
from a software engineering point of view. Roughly, what I mean is:
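
(A minimal sketch, assuming placement in try_to_unmap_one(); the locals and
the discard/out_unmap labels follow the general style of mm/rmap.c at the
time, not the exact V5 code.)

	if (PageAnon(page) && !PageSwapBacked(page)) {
		if (!PageDirty(page)) {
			/* Still-clean lazyfree page: discard it right away. */
			dec_mm_counter(mm, MM_ANONPAGES);
			goto discard;
		}
		/*
		 * Dirtied after MADV_FREE, so no longer lazily freeable:
		 * restore the pte, move the page back to the anonymous
		 * LRU, and report plain SWAP_FAIL so vmscan activates it
		 * like any other unswappable page.
		 */
		set_pte_at(mm, address, pte, pteval);
		SetPageSwapBacked(page);
		ret = SWAP_FAIL;
		goto out_unmap;
	}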

However, it seems everyone is happy with introducing SWAP_DIRTY, so I won't
insist on it; it's not critical for this patchset.
I looked over try_to_unmap and its callers. I now think we could remove
SWAP_MLOCK, and maybe SWAP_AGAIN as well as SWAP_DIRTY, which would make
try_to_unmap return a *bool*. That could be done in a separate patchset. I
will look into it some more.

Acked-by: Minchan Kim <minchan@kernel.org>

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 5/6] mm: enable MADV_FREE for swapless system
  2017-02-24 21:31 ` [PATCH V5 5/6] mm: enable MADV_FREE for swapless system Shaohua Li
  2017-02-27 15:06   ` Michal Hocko
  2017-02-28  3:22   ` Hillf Danton
@ 2017-02-28  5:02   ` Minchan Kim
  2 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2017-02-28  5:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, mhocko, hughd, hannes, riel,
	mgorman, akpm

On Fri, Feb 24, 2017 at 01:31:48PM -0800, Shaohua Li wrote:
> Now MADV_FREE pages can be easily reclaimed even on a swapless system. We
> can safely enable MADV_FREE for all systems.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Shaohua Li <shli@fb.com>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-02-24 21:31 ` [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps Shaohua Li
  2017-02-27 15:06   ` Michal Hocko
  2017-02-28  3:23   ` Hillf Danton
@ 2017-03-01 13:36   ` Michal Hocko
  2017-03-01 17:37     ` Shaohua Li
  2017-03-01 18:31     ` Johannes Weiner
  2 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2017-03-01 13:36 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> Show MADV_FREE pages info for each vma in smaps. The interface is for
> diagnostic or monitoring purposes; userspace can use it to understand
> what happens in the application. Since userspace can dirty MADV_FREE
> pages without notice from the kernel, this interface is the only place
> we can get accurate accounting info about MADV_FREE pages.

I have just got to test this patchset and noticed something that was a
bit surprising

madvise(mmap(len), len, MADV_FREE)
Size:             102400 kB
Rss:              102400 kB
Pss:              102400 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:    102400 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:        102400 kB
LazyFree:         102368 kB

It took me some time to realize that LazyFree is not accurate because
there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
is an implementation detail which shouldn't be visible to userspace.
Should we simply drain the pagevec? A crude way would be to call
lru_add_drain_all after we are done with the given range. We could also
make this specific to lru_lazyfree_pvecs, but I am not sure this is
worth the additional code.
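
(The 32 kB gap above, assuming 4 kB pages, is 8 pages still sitting in
pagevecs, i.e. less than a single full pagevec of PAGEVEC_SIZE = 14 pages.)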
---
diff --git a/mm/madvise.c b/mm/madvise.c
index dc5927c812d3..d2c318db16c9 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	madvise_free_page_range(&tlb, vma, start, end);
 	mmu_notifier_invalidate_range_end(mm, start, end);
 	tlb_finish_mmu(&tlb, start, end);
-
+	lru_add_drain_all();
 	return 0;
 }
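
For reference, a lazyfree-specific variant along the lines above might look
roughly like this (a sketch only: drain_lazyfree_pvec() is a hypothetical
helper, lru_lazyfree_fn is assumed to be the pagevec move callback added
earlier in this series, and pagevec_lru_move_fn() is static to mm/swap.c,
so the helper would have to live there):

/* Flush only the per-cpu lazyfree pagevecs instead of draining everything. */
static void drain_lazyfree_pvec(struct work_struct *work)
{
	struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);

	if (pagevec_count(pvec))
		pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
	put_cpu_var(lru_lazyfree_pvecs);
}

/* e.g. called instead of lru_add_drain_all() in madvise_free_single_vma(): */
schedule_on_each_cpu(drain_lazyfree_pvec);

Note that schedule_on_each_cpu() still queues work on every CPU and waits
for it, so this narrows what is flushed rather than avoiding the global
synchronization.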
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 13:36   ` Michal Hocko
@ 2017-03-01 17:37     ` Shaohua Li
  2017-03-01 17:49       ` Michal Hocko
  2017-03-01 18:31     ` Johannes Weiner
  1 sibling, 1 reply; 44+ messages in thread
From: Shaohua Li @ 2017-03-01 17:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > Show MADV_FREE pages info for each vma in smaps. The interface is for
> > diagnostic or monitoring purposes; userspace can use it to understand
> > what happens in the application. Since userspace can dirty MADV_FREE
> > pages without notice from the kernel, this interface is the only place
> > we can get accurate accounting info about MADV_FREE pages.
> 
> I have just got to test this patchset and noticed something that was a
> bit surprising
> 
> madvise(mmap(len), len, MADV_FREE)
> Size:             102400 kB
> Rss:              102400 kB
> Pss:              102400 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:    102400 kB
> Private_Dirty:         0 kB
> Referenced:            0 kB
> Anonymous:        102400 kB
> LazyFree:         102368 kB
> 
> It took me some time to realize that LazyFree is not accurate because
> there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> is an implementation detail which shouldn't be visible to userspace.
> Should we simply drain the pagevec? A crude way would be to call
> lru_add_drain_all after we are done with the given range. We could also
> make this specific to lru_lazyfree_pvecs, but I am not sure this is
> worth the additional code.

Minchan's original patch included a drain of the pvec. I dropped it because I
don't think it's worth the effort. There isn't much memory in the per-cpu vecs.
Like you said, I doubt this is noticeable to userspace.

Thanks,
Shaohua


> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c812d3..d2c318db16c9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  	madvise_free_page_range(&tlb, vma, start, end);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
>  	tlb_finish_mmu(&tlb, start, end);
> -
> +	lru_add_drain_all();
>  	return 0;
>  }
>  
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 17:37     ` Shaohua Li
@ 2017-03-01 17:49       ` Michal Hocko
  2017-03-01 18:18         ` Shaohua Li
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2017-03-01 17:49 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Wed 01-03-17 09:37:10, Shaohua Li wrote:
> On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > Show MADV_FREE pages info for each vma in smaps. The interface is for
> > > diagnostic or monitoring purposes; userspace can use it to understand
> > > what happens in the application. Since userspace can dirty MADV_FREE
> > > pages without notice from the kernel, this interface is the only place
> > > we can get accurate accounting info about MADV_FREE pages.
> > 
> > I have just got to test this patchset and noticed something that was a
> > bit surprising
> > 
> > madvise(mmap(len), len, MADV_FREE)
> > Size:             102400 kB
> > Rss:              102400 kB
> > Pss:              102400 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:    102400 kB
> > Private_Dirty:         0 kB
> > Referenced:            0 kB
> > Anonymous:        102400 kB
> > LazyFree:         102368 kB
> > 
> > It took me some time to realize that LazyFree is not accurate because
> > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > is an implementation detail which shouldn't be visible to userspace.
> > Should we simply drain the pagevec? A crude way would be to call
> > lru_add_drain_all after we are done with the given range. We could also
> > make this specific to lru_lazyfree_pvecs, but I am not sure this is
> > worth the additional code.
> 
> Minchan's original patch included a drain of the pvec. I dropped it because I
> don't think it's worth the effort. There isn't much memory in the per-cpu vecs.

but multiply that by the number of CPUs.

> Like you said, I doubt this is noticeable to userspace.

maybe I wasn't clear enough. I've noticed and I expect others would as
well. We really shouldn't leak implementation details like that. So I
_believe_ this should be fixed. Draining all pagevecs is rather coarse
but it is the simplest thing to do. If you do not want to fold this
into the original patch I can send a standalone one. Or do you have any
concerns about draining?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 17:49       ` Michal Hocko
@ 2017-03-01 18:18         ` Shaohua Li
  0 siblings, 0 replies; 44+ messages in thread
From: Shaohua Li @ 2017-03-01 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Kernel-team, minchan, hughd, hannes,
	riel, mgorman, akpm

On Wed, Mar 01, 2017 at 06:49:56PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 09:37:10, Shaohua Li wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > > Show MADV_FREE pages info for each vma in smaps. The interface is for
> > > > diagnostic or monitoring purposes; userspace can use it to understand
> > > > what happens in the application. Since userspace can dirty MADV_FREE
> > > > pages without notice from the kernel, this interface is the only place
> > > > we can get accurate accounting info about MADV_FREE pages.
> > > 
> > > I have just got to test this patchset and noticed something that was a
> > > bit surprising
> > > 
> > > madvise(mmap(len), len, MADV_FREE)
> > > Size:             102400 kB
> > > Rss:              102400 kB
> > > Pss:              102400 kB
> > > Shared_Clean:          0 kB
> > > Shared_Dirty:          0 kB
> > > Private_Clean:    102400 kB
> > > Private_Dirty:         0 kB
> > > Referenced:            0 kB
> > > Anonymous:        102400 kB
> > > LazyFree:         102368 kB
> > > 
> > > It took me some time to realize that LazyFree is not accurate because
> > > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > > is an implementation detail which shouldn't be visible to userspace.
> > > Should we simply drain the pagevec? A crude way would be to call
> > > lru_add_drain_all after we are done with the given range. We could also
> > > make this specific to lru_lazyfree_pvecs, but I am not sure this is
> > > worth the additional code.
> > 
> > Minchan's original patch included a drain of the pvec. I dropped it because I
> > don't think it's worth the effort. There isn't much memory in the per-cpu vecs.
> 
> but multiply that by the number of CPUs.
> 
> > Like you said, I doubt this is noticeable to userspace.
> 
> maybe I wasn't clear enough. I've noticed and I expect others would as
> well. We really shouldn't leak implementation details like that. So I
> _believe_ this should be fixed. Draining all pagevecs is rather coarse
> but it is the simplest thing to do. If you do not want to fold this
> into the original patch I can send a standalone one. Or do you have any
> concerns about draining?

No, no objection at all. I just doubt it's worthwhile. It looks like nobody
has complained about a similar issue. For example, deactivate_file_page does
a similar thing, and then the smaps 'Referenced' count could be inaccurate.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 13:36   ` Michal Hocko
  2017-03-01 17:37     ` Shaohua Li
@ 2017-03-01 18:31     ` Johannes Weiner
  2017-03-01 18:57       ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2017-03-01 18:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, minchan, hughd,
	riel, mgorman, akpm

On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > Show MADV_FREE pages info for each vma in smaps. The interface is for
> > diagnostic or monitoring purposes; userspace can use it to understand
> > what happens in the application. Since userspace can dirty MADV_FREE
> > pages without notice from the kernel, this interface is the only place
> > we can get accurate accounting info about MADV_FREE pages.
> 
> I have just got to test this patchset and noticed something that was a
> bit surprising
> 
> madvise(mmap(len), len, MADV_FREE)
> Size:             102400 kB
> Rss:              102400 kB
> Pss:              102400 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:    102400 kB
> Private_Dirty:         0 kB
> Referenced:            0 kB
> Anonymous:        102400 kB
> LazyFree:         102368 kB
> 
> It took me some time to realize that LazyFree is not accurate because
> there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> is an implementation detail which shouldn't be visible to userspace.
> Should we simply drain the pagevec? A crude way would be to call
> lru_add_drain_all after we are done with the given range. We could also
> make this specific to lru_lazyfree_pvecs, but I am not sure this is
> worth the additional code.
> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index dc5927c812d3..d2c318db16c9 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  	madvise_free_page_range(&tlb, vma, start, end);
>  	mmu_notifier_invalidate_range_end(mm, start, end);
>  	tlb_finish_mmu(&tlb, start, end);
> -
> +	lru_add_drain_all();

A full drain on all CPUs is very expensive and IMO not justified for
some per-cpu fuzz factor in the stats. I'd take hampering the stats
over hampering the syscall any day; only a subset of MADV_FREE users
will look at the stats.

And while the aggregate error can be large on machines with many CPUs
(notably the machines on which you absolutely don't want to send IPIs
to all cores each time a thread madvises some pages!), the pages of a
single process are not likely to be spread out across more than a few
CPUs. The error when reading a specific smaps should be completely ok.

In numbers: even if your process is madvising from 16 different CPUs,
the error in its smaps file will peak at 896K in the worst case. That
level of concurrency tends to come with much bigger memory quantities
for that amount of error to matter.
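
(For reference, that worst case is just the pagevec capacity: a pagevec
holds PAGEVEC_SIZE = 14 pages, so with 4K pages each CPU can hold back at
most 14 * 4K = 56K, and 16 CPUs * 56K = 896K.)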

IMO this is a non-issue.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 18:31     ` Johannes Weiner
@ 2017-03-01 18:57       ` Michal Hocko
  2017-03-02  7:39         ` Minchan Kim
  2017-03-02 14:01         ` Johannes Weiner
  0 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2017-03-01 18:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, minchan, hughd,
	riel, mgorman, akpm

On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > Show MADV_FREE pages info for each vma in smaps. The interface is for
> > > diagnostic or monitoring purposes; userspace can use it to understand
> > > what happens in the application. Since userspace can dirty MADV_FREE
> > > pages without notice from the kernel, this interface is the only place
> > > we can get accurate accounting info about MADV_FREE pages.
> > 
> > I have just got to test this patchset and noticed something that was a
> > bit surprising
> > 
> > madvise(mmap(len), len, MADV_FREE)
> > Size:             102400 kB
> > Rss:              102400 kB
> > Pss:              102400 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:    102400 kB
> > Private_Dirty:         0 kB
> > Referenced:            0 kB
> > Anonymous:        102400 kB
> > LazyFree:         102368 kB
> > 
> > It took me some time to realize that LazyFree is not accurate because
> > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > is an implementation detail which shouldn't be visible to userspace.
> > Should we simply drain the pagevec? A crude way would be to call
> > lru_add_drain_all after we are done with the given range. We could also
> > make this specific to lru_lazyfree_pvecs, but I am not sure this is
> > worth the additional code.
> > ---
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index dc5927c812d3..d2c318db16c9 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> >  	madvise_free_page_range(&tlb, vma, start, end);
> >  	mmu_notifier_invalidate_range_end(mm, start, end);
> >  	tlb_finish_mmu(&tlb, start, end);
> > -
> > +	lru_add_drain_all();
> 
> A full drain on all CPUs is very expensive and IMO not justified for
> some per-cpu fuzz factor in the stats. I'd take hampering the stats
> over hampering the syscall any day; only a subset of MADV_FREE users
> will look at the stats.
> 
> And while the aggregate error can be large on machines with many CPUs
> (notably the machines on which you absolutely don't want to send IPIs
> to all cores each time a thread madvises some pages!),

I am not sure I understand. Where would we trigger IPIs?
lru_add_drain_all relies on workqueues.

> the pages of a
> single process are not likely to be spread out across more than a few
> CPUs.

Then we can simply flush only lru_lazyfree_pvecs, which should reduce
the unrelated noise from other pagevecs.

> The error when reading a specific smaps should be completely ok.
> 
> In numbers: even if your process is madvising from 16 different CPUs,
> the error in its smaps file will peak at 896K in the worst case. That
> level of concurrency tends to come with much bigger memory quantities
> for that amount of error to matter.

It is still an unexpected behavior IMHO and an implementation detail
which leaks to the userspace.
 
> IMO this is a non-issue.

I will not insist if there is a general consensus on this and it is a
documented behavior, though. 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 18:57       ` Michal Hocko
@ 2017-03-02  7:39         ` Minchan Kim
  2017-03-02 14:01         ` Johannes Weiner
  1 sibling, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2017-03-02  7:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shaohua Li, linux-mm, linux-kernel, Kernel-team,
	hughd, riel, mgorman, akpm

On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > On Fri 24-02-17 13:31:49, Shaohua Li wrote:
> > > > Show MADV_FREE pages info for each vma in smaps. The interface is for
> > > > diagnostic or monitoring purposes; userspace can use it to understand
> > > > what happens in the application. Since userspace can dirty MADV_FREE
> > > > pages without notice from the kernel, this interface is the only place
> > > > we can get accurate accounting info about MADV_FREE pages.
> > > 
> > > I have just got to test this patchset and noticed something that was a
> > > bit surprising
> > > 
> > > madvise(mmap(len), len, MADV_FREE)
> > > Size:             102400 kB
> > > Rss:              102400 kB
> > > Pss:              102400 kB
> > > Shared_Clean:          0 kB
> > > Shared_Dirty:          0 kB
> > > Private_Clean:    102400 kB
> > > Private_Dirty:         0 kB
> > > Referenced:            0 kB
> > > Anonymous:        102400 kB
> > > LazyFree:         102368 kB
> > > 
> > > It took me some time to realize that LazyFree is not accurate because
> > > there are still pages on the per-cpu lru_lazyfree_pvecs. I believe this
> > > is an implementation detail which shouldn't be visible to userspace.
> > > Should we simply drain the pagevec? A crude way would be to call
> > > lru_add_drain_all after we are done with the given range. We could also
> > > make this specific to lru_lazyfree_pvecs, but I am not sure this is
> > > worth the additional code.
> > > ---
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index dc5927c812d3..d2c318db16c9 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > >  	madvise_free_page_range(&tlb, vma, start, end);
> > >  	mmu_notifier_invalidate_range_end(mm, start, end);
> > >  	tlb_finish_mmu(&tlb, start, end);
> > > -
> > > +	lru_add_drain_all();
> > 
> > A full drain on all CPUs is very expensive and IMO not justified for
> > some per-cpu fuzz factor in the stats. I'd take hampering the stats
> > over hampering the syscall any day; only a subset of MADV_FREE users
> > will look at the stats.
> > 
> > And while the aggregate error can be large on machines with many CPUs
> > (notably the machines on which you absolutely don't want to send IPIs
> > to all cores each time a thread madvises some pages!),
> 
> I am not sure I understand. Where would we trigger IPIs?
> lru_add_drain_all relies on workqueues.
> 
> > the pages of a
> > single process are not likely to be spread out across more than a few
> > CPUs.
> 
> Then we can simply flush only lru_lazyfree_pvecs, which should reduce
> the unrelated noise from other pagevecs.
> 
> > The error when reading a specific smaps should be completely ok.
> > 
> > In numbers: even if your process is madvising from 16 different CPUs,
> > the error in its smaps file will peak at 896K in the worst case. That
> > level of concurrency tends to come with much bigger memory quantities
> > for that amount of error to matter.
> 
> It is still an unexpected behavior IMHO and an implementation detail
> which leaks to the userspace.
>  
> > IMO this is a non-issue.
> 
> I will not insist if there is a general consensus on this and it is a
> documented behavior, though. 

Even with draining we cannot guarantee that, because madvise_free can
easily miss some pages under several conditions.
First of all, userspace can never know how many pages are mapped in the
range at the moment. As well, a page in the range can be swapped out or
under migration, or can fail trylock_page, and so on; a rough sketch of
those skip paths is below.
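
(Condensed and paraphrased from madvise_free_pte_range() in this series,
not the verbatim code; mark_page_lazyfree() is the helper name used by the
merged version:)

	for (; addr != end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = *pte;

		if (!pte_present(ptent))	/* unmapped or swapped out */
			continue;
		page = vm_normal_page(vma, addr, ptent);
		if (!page)			/* special pfn mapping */
			continue;
		if (PageSwapCache(page) || PageDirty(page)) {
			if (!trylock_page(page))	/* e.g. racing migration */
				continue;
			/* shared page: we must not clear PG_dirty */
			if (page_mapcount(page) != 1) {
				unlock_page(page);
				continue;
			}
			if (PageSwapCache(page) && !try_to_free_swap(page)) {
				unlock_page(page);
				continue;
			}
			ClearPageDirty(page);
			unlock_page(page);
		}
		mark_page_lazyfree(page);	/* only survivors are marked */
	}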

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-01 18:57       ` Michal Hocko
  2017-03-02  7:39         ` Minchan Kim
@ 2017-03-02 14:01         ` Johannes Weiner
  2017-03-02 16:30           ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Johannes Weiner @ 2017-03-02 14:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, minchan, hughd,
	riel, mgorman, akpm

On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
> > On Wed, Mar 01, 2017 at 02:36:24PM +0100, Michal Hocko wrote:
> > > @@ -474,7 +474,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
> > >  	madvise_free_page_range(&tlb, vma, start, end);
> > >  	mmu_notifier_invalidate_range_end(mm, start, end);
> > >  	tlb_finish_mmu(&tlb, start, end);
> > > -
> > > +	lru_add_drain_all();
> > 
> > A full drain on all CPUs is very expensive and IMO not justified for
> > some per-cpu fuzz factor in the stats. I'd take hampering the stats
> > over hampering the syscall any day; only a subset of MADV_FREE users
> > will look at the stats.
> > 
> > And while the aggregate error can be large on machines with many CPUs
> > (notably the machines on which you absolutely don't want to send IPIs
> > to all cores each time a thread madvises some pages!),
> 
> I am not sure I understand. Where would we trigger IPIs?
> lru_add_drain_all relies on workqueues.

Brainfart on my end, s,IPIs,sync work items,.

That doesn't change my point, though. These things are expensive, and
we had scalability issues with them in the past. See for example
4dd72b4a47a5 ("mm: fadvise: avoid expensive remote LRU cache draining
after FADV_DONTNEED").

> > the pages of a
> > single process are not likely to be spread out across more than a few
> > CPUs.
> 
> Then we can simply flush only lru_lazyfree_pvecs, which should reduce
> the unrelated noise from other pagevecs.

The problem isn't flushing other pagevecs once we're already scheduled
on a CPU; the problem is scheduling work on all CPUs and then waiting
for completion.

> > The error when reading a specific smaps should be completely ok.
> > 
> > In numbers: even if your process is madvising from 16 different CPUs,
> > the error in its smaps file will peak at 896K in the worst case. That
> > level of concurrency tends to come with much bigger memory quantities
> > for that amount of error to matter.
> 
> It is still an unexpected behavior IMHO and an implementation detail
> which leaks to the userspace.

We have per-cpu fuzz in every single vmstat counter. Look at
calculate_normal_threshold() in vmstat.c and the sample thresholds for
when per-cpu deltas are flushed. In the vast majority of machines, the
per-cpu error in these counters is much higher than what we get with
pagevecs holding back a few pages.
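
For illustration, that batching looks roughly like this (simplified from
mm/vmstat.c of that era; NUMA and interrupt-context variants omitted):

void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
			   long delta)
{
	struct per_cpu_pageset __percpu *pcp = zone->pageset;
	s8 __percpu *p = pcp->vm_stat_diff + item;
	long x = delta + __this_cpu_read(*p);
	long t = __this_cpu_read(pcp->stat_threshold);

	/*
	 * The global counter is only updated once the per-cpu delta
	 * crosses the sampling threshold, so a reader can be off by
	 * up to (threshold * number of CPUs) at any moment.
	 */
	if (unlikely(x > t || x < -t)) {
		zone_page_state_add(x, zone, item);
		x = 0;
	}
	__this_cpu_write(*p, x);
}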

It's not that I think you're wrong: it *is* an implementation detail.
But we take a bit of incoherency from batching all over the place, so
it's a little odd to take a stand over this particular instance of it
- whether demanding that it'd be fixed, or be documented, which would
only suggest to users that this is special when it really isn't etc.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-02 14:01         ` Johannes Weiner
@ 2017-03-02 16:30           ` Michal Hocko
  2017-03-04  0:10             ` Andrew Morton
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2017-03-02 16:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shaohua Li, linux-mm, linux-kernel, Kernel-team, minchan, hughd,
	riel, mgorman, akpm

On Thu 02-03-17 09:01:01, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 07:57:35PM +0100, Michal Hocko wrote:
> > On Wed 01-03-17 13:31:49, Johannes Weiner wrote:
[...]
> > > The error when reading a specific smaps should be completely ok.
> > > 
> > > In numbers: even if your process is madvising from 16 different CPUs,
> > > the error in its smaps file will peak at 896K in the worst case. That
> > > level of concurrency tends to come with much bigger memory quantities
> > > for that amount of error to matter.
> > 
> > It is still an unexpected behavior IMHO and an implementation detail
> > which leaks to the userspace.
> 
> We have per-cpu fuzz in every single vmstat counter. Look at
> calculate_normal_threshold() in vmstat.c and the sample thresholds for
> when per-cpu deltas are flushed. In the vast majority of machines, the
> per-cpu error in these counters is much higher than what we get with
> pagevecs holding back a few pages.

Yes, but vmstat counters have a different use case AFAIK. You mostly look
at those when debugging or watching the system. /proc/<pid>/smaps is
quite often used to gather per-task metrics which are then used for some
decision making, so it should be less fuzzy if that is possible.

> It's not that I think you're wrong: it *is* an implementation detail.
> But we take a bit of incoherency from batching all over the place, so
> it's a little odd to take a stand over this particular instance of it
> - whether demanding that it'd be fixed, or be documented, which would
> only suggest to users that this is special when it really isn't etc.

I am not aware of any other counter printed in smaps that would suffer
from the same problem, but I haven't checked too deeply, so I might be wrong.

Anyway it seems that I am alone in my position so I will not insist.
If we have any bug report then we can still fix it.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-02 16:30           ` Michal Hocko
@ 2017-03-04  0:10             ` Andrew Morton
  2017-03-07 10:05               ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2017-03-04  0:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shaohua Li, linux-mm, linux-kernel, Kernel-team,
	minchan, hughd, riel, mgorman

On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> > It's not that I think you're wrong: it *is* an implementation detail.
> > But we take a bit of incoherency from batching all over the place, so
> > it's a little odd to take a stand over this particular instance of it
> > - whether demanding that it'd be fixed, or be documented, which would
> > only suggest to users that this is special when it really isn't etc.
> 
> I am not aware of any other counter printed in smaps that would suffer
> from the same problem, but I haven't checked too deeply, so I might be wrong.
> 
> Anyway it seems that I am alone in my position so I will not insist.
> If we have any bug report then we can still fix it.

A single lru_add_drain_all() right at the top level (in smaps_show()?)
won't kill us and should significantly improve this issue.  And it
might accidentally make some of the other smaps statistics more
accurate as well.
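
For concreteness, a sketch of that option (assumptions: the placement and
the do_show_smap() name are illustrative, not the real fs/proc/task_mmu.c
signatures):

static int show_smap(struct seq_file *m, void *v)
{
	lru_add_drain_all();	/* fold per-cpu pagevecs into the LRUs */
	return do_show_smap(m, v);	/* assumed name for the existing walk */
}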

If not, can we please have a nice comment somewhere appropriate which
explains why LazyFree is inaccurate and why we chose to leave it that
way?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-04  0:10             ` Andrew Morton
@ 2017-03-07 10:05               ` Michal Hocko
  2017-03-07 22:43                 ` Andrew Morton
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2017-03-07 10:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Shaohua Li, linux-mm, linux-kernel, Kernel-team,
	minchan, hughd, riel, mgorman

On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > It's not that I think you're wrong: it *is* an implementation detail.
> > > But we take a bit of incoherency from batching all over the place, so
> > > it's a little odd to take a stand over this particular instance of it
> > > - whether demanding that it'd be fixed, or be documented, which would
> > > only suggest to users that this is special when it really isn't etc.
> > 
> > I am not aware of any other counter printed in smaps that would suffer
> > from the same problem, but I haven't checked too deeply, so I might be wrong.
> > 
> > Anyway it seems that I am alone in my position so I will not insist.
> > If we have any bug report then we can still fix it.
> 
> A single lru_add_drain_all() right at the top level (in smaps_show()?)
> won't kill us

I do not think we want to put the lru_add_drain_all cost on a random
process reading /proc/<pid>/smaps. If anything, the one which does the
madvise should be doing this.

> and should significantly improve this issue.  And it
> might accidentally make some of the other smaps statistics more
> accurate as well.
> 
> If not, can we please have a nice comment somewhere appropriate which
> explains why LazyFree is inaccurate and why we chose to leave it that
> way?

Yeah, I would appreciate the comment more. What about the following
---
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 45853e116eef..0b58b317dc76 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -444,7 +444,9 @@ a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
 and a page is modified, the file page is replaced by a private anonymous copy.
 "LazyFree" shows the amount of memory which is marked by madvise(MADV_FREE).
 The memory isn't freed immediately with madvise(). It's freed in memory
-pressure if the memory is clean.
+pressure if the memory is clean. Please note that the printed value might
+be lower than the real value due to optimizations used in the current
+implementation. If this is not desirable please file a bug report.
 "AnonHugePages" shows the ammount of memory backed by transparent hugepage.
 "ShmemPmdMapped" shows the ammount of shared (shmem/tmpfs) memory backed by
 huge pages.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-07 10:05               ` Michal Hocko
@ 2017-03-07 22:43                 ` Andrew Morton
  2017-03-08  5:36                   ` Minchan Kim
  0 siblings, 1 reply; 44+ messages in thread
From: Andrew Morton @ 2017-03-07 22:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Shaohua Li, linux-mm, linux-kernel, Kernel-team,
	minchan, hughd, riel, mgorman

On Tue, 7 Mar 2017 11:05:45 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> > On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > > It's not that I think you're wrong: it *is* an implementation detail.
> > > > But we take a bit of incoherency from batching all over the place, so
> > > > it's a little odd to take a stand over this particular instance of it
> > > > - whether demanding that it'd be fixed, or be documented, which would
> > > > only suggest to users that this is special when it really isn't etc.
> > > 
> > > I am not aware of any other counter printed in smaps that would suffer
> > > from the same problem, but I haven't checked too deeply, so I might be wrong.
> > > 
> > > Anyway it seems that I am alone in my position so I will not insist.
> > > If we have any bug report then we can still fix it.
> > 
> > A single lru_add_drain_all() right at the top level (in smaps_show()?)
> > won't kill us
> 
> I do not think we want to put the lru_add_drain_all cost on a random
> process reading /proc/<pid>/smaps.

Why not?  It's that process which is calling for the work to be done.

> If anything, the one which does the
> madvise should be doing this.

But it would be silly to do extra work in madvise() if nobody will be
reading smaps for the next two months.

How much work is it anyway?  What would be the relative impact upon a
smaps read?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps
  2017-03-07 22:43                 ` Andrew Morton
@ 2017-03-08  5:36                   ` Minchan Kim
  0 siblings, 0 replies; 44+ messages in thread
From: Minchan Kim @ 2017-03-08  5:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Johannes Weiner, Shaohua Li, linux-mm,
	linux-kernel, Kernel-team, hughd, riel, mgorman

Hi Andrew,

On Tue, Mar 07, 2017 at 02:43:38PM -0800, Andrew Morton wrote:
> On Tue, 7 Mar 2017 11:05:45 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 03-03-17 16:10:27, Andrew Morton wrote:
> > > On Thu, 2 Mar 2017 17:30:54 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > > > It's not that I think you're wrong: it *is* an implementation detail.
> > > > > But we take a bit of incoherency from batching all over the place, so
> > > > > it's a little odd to take a stand over this particular instance of it
> > > > > - whether demanding that it'd be fixed, or be documented, which would
> > > > > only suggest to users that this is special when it really isn't etc.
> > > > 
> > > > I am not aware of any other counter printed in smaps that would suffer
> > > > from the same problem, but I haven't checked too deeply, so I might be wrong.
> > > > 
> > > > Anyway it seems that I am alone in my position so I will not insist.
> > > > If we have any bug report then we can still fix it.
> > > 
> > > A single lru_add_drain_all() right at the top level (in smaps_show()?)
> > > won't kill us
> > 
> > I do not think we want to put the lru_add_drain_all cost on a random
> > process reading /proc/<pid>/smaps.
> 
> Why not?  It's that process which is calling for the work to be done.
> 
> > If anything, the one which does the
> > madvise should be doing this.
> 
> But it would be silly to do extra work in madvise() if nobody will be
> reading smaps for the next two months.
> 
> How much work is it anyway?  What would be the relative impact upon a
> smaps read?

I would agree only if draining guaranteed that all of the mapped pages in
the range could be marked lazyfree. However, that's not true, because
there are a few pieces of logic in madvise_free_pte_range that skip the
page marking.

So my conclusion is that draining helps a bit but guarantees nothing.
In that case, IMHO, let's not make the effort.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2017-03-08  5:51 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-24 21:31 [PATCH V5 0/6] mm: fix some MADV_FREE issues Shaohua Li
2017-02-24 21:31 ` [PATCH V5 1/6] mm: delete unnecessary TTU_* flags Shaohua Li
2017-02-27 13:48   ` Michal Hocko
2017-02-24 21:31 ` [PATCH V5 2/6] mm: don't assume anonymous pages have SwapBacked flag Shaohua Li
2017-02-27  6:48   ` Hillf Danton
2017-02-27 14:35   ` Michal Hocko
2017-02-27 16:10     ` Shaohua Li
2017-02-27 16:28       ` Michal Hocko
2017-02-24 21:31 ` [PATCH V5 3/6] mm: move MADV_FREE pages into LRU_INACTIVE_FILE list Shaohua Li
2017-02-27  6:28   ` Minchan Kim
2017-02-27 16:13     ` Shaohua Li
2017-02-27 16:30       ` Michal Hocko
2017-02-28  2:53       ` Minchan Kim
2017-02-27 14:53   ` Michal Hocko
2017-02-27 17:15   ` Johannes Weiner
2017-02-28  3:19   ` Hillf Danton
2017-02-24 21:31 ` [PATCH V5 4/6] mm: reclaim MADV_FREE pages Shaohua Li
2017-02-27  6:33   ` Minchan Kim
2017-02-27 16:19     ` Shaohua Li
2017-02-27 16:32       ` Michal Hocko
2017-02-28  5:02       ` Minchan Kim
2017-02-27 15:05   ` Michal Hocko
2017-02-27 17:21   ` Johannes Weiner
2017-02-28  3:21   ` Hillf Danton
2017-02-24 21:31 ` [PATCH V5 5/6] mm: enable MADV_FREE for swapless system Shaohua Li
2017-02-27 15:06   ` Michal Hocko
2017-02-28  3:22   ` Hillf Danton
2017-02-28  5:02   ` Minchan Kim
2017-02-24 21:31 ` [PATCH V5 6/6] proc: show MADV_FREE pages info in smaps Shaohua Li
2017-02-27 15:06   ` Michal Hocko
2017-02-28  3:23   ` Hillf Danton
2017-03-01 13:36   ` Michal Hocko
2017-03-01 17:37     ` Shaohua Li
2017-03-01 17:49       ` Michal Hocko
2017-03-01 18:18         ` Shaohua Li
2017-03-01 18:31     ` Johannes Weiner
2017-03-01 18:57       ` Michal Hocko
2017-03-02  7:39         ` Minchan Kim
2017-03-02 14:01         ` Johannes Weiner
2017-03-02 16:30           ` Michal Hocko
2017-03-04  0:10             ` Andrew Morton
2017-03-07 10:05               ` Michal Hocko
2017-03-07 22:43                 ` Andrew Morton
2017-03-08  5:36                   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).