linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 1/4] mm: throttle MADV_FREE
@ 2015-02-24  8:18 Minchan Kim
  2015-02-24  8:18 ` [PATCH RFC 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 42+ messages in thread
From: Minchan Kim @ 2015-02-24  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Rik van Riel, Michal Hocko,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang, Minchan Kim

Recently, Shaohua reported that MADV_FREE is much slower than
MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
applications stalled in direct reclaim because kswapd's reclaim
speed couldn't keep up with the applications' allocation speed,
which caused lots of stalls and lock contention.

This patch throttles MADV_FREEing so it works only if there are
enough free pages in the system that background/direct reclaim
will not be triggered. Otherwise, MADV_FREE falls back to
MADV_DONTNEED because there is no point in delaying freeing if we
know the system is under memory pressure.

When I tested the patch on my machine (3G RAM, 12 CPUs, 8G swap)
with 12 processes running:

loop = 5;
mmap(512M);
while (loop--) {
	memset(512M);
	madvise(MADV_FREE or MADV_DONTNEED);
}

1) dontneed: 6.78user 234.09system 0:48.89elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed

It's a clear win.

Reported-by: Shaohua Li <shli@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8921c2..81bb26ecf064 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		 * XXX: In this implementation, MADV_FREE works like
 		 * MADV_DONTNEED on swapless system or full swap.
 		 */
-		if (get_nr_swap_pages() > 0)
-			return madvise_free(vma, prev, start, end);
+		if (get_nr_swap_pages() > 0) {
+			unsigned long threshold;
+			/*
+			 * If we have trouble with memory pressure (ie,
+			 * under high watermark), free pages instantly.
+			 */
+			threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
+			threshold = threshold + (threshold >> 1);
+			if (nr_free_pages() > threshold)
+				return madvise_free(vma, prev, start, end);
+		}
 		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
-- 
1.9.1


42+ messages in thread

* [PATCH RFC 2/4] mm: change deactivate_page with deactivate_file_page
  2015-02-24  8:18 [PATCH RFC 1/4] mm: throttle MADV_FREE Minchan Kim
@ 2015-02-24  8:18 ` Minchan Kim
  2015-02-24  8:18 ` [PATCH RFC 3/4] mm: move lazy free pages to inactive list Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2015-02-24  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Rik van Riel, Michal Hocko,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang, Minchan Kim

deactivate_page() was born for file invalidation, so it has logic
too specific to file-backed pages.
Make the function's name file-specific.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  2 +-
 mm/swap.c            | 20 ++++++++++----------
 mm/truncate.c        |  2 +-
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7067eca501e2..cee108cbe2d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void deactivate_file_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index cd3a5e64cea9..5b2a60578f9c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -42,7 +42,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -743,7 +743,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
  * be write it out by flusher threads as this is much more effective
  * than the single-page writeout from reclaim.
  */
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 			      void *arg)
 {
 	int lru, file;
@@ -811,22 +811,22 @@ void lru_add_drain_cpu(int cpu)
 		local_irq_restore(flags);
 	}
 
-	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
 	if (pagevec_count(pvec))
-		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
 	activate_page_drain(cpu);
 }
 
 /**
- * deactivate_page - forcefully deactivate a page
+ * deactivate_file_page - forcefully deactivate a file page
  * @page: page to deactivate
  *
  * This function hints the VM that @page is a good reclaim candidate,
  * for example if its invalidation fails due to the page being dirty
  * or under writeback.
  */
-void deactivate_page(struct page *page)
+void deactivate_file_page(struct page *page)
 {
 	/*
 	 * In a workload with many unevictable page such as mprotect, unevictable
@@ -836,11 +836,11 @@ void deactivate_page(struct page *page)
 		return;
 
 	if (likely(get_page_unless_zero(page))) {
-		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);
 
 		if (!pagevec_add(pvec, page))
-			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
-		put_cpu_var(lru_deactivate_pvecs);
+			pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+		put_cpu_var(lru_deactivate_file_pvecs);
 	}
 }
 
@@ -872,7 +872,7 @@ void lru_add_drain_all(void)
 
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
-		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
diff --git a/mm/truncate.c b/mm/truncate.c
index ddec5a5966d7..a0619c492f17 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -513,7 +513,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			 * of interest and try to speed up its reclaim.
 			 */
 			if (!ret)
-				deactivate_page(page);
+				deactivate_file_page(page);
 			count += ret;
 		}
 		pagevec_remove_exceptionals(&pvec);
-- 
1.9.1



* [PATCH RFC 3/4] mm: move lazy free pages to inactive list
  2015-02-24  8:18 [PATCH RFC 1/4] mm: throttle MADV_FREE Minchan Kim
  2015-02-24  8:18 ` [PATCH RFC 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
@ 2015-02-24  8:18 ` Minchan Kim
  2015-02-24 16:14   ` Michal Hocko
  2015-02-24  8:18 ` [PATCH RFC 4/4] mm: support MADV_FREE in swapless system Minchan Kim
  2015-02-24 15:43 ` [PATCH RFC 1/4] mm: throttle MADV_FREE Michal Hocko
  3 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-24  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Rik van Riel, Michal Hocko,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang, Minchan Kim

MADV_FREE is hint that it's okay to discard pages if memory is
pressure and we uses reclaimers(ie, kswapd and direct reclaim)
to free them so there is no worth to remain them in active
anonymous LRU list so this patch moves them to inactive LRU list.

An arguable issue for the approach is whether we should put it at
the head or tail of the inactive list. I selected the head because
the kernel cannot make sure pages are really cold or warm for every
usecase, but at least we know they are not hot, so landing at the
inactive head is a compromise compared to staying in the active LRU.
As well, if we put recently hinted pages at the inactive list's tail,
the VM could discard cache-hot pages, which would be bad.

As a bonus, we don't need to move them back and forth in the
inactive list whenever the MADV_FREE syscall is called.

As a drawback, the VM will scan more pages in the inactive
anonymous LRU to discard, but that has always happened whenever
recent references hit pages on the inactive LRU list, so I don't
think it's a main drawback.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/swap.h |  1 +
 mm/madvise.c         |  2 ++
 mm/swap.c            | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index cee108cbe2d5..0428e4c84e1d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 81bb26ecf064..6176039c69e4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -324,6 +324,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		ptent = pte_mkold(ptent);
 		ptent = pte_mkclean(ptent);
 		set_pte_at(mm, addr, pte, ptent);
+		if (PageActive(page))
+			deactivate_page(page);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
 	arch_leave_lazy_mmu_mode();
diff --git a/mm/swap.c b/mm/swap.c
index 5b2a60578f9c..393968c33667 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -43,6 +43,7 @@ int page_cluster;
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
 	update_page_reclaim_stat(lruvec, file, 0);
 }
 
+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+			    void *arg)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		int file = page_is_file_cache(page);
+		int lru = page_lru_base_type(page);
+
+		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+		ClearPageActive(page);
+		add_page_to_lru_list(page, lruvec, lru);
+
+		__count_vm_event(PGDEACTIVATE);
+		update_page_reclaim_stat(lruvec, file, 0);
+	}
+}
+
 /*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
@@ -815,6 +833,10 @@ void lru_add_drain_cpu(int cpu)
 	if (pagevec_count(pvec))
 		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
 
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
 	activate_page_drain(cpu);
 }
 
@@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
 	}
 }
 
+void deactivate_page(struct page *page)
+{
+	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		page_cache_get(page);
+		if (!pagevec_add(pvec, page))
+			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
+}
+
 void lru_add_drain(void)
 {
 	lru_add_drain_cpu(get_cpu());
@@ -873,6 +907,7 @@ void lru_add_drain_all(void)
 		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
 		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
 		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
 		    need_activate_page_drain(cpu)) {
 			INIT_WORK(work, lru_add_drain_per_cpu);
 			schedule_work_on(cpu, work);
-- 
1.9.1



* [PATCH RFC 4/4] mm: support MADV_FREE in swapless system
  2015-02-24  8:18 [PATCH RFC 1/4] mm: throttle MADV_FREE Minchan Kim
  2015-02-24  8:18 ` [PATCH RFC 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
  2015-02-24  8:18 ` [PATCH RFC 3/4] mm: move lazy free pages to inactive list Minchan Kim
@ 2015-02-24  8:18 ` Minchan Kim
  2015-02-24 16:51   ` Michal Hocko
  2015-02-24 15:43 ` [PATCH RFC 1/4] mm: throttle MADV_FREE Michal Hocko
  3 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-24  8:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Rik van Riel, Michal Hocko,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang, Minchan Kim

Historically, we have disabled reclaiming of anonymous memory
completely on systems with swap off or without swap configured.
That made sense, but the problem for lazy free pages is that we
never get a chance to discard hinted pages in the reclaim path
on those systems.

That's why the current MADV_FREE implementation drops pages
instantly, like MADV_DONTNEED, on swapless systems, so users on
those systems cannot get the benefit of MADV_FREE.

This patch makes the VM try to reclaim anonymous pages on swapless
systems. The basic strategy is to try to reclaim anonymous pages
if there are pages on the *inactive anon* LRU.

In a non-swap-configured system, the VM doesn't do aging/reclaiming
of the anonymous LRU lists, so the inactive anon LRU list should
always be empty. So, if there are pages on the inactive anon LRU,
they are MADV_FREE hinted pages, and the VM tries to reclaim them
and discards them or promotes them onto the active list.

In a swap-configured-but-not-yet-swapped-on system, the VM doesn't
do aging/reclaiming of the anonymous LRU lists either, so the
inactive anon LRU list would usually be empty, but not always:
some pages could remain on the inactive anon list if the admin had
enabled swap before. So, if there are pages on the inactive anon
LRU, they are either MADV_FREE hinted pages or non-hinted pages
which have stayed there from before. The VM tries to reclaim them
and discards them or promotes them onto the active list, so after
a while only hinted pages remain on the inactive anon LRU list.

In a swap-enabled-and-full system, the VM does aging but does not
try to reclaim in the current implementation. However, with this
patch we do try to reclaim those pages, so reclaim efficiency could
be worse than before. I don't know how many such workloads (ie,
doing a job with a full swap for a long time) we should take care of.

Comments welcome.

On a swapoff system with 3G RAM, 10 processes run the loop below:

loop = 12;
mmap(256M);
while (loop--) {
	memset(256M);
	madvise(MADV_FREE or MADV_DONTNEED);
	sleep(1);
}

1) dontneed: 5.40user 24.75system 0:15.36elapsed
2) madvfree + this patch: 5.18user 6.90system 0:13.39elapsed

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 21 ++++++++-------------
 mm/vmscan.c  | 32 +++++++++++++++++++++-----------
 2 files changed, 29 insertions(+), 24 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6176039c69e4..b3937e8012e6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -515,6 +515,8 @@ static long
 madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		unsigned long start, unsigned long end, int behavior)
 {
+	unsigned long free_threshold;
+
 	switch (behavior) {
 	case MADV_REMOVE:
 		return madvise_remove(vma, prev, start, end);
@@ -522,20 +524,13 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_FREE:
 		/*
-		 * XXX: In this implementation, MADV_FREE works like
-		 * MADV_DONTNEED on swapless system or full swap.
+		 * If we have trouble with memory pressure (ie,
+		 * under high watermark), free pages instantly.
 		 */
-		if (get_nr_swap_pages() > 0) {
-			unsigned long threshold;
-			/*
-			 * If we have trouble with memory pressure (ie,
-			 * under high watermark), free pages instantly.
-			 */
-			threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
-			threshold = threshold + (threshold >> 1);
-			if (nr_free_pages() > threshold)
-				return madvise_free(vma, prev, start, end);
-		}
+		free_threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
+		free_threshold = free_threshold + (free_threshold >> 1);
+		if (nr_free_pages() > free_threshold)
+			return madvise_free(vma, prev, start, end);
 		/* passthrough */
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 671e47edb584..1574cd783ab9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -163,16 +163,23 @@ static bool global_reclaim(struct scan_control *sc)
 
 static unsigned long zone_reclaimable_pages(struct zone *zone)
 {
-	int nr;
+	unsigned long file, anon;
 
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE);
+	file = zone_page_state(zone, NR_ACTIVE_FILE) +
+		zone_page_state(zone, NR_INACTIVE_FILE);
 
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
+	/*
+	 * Although there is no swap space, we should consider
+	 * lazy free pages in inactive anon LRU list.
+	 */
+	if (total_swap_pages > 0) {
+		anon = zone_page_state(zone, NR_ACTIVE_ANON) +
+			zone_page_state(zone, NR_INACTIVE_ANON);
+	} else {
+		anon = zone_page_state(zone, NR_INACTIVE_ANON);
+	}
 
-	return nr;
+	return file + anon;
 }
 
 bool zone_reclaimable(struct zone *zone)
@@ -2002,8 +2009,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
 	if (!global_reclaim(sc))
 		force_scan = true;
 
-	/* If we have no swap space, do not bother scanning anon pages. */
-	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
+	/*
+	 * If we have no inactive anon page, do not bother scanning
+	 * anon pages.
+	 */
+	if (!sc->may_swap || !zone_page_state(zone, NR_INACTIVE_ANON)) {
 		scan_balance = SCAN_FILE;
 		goto out;
 	}
@@ -2344,8 +2354,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 */
 	pages_for_compaction = (2UL << sc->order);
 	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+	inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
+
 	if (sc->nr_reclaimed < pages_for_compaction &&
 			inactive_lru_pages > pages_for_compaction)
 		return true;
-- 
1.9.1



* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-24  8:18 [PATCH RFC 1/4] mm: throttle MADV_FREE Minchan Kim
                   ` (2 preceding siblings ...)
  2015-02-24  8:18 ` [PATCH RFC 4/4] mm: support MADV_FREE in swapless system Minchan Kim
@ 2015-02-24 15:43 ` Michal Hocko
  2015-02-24 22:54   ` Shaohua Li
  2015-02-25  0:08   ` Minchan Kim
  3 siblings, 2 replies; 42+ messages in thread
From: Michal Hocko @ 2015-02-24 15:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> Recently, Shaohua reported that MADV_FREE is much slower than
> MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> applications stalled in direct reclaim because kswapd's reclaim
> speed couldn't keep up with the applications' allocation speed,
> which caused lots of stalls and lock contention.

I am not sure I understand this correctly. So the issue is that there is
a huge number of MADV_FREE pages on the LRU and they are not close to the
tail of the list, so the reclaim has to do a lot of work before it starts
dropping them?

> This patch throttles MADV_FREEing so it works only if there are
> enough free pages in the system that background/direct reclaim
> will not be triggered. Otherwise, MADV_FREE falls back to
> MADV_DONTNEED because there is no point in delaying freeing if we
> know the system is under memory pressure.

Hmm, this is still conforming to the documentation because the kernel is
free to free pages at its convenience. I am not sure this is a good
idea, though. Why some MADV_FREE calls should be treated differently?
Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
would work without long stalls most of the time - except when there is a
memory pressure.

Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
two calls is different.

> When I test the patch on my 3G machine + 12 CPU + 8G swap,
> test: 12 processes
> 
> loop = 5;
> mmap(512M);

Who is eating the rest of the memory?

> while (loop--) {
> 	memset(512M);
> 	madvise(MADV_FREE or MADV_DONTNEED);
> }
> 
> 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> 3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed
> 
> It's clearly win.
> 
> Reported-by: Shaohua Li <shli@kernel.org>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

I don't know. This looks like a hack with hard to predict consequences
which might trigger pathological corner cases.

> ---
>  mm/madvise.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8921c2..81bb26ecf064 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		 * XXX: In this implementation, MADV_FREE works like
>  		 * MADV_DONTNEED on swapless system or full swap.
>  		 */
> -		if (get_nr_swap_pages() > 0)
> -			return madvise_free(vma, prev, start, end);
> +		if (get_nr_swap_pages() > 0) {
> +			unsigned long threshold;
> +			/*
> +			 * If we have trouble with memory pressure (ie,
> +			 * under high watermark), free pages instantly.
> +			 */
> +			threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> +			threshold = threshold + (threshold >> 1);

Why threshold += threshold >> 1 ?

> +			if (nr_free_pages() > threshold)
> +				return madvise_free(vma, prev, start, end);
> +		}
>  		/* passthrough */
>  	case MADV_DONTNEED:
>  		return madvise_dontneed(vma, prev, start, end);
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 3/4] mm: move lazy free pages to inactive list
  2015-02-24  8:18 ` [PATCH RFC 3/4] mm: move lazy free pages to inactive list Minchan Kim
@ 2015-02-24 16:14   ` Michal Hocko
  2015-02-25  0:27     ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2015-02-24 16:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Tue 24-02-15 17:18:16, Minchan Kim wrote:
> MADV_FREE is hint that it's okay to discard pages if memory is
> pressure and we uses reclaimers(ie, kswapd and direct reclaim)

s@if memory is pressure@if there is memory pressure@

> to free them so there is no worth to remain them in active
> anonymous LRU list so this patch moves them to inactive LRU list.

Makes sense to me.

> A arguable issue for the approach is whether we should put it
> head or tail in inactive list

Is it really arguable? Why should active MADV_FREE pages appear before
those which were living on the inactive list. This doesn't make any
sense to me.

> and selected it as head because
> kernel cannot make sure it's really cold or warm for every usecase
> but at least we know it's not hot so landing of inactive head
> would be comprimise if it stayed in active LRU.

This is really hard to read. What do you think about the following
wording?
"
The active status of those pages is cleared and they are moved to the
head of the inactive LRU. This means that MADV_FREE-ed pages which
were living on the inactive list are reclaimed first because they
are more likely to be cold rather than recently active pages.
"

> As well, if we put recent hinted pages to inactive's tail,
> VM could discard cache hot pages, which would be bad.
> 
> As a bonus, we don't need to move them back and forth in the
> inactive list whenever the MADV_FREE syscall is called.
> 
> As drawback, VM should scan more pages in inactive anonymous LRU
> to discard but it has happened all the time if recent reference
> happens on those pages in inactive LRU list so I don't think
> it's not a main drawback.

Rather than the above paragraphs I would like to see a description why
this is needed. Something like the following?
"
This is fixing a suboptimal behavior of MADV_FREE when pages living on
the active list will sit there for a long time even under memory
pressure while the inactive list is reclaimed heavily. This basically
breaks the whole purpose of using MADV_FREE to help the system to free
memory which is might not be used.
"

> Signed-off-by: Minchan Kim <minchan@kernel.org>

Other than that the patch looks good to me.
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/swap.h |  1 +
>  mm/madvise.c         |  2 ++
>  mm/swap.c            | 35 +++++++++++++++++++++++++++++++++++
>  3 files changed, 38 insertions(+)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index cee108cbe2d5..0428e4c84e1d 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
>  extern void lru_add_drain_all(void);
>  extern void rotate_reclaimable_page(struct page *page);
>  extern void deactivate_file_page(struct page *page);
> +extern void deactivate_page(struct page *page);
>  extern void swap_setup(void);
>  
>  extern void add_page_to_unevictable_list(struct page *page);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 81bb26ecf064..6176039c69e4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -324,6 +324,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		ptent = pte_mkold(ptent);
>  		ptent = pte_mkclean(ptent);
>  		set_pte_at(mm, addr, pte, ptent);
> +		if (PageActive(page))
> +			deactivate_page(page);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}
>  	arch_leave_lazy_mmu_mode();
> diff --git a/mm/swap.c b/mm/swap.c
> index 5b2a60578f9c..393968c33667 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -43,6 +43,7 @@ int page_cluster;
>  static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
>  static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
>  static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
> +static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
>  
>  /*
>   * This path almost never happens for VM activity - pages are normally
> @@ -789,6 +790,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
>  	update_page_reclaim_stat(lruvec, file, 0);
>  }
>  
> +
> +static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
> +			    void *arg)
> +{
> +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +		int file = page_is_file_cache(page);
> +		int lru = page_lru_base_type(page);
> +
> +		del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
> +		ClearPageActive(page);
> +		add_page_to_lru_list(page, lruvec, lru);
> +
> +		__count_vm_event(PGDEACTIVATE);
> +		update_page_reclaim_stat(lruvec, file, 0);
> +	}
> +}
> +
>  /*
>   * Drain pages out of the cpu's pagevecs.
>   * Either "cpu" is the current CPU, and preemption has already been
> @@ -815,6 +833,10 @@ void lru_add_drain_cpu(int cpu)
>  	if (pagevec_count(pvec))
>  		pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
>  
> +	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
> +	if (pagevec_count(pvec))
> +		pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +
>  	activate_page_drain(cpu);
>  }
>  
> @@ -844,6 +866,18 @@ void deactivate_file_page(struct page *page)
>  	}
>  }
>  
> +void deactivate_page(struct page *page)
> +{
> +	if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> +		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> +		page_cache_get(page);
> +		if (!pagevec_add(pvec, page))
> +			pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> +		put_cpu_var(lru_deactivate_pvecs);
> +	}
> +}
> +
>  void lru_add_drain(void)
>  {
>  	lru_add_drain_cpu(get_cpu());
> @@ -873,6 +907,7 @@ void lru_add_drain_all(void)
>  		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
>  		    pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
> +		    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
>  		    need_activate_page_drain(cpu)) {
>  			INIT_WORK(work, lru_add_drain_per_cpu);
>  			schedule_work_on(cpu, work);
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 4/4] mm: support MADV_FREE in swapless system
  2015-02-24  8:18 ` [PATCH RFC 4/4] mm: support MADV_FREE in swapless system Minchan Kim
@ 2015-02-24 16:51   ` Michal Hocko
  2015-02-25  1:41     ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2015-02-24 16:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Tue 24-02-15 17:18:17, Minchan Kim wrote:
> Historically, we have disabled reclaiming of anonymous memory
> completely on systems with swap off or without swap configured.
> That made sense, but the problem for lazy free pages is that we
> never get a chance to discard hinted pages in the reclaim path
> on those systems.
> 
> That's why the current MADV_FREE implementation drops pages
> instantly, like MADV_DONTNEED, on swapless systems, so users on
> those systems cannot get the benefit of MADV_FREE.
> 
> This patch makes VM try to reclaim anonymous pages on swapless
> system. Basic strategy is to try reclaim anonymous pages
> if there are pages in *inactive anon*.

Nice idea.

> In non-swap config system, VM doesn't do aging/reclaiming
> anonymous LRU list so inactive anon LRU list should be always
> empty. So, if there are some pages in inactive anon LRU,
> it means they are MADV_FREE hinted pages so VM try to reclaim
> them and discard or promote them onto active list.
> 
> In swap-config-but-not-yet-swapon, VM doesn't do aging/reclaiming
> anonymous LRU list so inactive anon LRU list would be empty but
> it might be not always true because some pages could remain
> inactive anon list if the admin had swapon before. So, if there
> are some pages in inactive anon LRU, it means they are MADV_FREE
> hinted pages or non-hinted pages which have stayed before.
> VM try to reclaim them and discard or promote them onto active list
> so we could have only hinted pages on inactive anon LRU list
> after a while.
> 
> In swap-enabled-and-full, VM does aging but not try to reclaim
> in current implementation. However, we try to reclaim them by
> this patch so reclaim efficiency would be worse than old.
> I don't know how many such workloads(ie, doing a job with
> full-swap for a long time)  we should take care of are.

Talking about performance when the swap is full is a bit funny. The
system is crawling at the time most probably.

> Comments welcome.
> 
> On swapoff system with 3G ram, there are 10 processes with below
> 
> loop = 12;
> mmap(256M);

Again, what is the memory pressure which fills up the rest of the memory?

> while (loop--) {
> 	memset(256M);
> 	madvise(MADV_FREE or MADV_DONTNEED);
> 	sleep(1);
> }
> 
> 1) dontneed: 5.40user 24.75system 0:15.36elapsed
> 2) madvfree + this patch: 5.18user 6.90system 0:13.39elapsed
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  mm/madvise.c | 21 ++++++++-------------
>  mm/vmscan.c  | 32 +++++++++++++++++++++-----------
>  2 files changed, 29 insertions(+), 24 deletions(-)
> 
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 671e47edb584..1574cd783ab9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -163,16 +163,23 @@ static bool global_reclaim(struct scan_control *sc)
>  
>  static unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
> -	int nr;
> +	unsigned long file, anon;
>  
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE);
> +	file = zone_page_state(zone, NR_ACTIVE_FILE) +
> +		zone_page_state(zone, NR_INACTIVE_FILE);
>  
> -	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON);
> +	/*
> +	 * Although there is no swap space, we should consider
> +	 * lazy free pages in inactive anon LRU list.
> +	 */
> +	if (total_swap_pages > 0) {
> +		anon = zone_page_state(zone, NR_ACTIVE_ANON) +
> +			zone_page_state(zone, NR_INACTIVE_ANON);
> +	} else {
> +		anon = zone_page_state(zone, NR_INACTIVE_ANON);
> +	}

Hmm, this doesn't look right to me. The zone would be considered
reclaimable even though it is not in fact. Most of the systems are
configured with swap and this would break those AFAICS.

> -	return nr;
> +	return file + anon;
>  }
>  
>  bool zone_reclaimable(struct zone *zone)
> @@ -2002,8 +2009,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
>  	if (!global_reclaim(sc))
>  		force_scan = true;
>  
> -	/* If we have no swap space, do not bother scanning anon pages. */
> -	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
> +	/*
> +	 * If we have no inactive anon page, do not bother scanning
> +	 * anon pages.
> +	 */
> +	if (!sc->may_swap || !zone_page_state(zone, NR_INACTIVE_ANON)) {
>  		scan_balance = SCAN_FILE;
>  		goto out;
>  	}

but later in the function we are considering active anon pages as well.
Does this work at all?

> @@ -2344,8 +2354,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
>  	 */
>  	pages_for_compaction = (2UL << sc->order);
>  	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
> -	if (get_nr_swap_pages() > 0)
> -		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +	inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
> +
>  	if (sc->nr_reclaimed < pages_for_compaction &&
>  			inactive_lru_pages > pages_for_compaction)
>  		return true;
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-24 15:43 ` [PATCH RFC 1/4] mm: throttle MADV_FREE Michal Hocko
@ 2015-02-24 22:54   ` Shaohua Li
  2015-02-25 14:13     ` Michal Hocko
  2015-02-25  0:08   ` Minchan Kim
  1 sibling, 1 reply; 42+ messages in thread
From: Shaohua Li @ 2015-02-24 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Yalin.Wang

On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > Recently, Shaohua reported that MADV_FREE is much slower than
> > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > applications stalled in direct reclaim, since kswapd's reclaim
> > speed isn't as fast as the applications' allocation speed, which
> > causes lots of stalls and lock contention.
> 
> I am not sure I understand this correctly. So the issue is that there is
> huge number of MADV_FREE on the LRU and they are not close to the tail
> of the list so the reclaim has to do a lot of work before it starts
> dropping them?

I thought the main reason is the current reclaim strategy. Anonymous pages
are considered hard to reclaim under the current policy; the VM is biased
toward reclaiming file pages (anon pages start on the active list, and a
referenced pte will reactivate anon pages and increase the rotate count).

> > This patch throttles MADV_FREEing so it works only if there
> > are enough free pages in the system that it will not trigger
> > background/direct reclaim. Otherwise, MADV_FREE falls back to
> > MADV_DONTNEED, because there is no point in delaying freeing if we
> > know the system is under memory pressure.
> 
> Hmm, this is still conforming to the documentation because the kernel is
> free to free pages at its convenience. I am not sure this is a good
> idea, though. Why some MADV_FREE calls should be treated differently?
> Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
> would work without long stalls most of the time - except when there is a
> memory pressure.
> 
> Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> two calls is different.
> 
> > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > test: 12 processes
> > 
> > loop = 5;
> > mmap(512M);
> 
> Who is eating the rest of the memory?
> 
> > while (loop--) {
> > 	memset(512M);
> > 	madvise(MADV_FREE or MADV_DONTNEED);
> > }
> > 
> > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > 3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed
> > 
> > It's clearly win.
> > 
> > Reported-by: Shaohua Li <shli@kernel.org>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> I don't know. This looks like a hack with hard to predict consequences
> which might trigger pathological corner cases.

This is a big improvement in practice, but as Michal said, it will introduce
unpredictable behavior: pages madvfree-ed before memory pressure will be
freed later. Plus, this doesn't change the situation that madvfree pages are
hard to free (even with the 3rd patch). Of course that's not introduced by
the madvfree patch; the VM is biased toward freeing file pages.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-24 15:43 ` [PATCH RFC 1/4] mm: throttle MADV_FREE Michal Hocko
  2015-02-24 22:54   ` Shaohua Li
@ 2015-02-25  0:08   ` Minchan Kim
  2015-02-25  7:11     ` Minchan Kim
  2015-02-27  3:37     ` [RFC] mm: change mm_advise_free to clear page dirty Wang, Yalin
  1 sibling, 2 replies; 42+ messages in thread
From: Minchan Kim @ 2015-02-25  0:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

Hi Michal,

On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > Recently, Shaohua reported that MADV_FREE is much slower than
> > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > applications stalled in direct reclaim, since kswapd's reclaim
> > speed isn't as fast as the applications' allocation speed, which
> > causes lots of stalls and lock contention.
> 
> I am not sure I understand this correctly. So the issue is that there is
> huge number of MADV_FREE on the LRU and they are not close to the tail
> of the list so the reclaim has to do a lot of work before it starts
> dropping them?

No, Shaohua already tested deactivating hinted pages to the head/tail
of the inactive anon LRU and he said it didn't solve his problem.
I thought the main culprit was scanning/rotating/throttling in the
direct reclaim path.

> 
> > This patch throttles MADV_FREEing so it works only if there
> > are enough free pages in the system that it will not trigger
> > background/direct reclaim. Otherwise, MADV_FREE falls back to
> > MADV_DONTNEED, because there is no point in delaying freeing if we
> > know the system is under memory pressure.
> 
> Hmm, this is still conforming to the documentation because the kernel is
> free to free pages at its convenience. I am not sure this is a good
> idea, though. Why some MADV_FREE calls should be treated differently?

It's a hint for the VM that it may free the pages, so I think it's okay to
free them instantly sometimes if that avoids something more important, like
a system stall. IOW, madvise is just a hint, not a strict rule.

> Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
> would work without long stalls most of the time - except when there is a
> memory pressure.

True.

> 
> Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> two calls is different.

I agree it's not an apples-to-apples comparison.

Actually, MADV_FREE moves the cost from the hot path (ie, the system call
path) to the slow path (ie, the reclaim context), so it would be slower
under continuous memory pressure due to the extra overhead of freeing pages
in the reclaim context. So, it would be good if the kernel detected that
nicely and prevented the situation. This patch aims for that.

> 
> > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > test: 12 processes
> > 
> > loop = 5;
> > mmap(512M);
> 
> Who is eating the rest of the memory?

As I wrote above, there are 12 processes running the test below.
IOW, 512M * 12 = 6G but system RAM is just 3G.

> 
> > while (loop--) {
> > 	memset(512M);
> > 	madvise(MADV_FREE or MADV_DONTNEED);
> > }
> > 
> > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > > 3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed
> > 
> > It's clearly win.
> > 
> > Reported-by: Shaohua Li <shli@kernel.org>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> I don't know. This looks like a hack with hard to predict consequences
> which might trigger pathological corner cases.

Yeb, it might be. That's why I tagged it RFC; I hope others can suggest a
better idea.

> 
> > ---
> >  mm/madvise.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8921c2..81bb26ecf064 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  		 * XXX: In this implementation, MADV_FREE works like
> >  		 * MADV_DONTNEED on swapless system or full swap.
> >  		 */
> > -		if (get_nr_swap_pages() > 0)
> > -			return madvise_free(vma, prev, start, end);
> > +		if (get_nr_swap_pages() > 0) {
> > +			unsigned long threshold;
> > +			/*
> > +			 * If we have trobule with memory pressure(ie,
> > +			 * under high watermark), free pages instantly.
> > +			 */
> > +			threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> > +			threshold = threshold + (threshold >> 1);
> 
> Why threshold += threshold >> 1 ?

I wanted this logic to fall back once free pages drop under the high
watermark.

> 
> > +			if (nr_free_pages() > threshold)
> > +				return madvise_free(vma, prev, start, end);
> > +		}
> >  		/* passthrough */
> >  	case MADV_DONTNEED:
> >  		return madvise_dontneed(vma, prev, start, end);
> > -- 
> > 1.9.1
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 3/4] mm: move lazy free pages to inactive list
  2015-02-24 16:14   ` Michal Hocko
@ 2015-02-25  0:27     ` Minchan Kim
  2015-02-25 15:17       ` Michal Hocko
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-25  0:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Tue, Feb 24, 2015 at 05:14:08PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:16, Minchan Kim wrote:
> > MADV_FREE is hint that it's okay to discard pages if memory is
> > pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> 
> s@if memory is pressure@if there is memory pressure@
> 
> > to free them so there is no worth to remain them in active
> > anonymous LRU list so this patch moves them to inactive LRU list.
> 
> Makes sense to me.
> 
> > An arguable issue for the approach is whether we should put it at the
> > head or tail of the inactive list
> 
> Is it really arguable? Why should active MADV_FREE pages appear before
> those which were living on the inactive list. This doesn't make any
> sense to me.

It would be better to drop garbage pages (ie, pages freed back to the
allocator) rather than swap anything out, and right now anon LRU aging is
a sequential model, so the inactive list can include a lot of the working
set; putting hinted pages at the tail of the LRU could therefore enhance
reclaim efficiency. That's why I said it might be arguable.

> 
> > and selected it as head because
> > kernel cannot make sure it's really cold or warm for every usecase
> > but at least we know it's not hot so landing of inactive head
> > would be comprimise if it stayed in active LRU.
> 
> This is really hard to read. What do you think about the following
> wording?
> "
> The active status of those pages is cleared and they are moved to the
> head of the inactive LRU. This means that MADV_FREE-ed pages which
> were living on the inactive list are reclaimed first because they
> are more likely to be cold rather than recently active pages.
> "

My wording focuses on why we should put them at the head of the inactive
list, so it's orthogonal to yours; maybe mine could complement it.

> 
> > As well, if we put recent hinted pages to inactive's tail,
> > VM could discard cache hot pages, which would be bad.
> > 
> > As a bonus, we don't need to move them back and forth on the inactive
> > list whenever the MADV_FREE syscall is called.
> > 
> > As a drawback, the VM has to scan more pages on the inactive anonymous
> > LRU to discard them, but that already happens whenever a recent
> > reference hits pages on the inactive LRU list, so I don't think
> > it's a major drawback.
> 
> Rather than the above paragraphs I would like to see a description why
> this is needed. Something like the following?
> "
> This is fixing a suboptimal behavior of MADV_FREE when pages living on
> the active list will sit there for a long time even under memory
> pressure while the inactive list is reclaimed heavily. This basically
> breaks the whole purpose of using MADV_FREE to help the system to free
> memory which is might not be used.
> "

Good to me. I will add this. Thanks!

> 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> Other than that the patch looks good to me.
> Acked-by: Michal Hocko <mhocko@suse.cz>

Thanks for the review, Michal!

-- 
Kind regards,
Minchan Kim
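The head-versus-tail question in this subthread is really about reclaim order. A toy model (plain C with invented structures, not kernel code) makes the trade-off concrete: reclaim scans from the tail, so tail insertion gets the hinted page discarded first, while head insertion lets pages that have aged longest on the inactive list go first.

```c
#include <assert.h>
#include <string.h>

/* Toy inactive LRU: index 0 is the head, the last element is the
 * tail, and reclaim evicts from the tail first. */
struct lru { const char *page[8]; int n; };

static void add_head(struct lru *l, const char *p)
{
    memmove(&l->page[1], &l->page[0], l->n * sizeof(l->page[0]));
    l->page[0] = p;
    l->n++;
}

static void add_tail(struct lru *l, const char *p)
{
    l->page[l->n++] = p;
}

static const char *reclaim_one(struct lru *l)
{
    return l->page[--l->n];   /* evict the tail */
}

int main(void)
{
    struct lru l = { .n = 0 };
    add_tail(&l, "old-inactive-1");
    add_tail(&l, "old-inactive-2");

    /* The patch's choice: the hinted page goes to the head, so the
     * pages that have aged longest on the inactive list go first ... */
    add_head(&l, "madv-free-hinted");
    assert(strcmp(reclaim_one(&l), "old-inactive-2") == 0);

    /* ... whereas tail insertion would discard the hinted page
     * immediately, ahead of possibly-warmer inactive pages. */
    add_tail(&l, "madv-free-hinted");
    assert(strcmp(reclaim_one(&l), "madv-free-hinted") == 0);
    return 0;
}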

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 4/4] mm: support MADV_FREE in swapless system
  2015-02-24 16:51   ` Michal Hocko
@ 2015-02-25  1:41     ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2015-02-25  1:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Tue, Feb 24, 2015 at 05:51:46PM +0100, Michal Hocko wrote:
> On Tue 24-02-15 17:18:17, Minchan Kim wrote:
> > Historically, we have disabled reclaiming completely with swapoff
> > or non-swap configurable system. It did make sense but problem
> > for lazy free pages is that we couldn't get a chance to discard
> > hinted pages in reclaim path in those systems.
> > 
> > That's why the current MADV_FREE implementation drops pages instantly,
> > like MADV_DONTNEED, on swapless systems, so users on those
> > systems couldn't get the benefit of MADV_FREE.
> > 
> > This patch makes VM try to reclaim anonymous pages on swapless
> > system. Basic strategy is to try reclaim anonymous pages
> > if there are pages in *inactive anon*.
> 
> Nice idea.
> 
> > On a system built without swap support, the VM doesn't age or
> > reclaim the anonymous LRU list, so the inactive anon LRU list
> > should always be empty. Thus, if there are pages on the inactive
> > anon LRU, they must be MADV_FREE hinted pages, so the VM tries to
> > reclaim them and either discards them or promotes them onto the
> > active list.
> > 
> > In the swap-configured-but-not-yet-swapped-on case, the VM doesn't
> > age or reclaim the anonymous LRU list either, so the inactive anon
> > LRU list would normally be empty, but not always: some pages could
> > remain on the inactive anon list if the admin had run swapon
> > earlier. So, if there are pages on the inactive anon LRU, they are
> > either MADV_FREE hinted pages or non-hinted pages left over from
> > before. The VM tries to reclaim them and discards or promotes them
> > onto the active list, so after a while only hinted pages remain on
> > the inactive anon LRU list.
> > 
> > In the swap-enabled-and-full case, the VM does aging but does not
> > try to reclaim in the current implementation. However, with this
> > patch we do try to reclaim, so reclaim efficiency would be worse
> > than before. I don't know how many such workloads (ie, running for
> > a long time with swap completely full) we should care about.
> 
> Talking about performance when the swap is full is a bit funny. The
> system is crawling at the time most probably.
> 
> > I'd appreciate comments.
> > 
> > On swapoff system with 3G ram, there are 10 processes with below
> > 
> > loop = 12;
> > mmap(256M);
> 
> again what is the memory pressure which fills up the rest of the memory.

256M * 10 processes = 2.5G on 3G ram with swapoff

> 
> > while (loop--) {
> > 	memset(256M);
> > 	madvise(MADV_FREE or MADV_DONTNEED);
> > 	sleep(1);
> > }
> > 
> > 1) dontneed: 5.40user 24.75system 0:15.36elapsed
> > 2) madvfree + this patch: 5.18user 6.90system 0:13.39elapsed
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  mm/madvise.c | 21 ++++++++-------------
> >  mm/vmscan.c  | 32 +++++++++++++++++++++-----------
> >  2 files changed, 29 insertions(+), 24 deletions(-)
> > 
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 671e47edb584..1574cd783ab9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -163,16 +163,23 @@ static bool global_reclaim(struct scan_control *sc)
> >  
> >  static unsigned long zone_reclaimable_pages(struct zone *zone)
> >  {
> > -	int nr;
> > +	unsigned long file, anon;
> >  
> > -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> > -	     zone_page_state(zone, NR_INACTIVE_FILE);
> > +	file = zone_page_state(zone, NR_ACTIVE_FILE) +
> > +		zone_page_state(zone, NR_INACTIVE_FILE);
> >  
> > -	if (get_nr_swap_pages() > 0)
> > -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> > -		      zone_page_state(zone, NR_INACTIVE_ANON);
> > +	/*
> > +	 * Although there is no swap space, we should consider
> > +	 * lazy free pages in inactive anon LRU list.
> > +	 */
> > +	if (total_swap_pages > 0) {
> > +		anon = zone_page_state(zone, NR_ACTIVE_ANON) +
> > +			zone_page_state(zone, NR_INACTIVE_ANON);
> > +	} else {
> > +		anon = zone_page_state(zone, NR_INACTIVE_ANON);
> > +	}
> 
> Hmm, this doesn't look right to me. The zone would be considered
> reclaimable even though it is not in fact. Most of the systems are
> configured with swap and this would break those AFAICS.

Old behavior:

a: active i: inactive f: file a: anon

1) swapoff : af + if
2) swapon + free space on swap : af + if + aa + ia
3) swapon + no free space on swap : af + if

New behavior

1) swapoff : af + if + ia
Here, ia would be zero most of the time unless there are hinted pages on
the inactive anon LRU, so it would be no different from the old behavior.
If there are hinted pages on the inactive anon LRU, they are potentially
reclaimable pages, so we should include them in the calculation.

2) swapon + free space on swap : af + if + aa + ia

Same with old.

3) swapon + no free space on swap : af + if + ia

I agree this case would differ from the old behavior, so it would have
side effects (ex, more reclaim attempts, delayed OOM). But I don't know
how else to handle a system whose swap is fully used.


> 
> > -	return nr;
> > +	return file + anon;
> >  }
> >  
> >  bool zone_reclaimable(struct zone *zone)
> > @@ -2002,8 +2009,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> >  	if (!global_reclaim(sc))
> >  		force_scan = true;
> >  
> > -	/* If we have no swap space, do not bother scanning anon pages. */
> > -	if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
> > +	/*
> > +	 * If we have no inactive anon page, do not bother scanning
> > +	 * anon pages.
> > +	 */
> > +	if (!sc->may_swap || !zone_page_state(zone, NR_INACTIVE_ANON)) {
> >  		scan_balance = SCAN_FILE;
> >  		goto out;
> >  	}
> 
> but later in the function we are considering active anon pages as well.
> Does this work at all?

I don't understand what your concern is. Could you elaborate?
What I want is just to try reclaiming anonymous pages in the situation
where there is no swap but there are pages on the inactive anon list.

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-25  0:08   ` Minchan Kim
@ 2015-02-25  7:11     ` Minchan Kim
  2015-02-25 15:07       ` Michal Hocko
  2015-02-25 18:37       ` Shaohua Li
  2015-02-27  3:37     ` [RFC] mm: change mm_advise_free to clear page dirty Wang, Yalin
  1 sibling, 2 replies; 42+ messages in thread
From: Minchan Kim @ 2015-02-25  7:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> Hi Michal,
> 
> On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > > applications stalled in direct reclaim, since kswapd's reclaim
> > > speed isn't as fast as the applications' allocation speed, which
> > > causes lots of stalls and lock contention.
> > 
> > I am not sure I understand this correctly. So the issue is that there is
> > huge number of MADV_FREE on the LRU and they are not close to the tail
> > of the list so the reclaim has to do a lot of work before it starts
> > dropping them?
> 
> No, Shaohua already tested deactivating hinted pages to the head/tail
> of the inactive anon LRU and he said it didn't solve his problem.
> I thought the main culprit was scanning/rotating/throttling in the
> direct reclaim path.

I investigated my workload and found most of the slowness came from swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find that hinted pages were already swapped out by the time the
syscall is called, it's pointless to keep them in the pte. Instead, free
the cold pages, because swapin is more expensive than (alloc page +
zeroing).

I tested the quick fix below and it reduced swapin from 879,585 to 1,878.
Elapsed time was

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

Although it was not as good as throttling, it's better than before, and
it's orthogonal to throttling, so I hope to merge this first rather than
the arguable throttling. Any comments?

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8921c2..d41ae76d3e54 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *pte, ptent;
 	struct page *page;
+	swp_entry_t entry;
 	unsigned long next;
+	int rss = 0;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd)) {
@@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (!pte_present(ptent))
+		if (pte_none(ptent))
 			continue;
 
+		if (!pte_present(ptent)) {
+			entry = pte_to_swp_entry(ptent);
+			if (non_swap_entry(entry))
+				continue;
+			rss--;
+			free_swap_and_cache(entry);
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			continue;
+		}
+
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
 			continue;
@@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		set_pte_at(mm, addr, pte, ptent);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
+
+	if (rss) {
+		if (current->mm == mm)
+			sync_mm_rss(mm);
+
+		add_mm_counter(mm, MM_SWAPENTS, rss);
+	}
+
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(pte - 1, ptl);
 next:
-- 
1.9.1


> 
> > 
> > > This patch throttles MADV_FREEing so it works only if there
> > > are enough free pages in the system that it will not trigger
> > > background/direct reclaim. Otherwise, MADV_FREE falls back to
> > > MADV_DONTNEED, because there is no point in delaying freeing if we
> > > know the system is under memory pressure.
> > 
> > Hmm, this is still conforming to the documentation because the kernel is
> > free to free pages at its convenience. I am not sure this is a good
> > idea, though. Why some MADV_FREE calls should be treated differently?
> 
> It's a hint for the VM that it may free the pages, so I think it's okay to
> free them instantly sometimes if that avoids something more important, like
> a system stall. IOW, madvise is just a hint, not a strict rule.
> 
> > Wouldn't that lead to hard to predict behavior? E.g. LIFO reused blocks
> > would work without long stalls most of the time - except when there is a
> > memory pressure.
> 
> True.
> 
> > 
> > Comparison to MADV_DONTNEED is not very fair IMHO because the scope of the
> > two calls is different.
> 
> I agree it's not an apples-to-apples comparison.
> 
> Actually, MADV_FREE moves the cost from the hot path (ie, the system call
> path) to the slow path (ie, the reclaim context), so it would be slower
> under continuous memory pressure due to the extra overhead of freeing pages
> in the reclaim context. So, it would be good if the kernel detected that
> nicely and prevented the situation. This patch aims for that.
> 
> > 
> > > When I test the patch on my 3G machine + 12 CPU + 8G swap,
> > > test: 12 processes
> > > 
> > > loop = 5;
> > > mmap(512M);
> > 
> > Who is eating the rest of the memory?
> 
> As I wrote above, there are 12 processes running the test below.
> IOW, 512M * 12 = 6G but system RAM is just 3G.
> 
> > 
> > > while (loop--) {
> > > 	memset(512M);
> > > 	madvise(MADV_FREE or MADV_DONTNEED);
> > > }
> > > 
> > > 1) dontneed: 6.78user 234.09system 0:48.89elapsed
> > > 2) madvfree: 6.03user 401.17system 1:30.67elapsed
> > > 3) madvfree + this patch: 5.68user 113.42system 0:36.52elapsed
> > > 
> > > It's clearly win.
> > > 
> > > Reported-by: Shaohua Li <shli@kernel.org>
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > 
> > I don't know. This looks like a hack with hard to predict consequences
> > which might trigger pathological corner cases.
> 
> Yeb, it might be. That's why I tagged it RFC; I hope others can suggest a
> better idea.
> 
> > 
> > > ---
> > >  mm/madvise.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8921c2..81bb26ecf064 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -523,8 +523,17 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > >  		 * XXX: In this implementation, MADV_FREE works like
> > >  		 * MADV_DONTNEED on swapless system or full swap.
> > >  		 */
> > > -		if (get_nr_swap_pages() > 0)
> > > -			return madvise_free(vma, prev, start, end);
> > > +		if (get_nr_swap_pages() > 0) {
> > > +			unsigned long threshold;
> > > +			/*
> > > +			 * If we have trouble with memory pressure (ie,
> > > +			 * under high watermark), free pages instantly.
> > > +			 */
> > > +			threshold = min_free_kbytes >> (PAGE_SHIFT - 10);
> > > +			threshold = threshold + (threshold >> 1);
> > 
> > Why threshold += threshold >> 1 ?
> 
> I wanted this logic to fall back once free pages drop under the high
> watermark.
> 
> > 
> > > +			if (nr_free_pages() > threshold)
> > > +				return madvise_free(vma, prev, start, end);
> > > +		}
> > >  		/* passthrough */
> > >  	case MADV_DONTNEED:
> > >  		return madvise_dontneed(vma, prev, start, end);
> > > -- 
> > > 1.9.1
> > > 
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majordomo@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> -- 
> Kind regards,
> Minchan Kim

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-24 22:54   ` Shaohua Li
@ 2015-02-25 14:13     ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2015-02-25 14:13 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Yalin.Wang

On Tue 24-02-15 14:54:01, Shaohua Li wrote:
> On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is that many
> > > applications stalled in direct reclaim, since kswapd's reclaim
> > > speed isn't as fast as the applications' allocation speed, which
> > > causes lots of stalls and lock contention.
> > 
> > I am not sure I understand this correctly. So the issue is that there is
> > huge number of MADV_FREE on the LRU and they are not close to the tail
> > of the list so the reclaim has to do a lot of work before it starts
> > dropping them?
> 
> I thought the main reason is the current reclaim strategy. Anonymous pages
> are considered hard to reclaim under the current policy; the VM is biased
> toward reclaiming file pages (anon pages start on the active list, and a
> referenced pte will reactivate anon pages and increase the rotate count).

Makes sense. We are really biasing to page cache reclaim most of the
time.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-25  7:11     ` Minchan Kim
@ 2015-02-25 15:07       ` Michal Hocko
  2015-02-25 18:37       ` Shaohua Li
  1 sibling, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2015-02-25 15:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Wed 25-02-15 16:11:18, Minchan Kim wrote:
[...]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8921c2..d41ae76d3e54 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	spinlock_t *ptl;
>  	pte_t *pte, ptent;
>  	struct page *page;
> +	swp_entry_t entry;
>  	unsigned long next;
> +	int rss = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> -		if (!pte_present(ptent))
> +		if (pte_none(ptent))
>  			continue;
>  
> +		if (!pte_present(ptent)) {
> +			entry = pte_to_swp_entry(ptent);
> +			if (non_swap_entry(entry))
> +				continue;
> +			rss--;
> +			free_swap_and_cache(entry);
> +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			continue;
> +		}

Yes this makes sense. rss is a bit confusing because those pages are not
resident.

> +
>  		page = vm_normal_page(vma, addr, ptent);
>  		if (!page)
>  			continue;
> @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		set_pte_at(mm, addr, pte, ptent);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}
> +
> +	if (rss) {
> +		if (current->mm == mm)
> +			sync_mm_rss(mm);
> +
> +		add_mm_counter(mm, MM_SWAPENTS, rss);
> +	}
> +
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(pte - 1, ptl);
>  next:
> -- 
> 1.9.1
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 3/4] mm: move lazy free pages to inactive list
  2015-02-25  0:27     ` Minchan Kim
@ 2015-02-25 15:17       ` Michal Hocko
  0 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2015-02-25 15:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li, Yalin.Wang

On Wed 25-02-15 09:27:28, Minchan Kim wrote:
> On Tue, Feb 24, 2015 at 05:14:08PM +0100, Michal Hocko wrote:
> > On Tue 24-02-15 17:18:16, Minchan Kim wrote:
> > > MADV_FREE is hint that it's okay to discard pages if memory is
> > > pressure and we uses reclaimers(ie, kswapd and direct reclaim)
> > 
> > s@if memory is pressure@if there is memory pressure@
> > 
> > > to free them so there is no worth to remain them in active
> > > anonymous LRU list so this patch moves them to inactive LRU list.
> > 
> > Makes sense to me.
> > 
> > > A arguable issue for the approach is whether we should put it
> > > head or tail in inactive list
> > 
> > Is it really arguable? Why should active MADV_FREE pages appear before
> > those which were living on the inactive list. This doesn't make any
> > sense to me.
> 
> It would be better to drop garbage pages (i.e., pages freed back to the
> allocator) rather than swap them out. Also, anon LRU aging currently
> follows a sequential model, so the inactive list can contain a lot of
> the working set; putting hinted pages at the tail of the LRU could
> therefore improve reclaim efficiency. That's why I said it might be
> arguable.

OK, maybe I misunderstood what you were trying to say. Sure, we can
discuss whether to put all MADV_FREE pages at the tail of the inactive
list. But the above wording suggested that _active_ MADV_FREE pages were
being discussed, and treating them differently from the inactive pages
simply didn't make sense to me.
 
[...]
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-25  7:11     ` Minchan Kim
  2015-02-25 15:07       ` Michal Hocko
@ 2015-02-25 18:37       ` Shaohua Li
  2015-02-26  0:42         ` Minchan Kim
  1 sibling, 1 reply; 42+ messages in thread
From: Shaohua Li @ 2015-02-25 18:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Yalin.Wang

On Wed, Feb 25, 2015 at 04:11:18PM +0900, Minchan Kim wrote:
> On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> > Hi Michal,
> > 
> > On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > > > applications went to stall with direct reclaim since kswapd's
> > > > reclaim speed isn't fast than applications's allocation speed
> > > > so that it causes lots of stall and lock contention.
> > > 
> > > I am not sure I understand this correctly. So the issue is that there is
> > > huge number of MADV_FREE on the LRU and they are not close to the tail
> > > of the list so the reclaim has to do a lot of work before it starts
> > > dropping them?
> > 
> > No, Shaohua already tested deactivating of hinted pages to head/tail
> > of inactive anon LRU and he said it didn't solve his problem.
> > I thought main culprit was scanning/rotating/throttling in
> > direct reclaim path.
> 
> I investigated my workload and found most of slowness came from swapin.
> 
> 1) dontneed: 1,612 swapin
> 2) madvfree: 879,585 swapin
> 
> If we find that hinted pages were already swapped out by the time the
> syscall is called, it's pointless to keep them in the pte. Instead, free
> the cold page, because swap-in is more expensive than (alloc page + zeroing).
> 
> I tested below quick fix and reduced swapin from 879,585 to 1,878.
> Elapsed time was
> 
> 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> 
> Although it was not as good as throttling, it's better than before, and
> it's orthogonal to throttling, so I hope to merge this first rather than
> the arguable throttling. Any comments?
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8921c2..d41ae76d3e54 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	spinlock_t *ptl;
>  	pte_t *pte, ptent;
>  	struct page *page;
> +	swp_entry_t entry;
>  	unsigned long next;
> +	int rss = 0;
>  
>  	next = pmd_addr_end(addr, end);
>  	if (pmd_trans_huge(*pmd)) {
> @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  	for (; addr != end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> -		if (!pte_present(ptent))
> +		if (pte_none(ptent))
>  			continue;
>  
> +		if (!pte_present(ptent)) {
> +			entry = pte_to_swp_entry(ptent);
> +			if (non_swap_entry(entry))
> +				continue;
> +			rss--;
> +			free_swap_and_cache(entry);
> +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			continue;
> +		}
> +
>  		page = vm_normal_page(vma, addr, ptent);
>  		if (!page)
>  			continue;
> @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		set_pte_at(mm, addr, pte, ptent);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}
> +
> +	if (rss) {
> +		if (current->mm == mm)
> +			sync_mm_rss(mm);
> +
> +		add_mm_counter(mm, MM_SWAPENTS, rss);
> +	}
> +

This looks like it makes sense, but I'm wondering why it helps and whether it
can help a real workload. Let me give an example. Say there is 1G of memory
and the workload uses 800M with DONTNEED; there should be no swap. With FREE,
the workload might use more than 1G of memory and trigger swap. I think the
case where DONTNEED doesn't trigger swap is more suitable for evaluating the
performance of the patch.

Thanks,
Shaohua



* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-25 18:37       ` Shaohua Li
@ 2015-02-26  0:42         ` Minchan Kim
  2015-02-26 19:04           ` Shaohua Li
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-26  0:42 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Yalin.Wang

Hello,

On Wed, Feb 25, 2015 at 10:37:48AM -0800, Shaohua Li wrote:
> On Wed, Feb 25, 2015 at 04:11:18PM +0900, Minchan Kim wrote:
> > On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> > > Hi Michal,
> > > 
> > > On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > > > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > > > > applications went to stall with direct reclaim since kswapd's
> > > > > reclaim speed isn't fast than applications's allocation speed
> > > > > so that it causes lots of stall and lock contention.
> > > > 
> > > > I am not sure I understand this correctly. So the issue is that there is
> > > > huge number of MADV_FREE on the LRU and they are not close to the tail
> > > > of the list so the reclaim has to do a lot of work before it starts
> > > > dropping them?
> > > 
> > > No, Shaohua already tested deactivating of hinted pages to head/tail
> > > of inactive anon LRU and he said it didn't solve his problem.
> > > I thought main culprit was scanning/rotating/throttling in
> > > direct reclaim path.
> > 
> > I investigated my workload and found most of slowness came from swapin.
> > 
> > 1) dontneed: 1,612 swapin
> > 2) madvfree: 879,585 swapin
> > 
> > If we find hinted pages were already swapped out when syscall is called,
> > it's pointless to keep the pages in pte. Instead, free the cold page
> > because swapin is more expensive than (alloc page + zeroing).
> > 
> > I tested below quick fix and reduced swapin from 879,585 to 1,878.
> > Elapsed time was
> > 
> > 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> > 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> > 
> > Although it was not good as throttling, it's better than old and
> > it's orthogoral with throttling so I hope to merge this first
> > than arguable throttling. Any comments?
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8921c2..d41ae76d3e54 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  	spinlock_t *ptl;
> >  	pte_t *pte, ptent;
> >  	struct page *page;
> > +	swp_entry_t entry;
> >  	unsigned long next;
> > +	int rss = 0;
> >  
> >  	next = pmd_addr_end(addr, end);
> >  	if (pmd_trans_huge(*pmd)) {
> > @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  	for (; addr != end; pte++, addr += PAGE_SIZE) {
> >  		ptent = *pte;
> >  
> > -		if (!pte_present(ptent))
> > +		if (pte_none(ptent))
> >  			continue;
> >  
> > +		if (!pte_present(ptent)) {
> > +			entry = pte_to_swp_entry(ptent);
> > +			if (non_swap_entry(entry))
> > +				continue;
> > +			rss--;
> > +			free_swap_and_cache(entry);
> > +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > +			continue;
> > +		}
> > +
> >  		page = vm_normal_page(vma, addr, ptent);
> >  		if (!page)
> >  			continue;
> > @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >  		set_pte_at(mm, addr, pte, ptent);
> >  		tlb_remove_tlb_entry(tlb, pte, addr);
> >  	}
> > +
> > +	if (rss) {
> > +		if (current->mm == mm)
> > +			sync_mm_rss(mm);
> > +
> > +		add_mm_counter(mm, MM_SWAPENTS, rss);
> > +	}
> > +
> 
> This looks make sense, but I'm wondering why it can help and if this can help
> real workload.  Let me have an example. Say there is 1G memory, workload uses

void *ptr1 = malloc(len); /* allocator mmap new chunk */
touch_iow_dirty(ptr1, len);
..
..
..
..                      /* swapout happens */
free(ptr1);             /* allocator calls MADV_FREE on the chunk */

void *ptr2 = malloc(len); /* allocator reuses the previous chunk */
touch_iow_dirty(ptr2, len); /* swapin happens only to read garbage, which the application then overwrites */

It's a completely unnecessary cost.


> 800M memory with DONTNEED, there should be no swap. With FREE, workload might
> use more than 1G memory and trigger swap. I thought the case (DONTNEED doesn't
> trigger swap) is more suitable to evaluate the performance of the patch.

I think the above example is clear and a realistic scenario.
Could you give me a more concrete example to test, if you'd like?

Thanks.

> 
> Thanks,
> Shaohua
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH RFC 1/4] mm: throttle MADV_FREE
  2015-02-26  0:42         ` Minchan Kim
@ 2015-02-26 19:04           ` Shaohua Li
  0 siblings, 0 replies; 42+ messages in thread
From: Shaohua Li @ 2015-02-26 19:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Yalin.Wang

On Thu, Feb 26, 2015 at 09:42:06AM +0900, Minchan Kim wrote:
> Hello,
> 
> On Wed, Feb 25, 2015 at 10:37:48AM -0800, Shaohua Li wrote:
> > On Wed, Feb 25, 2015 at 04:11:18PM +0900, Minchan Kim wrote:
> > > On Wed, Feb 25, 2015 at 09:08:09AM +0900, Minchan Kim wrote:
> > > > Hi Michal,
> > > > 
> > > > On Tue, Feb 24, 2015 at 04:43:18PM +0100, Michal Hocko wrote:
> > > > > On Tue 24-02-15 17:18:14, Minchan Kim wrote:
> > > > > > Recently, Shaohua reported that MADV_FREE is much slower than
> > > > > > MADV_DONTNEED in his MADV_FREE bomb test. The reason is many of
> > > > > > applications went to stall with direct reclaim since kswapd's
> > > > > > reclaim speed isn't fast than applications's allocation speed
> > > > > > so that it causes lots of stall and lock contention.
> > > > > 
> > > > > I am not sure I understand this correctly. So the issue is that there is
> > > > > huge number of MADV_FREE on the LRU and they are not close to the tail
> > > > > of the list so the reclaim has to do a lot of work before it starts
> > > > > dropping them?
> > > > 
> > > > No, Shaohua already tested deactivating of hinted pages to head/tail
> > > > of inactive anon LRU and he said it didn't solve his problem.
> > > > I thought main culprit was scanning/rotating/throttling in
> > > > direct reclaim path.
> > > 
> > > I investigated my workload and found most of slowness came from swapin.
> > > 
> > > 1) dontneed: 1,612 swapin
> > > 2) madvfree: 879,585 swapin
> > > 
> > > If we find hinted pages were already swapped out when syscall is called,
> > > it's pointless to keep the pages in pte. Instead, free the cold page
> > > because swapin is more expensive than (alloc page + zeroing).
> > > 
> > > I tested below quick fix and reduced swapin from 879,585 to 1,878.
> > > Elapsed time was
> > > 
> > > 1) dontneed: 6.10user 233.50system 0:50.44elapsed
> > > 2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
> > > 
> > > Although it was not good as throttling, it's better than old and
> > > it's orthogoral with throttling so I hope to merge this first
> > > than arguable throttling. Any comments?
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8921c2..d41ae76d3e54 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -274,7 +274,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > >  	spinlock_t *ptl;
> > >  	pte_t *pte, ptent;
> > >  	struct page *page;
> > > +	swp_entry_t entry;
> > >  	unsigned long next;
> > > +	int rss = 0;
> > >  
> > >  	next = pmd_addr_end(addr, end);
> > >  	if (pmd_trans_huge(*pmd)) {
> > > @@ -293,9 +295,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > >  	for (; addr != end; pte++, addr += PAGE_SIZE) {
> > >  		ptent = *pte;
> > >  
> > > -		if (!pte_present(ptent))
> > > +		if (pte_none(ptent))
> > >  			continue;
> > >  
> > > +		if (!pte_present(ptent)) {
> > > +			entry = pte_to_swp_entry(ptent);
> > > +			if (non_swap_entry(entry))
> > > +				continue;
> > > +			rss--;
> > > +			free_swap_and_cache(entry);
> > > +			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> > > +			continue;
> > > +		}
> > > +
> > >  		page = vm_normal_page(vma, addr, ptent);
> > >  		if (!page)
> > >  			continue;
> > > @@ -326,6 +338,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > >  		set_pte_at(mm, addr, pte, ptent);
> > >  		tlb_remove_tlb_entry(tlb, pte, addr);
> > >  	}
> > > +
> > > +	if (rss) {
> > > +		if (current->mm == mm)
> > > +			sync_mm_rss(mm);
> > > +
> > > +		add_mm_counter(mm, MM_SWAPENTS, rss);
> > > +	}
> > > +
> > 
> > This looks make sense, but I'm wondering why it can help and if this can help
> > real workload.  Let me have an example. Say there is 1G memory, workload uses
> 
> void *ptr1 = malloc(len); /* allocator mmap new chunk */
> touch_iow_dirty(ptr1, len);
> ..
> ..
> ..
> ..                      /* swapout happens */
> free(ptr1);             /* allocator calls MADV_FREE on the chunk */
> 
> void *ptr2 = malloc(len) /* allocator reuses previous chunk */
> touch_iow_dirty(ptr2, len); /* swapin happens to read garbage and application overwrite the garbage */
> 
> It's really unnecessary cost.
> 
> 
> > 800M memory with DONTNEED, there should be no swap. With FREE, workload might
> > use more than 1G memory and trigger swap. I thought the case (DONTNEED doesn't
> > trigger swap) is more suitable to evaluate the performance of the patch.
> 
> I think above example is really clear and possible scenario.
> Could you give me more concrete example to test if you want?

Sorry, no, I don't have a concrete example either. My major concern is a
workload which triggers no swap with DONTNEED but might swap with FREE.

Thanks,
Shaohua


* [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-25  0:08   ` Minchan Kim
  2015-02-25  7:11     ` Minchan Kim
@ 2015-02-27  3:37     ` Wang, Yalin
  2015-02-27  5:28       ` Minchan Kim
  2015-02-27 21:02       ` Michal Hocko
  1 sibling, 2 replies; 42+ messages in thread
From: Wang, Yalin @ 2015-02-27  3:37 UTC (permalink / raw)
  To: 'Minchan Kim',
	Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

This patch adds ClearPageDirty() to clear the anon page's dirty flag.
The anon page's mapcount must be 1, so that the page is only used by
the current process and not shared with another process via fork().
If we don't clear the dirty flag for such an anon page, the page will
never be treated as freeable.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 mm/madvise.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..257925a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 
 		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
+		if (!page || !PageAnon(page) || !trylock_page(page))
 			continue;
 
 		if (PageSwapCache(page)) {
-			if (!trylock_page(page))
+			if (!try_to_free_swap(page))
 				continue;
-
-			if (!try_to_free_swap(page)) {
-				unlock_page(page);
-				continue;
-			}
-
-			ClearPageDirty(page);
-			unlock_page(page);
 		}
 
+		if (page_mapcount(page) == 1)
+			ClearPageDirty(page);
+		unlock_page(page);
 		/*
 		 * Some of architecture(ex, PPC) don't update TLB
 		 * with set_pte_at and tlb_remove_tlb_entry so for
-- 
2.2.2


* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27  3:37     ` [RFC] mm: change mm_advise_free to clear page dirty Wang, Yalin
@ 2015-02-27  5:28       ` Minchan Kim
  2015-02-27  5:48         ` Wang, Yalin
  2015-02-27 21:02       ` Michal Hocko
  1 sibling, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-27  5:28 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

Hello,

On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> This patch add ClearPageDirty() to clear AnonPage dirty flag,
> the Anonpage mapcount must be 1, so that this page is only used by
> the current process, not shared by other process like fork().
> if not clear page dirty for this anon page, the page will never be
> treated as freeable.

In the case of an anonymous page, it gets PG_dirty when the VM adds it to
the swap cache, and the flag is cleared in clear_page_dirty_for_io. That's
why I added ClearPageDirty for pages we find in the swap cache.
What case am I missing? It would be easier to understand if you
described a specific scenario.

Thanks.

> 
> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> ---
>  mm/madvise.c | 15 +++++----------
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..257925a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			continue;
>  
>  		page = vm_normal_page(vma, addr, ptent);
> -		if (!page)
> +		if (!page || !PageAnon(page) || !trylock_page(page))
>  			continue;
>  
>  		if (PageSwapCache(page)) {
> -			if (!trylock_page(page))
> +			if (!try_to_free_swap(page))
>  				continue;
> -
> -			if (!try_to_free_swap(page)) {
> -				unlock_page(page);
> -				continue;
> -			}
> -
> -			ClearPageDirty(page);
> -			unlock_page(page);
>  		}
>  
> +		if (page_mapcount(page) == 1)
> +			ClearPageDirty(page);
> +		unlock_page(page);
>  		/*
>  		 * Some of architecture(ex, PPC) don't update TLB
>  		 * with set_pte_at and tlb_remove_tlb_entry so for
> -- 
> 2.2.2
> 

-- 
Kind regards,
Minchan Kim


* RE: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27  5:28       ` Minchan Kim
@ 2015-02-27  5:48         ` Wang, Yalin
  2015-02-27  6:44           ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Wang, Yalin @ 2015-02-27  5:48 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

> -----Original Message-----
> From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> Sent: Friday, February 27, 2015 1:28 PM
> To: Wang, Yalin
> Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> 
> Hello,
> 
> On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > the Anonpage mapcount must be 1, so that this page is only used by
> > the current process, not shared by other process like fork().
> > if not clear page dirty for this anon page, the page will never be
> > treated as freeable.
> 
> In case of anonymous page, it has PG_dirty when VM adds it to
> swap cache and clear it in clear_page_dirty_for_io. That's why
> I added ClearPageDirty if we found it in swapcache.
> What case am I missing? It would be better to understand if you
> describe specific scenario.
> 
> Thanks.
> 
> >
> > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > ---
> >  mm/madvise.c | 15 +++++----------
> >  1 file changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8..257925a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> unsigned long addr,
> >  			continue;
> >
> >  		page = vm_normal_page(vma, addr, ptent);
> > -		if (!page)
> > +		if (!page || !PageAnon(page) || !trylock_page(page))
> >  			continue;
> >
> >  		if (PageSwapCache(page)) {
> > -			if (!trylock_page(page))
> > +			if (!try_to_free_swap(page))
> >  				continue;
> > -
> > -			if (!try_to_free_swap(page)) {
> > -				unlock_page(page);
> > -				continue;
> > -			}
> > -
> > -			ClearPageDirty(page);
> > -			unlock_page(page);
> >  		}
> >
> > +		if (page_mapcount(page) == 1)
> > +			ClearPageDirty(page);
> > +		unlock_page(page);
> >  		/*
> >  		 * Some of architecture(ex, PPC) don't update TLB
> >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > --
Yes, for a page which is in the swap cache, that is correct.
But an anon page which is not in the swap cache is always
PageDirty(), so we should also clear the dirty bit to make it freeable.

Another problem is that an anon page can be shared by more than one process.
This happens after fork(), where the anon page is copy-on-write.
In that case, we should not clear the page's dirty flag.
Otherwise, the other process will get a zero page if the page is freed by
the reclaim path, which is not correct for a process that didn't call the
MADV_FREE syscall.

Thanks




* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27  5:48         ` Wang, Yalin
@ 2015-02-27  6:44           ` Minchan Kim
  2015-02-27  7:50             ` Wang, Yalin
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-27  6:44 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> > Sent: Friday, February 27, 2015 1:28 PM
> > To: Wang, Yalin
> > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > 
> > Hello,
> > 
> > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > the Anonpage mapcount must be 1, so that this page is only used by
> > > the current process, not shared by other process like fork().
> > > if not clear page dirty for this anon page, the page will never be
> > > treated as freeable.
> > 
> > In case of anonymous page, it has PG_dirty when VM adds it to
> > swap cache and clear it in clear_page_dirty_for_io. That's why
> > I added ClearPageDirty if we found it in swapcache.
> > What case am I missing? It would be better to understand if you
> > describe specific scenario.
> > 
> > Thanks.
> > 
> > >
> > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > ---
> > >  mm/madvise.c | 15 +++++----------
> > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8..257925a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > unsigned long addr,
> > >  			continue;
> > >
> > >  		page = vm_normal_page(vma, addr, ptent);
> > > -		if (!page)
> > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > >  			continue;
> > >
> > >  		if (PageSwapCache(page)) {
> > > -			if (!trylock_page(page))
> > > +			if (!try_to_free_swap(page))
> > >  				continue;
> > > -
> > > -			if (!try_to_free_swap(page)) {
> > > -				unlock_page(page);
> > > -				continue;
> > > -			}
> > > -
> > > -			ClearPageDirty(page);
> > > -			unlock_page(page);
> > >  		}
> > >
> > > +		if (page_mapcount(page) == 1)
> > > +			ClearPageDirty(page);
> > > +		unlock_page(page);
> > >  		/*
> > >  		 * Some of architecture(ex, PPC) don't update TLB
> > >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > > --
> Yes, for page which is in SwapCache, it is correct,
> But for anon page which is not in SwapCache, it is always
> PageDirty(), so we should also clear dirty bit to make it freeable,

No. Every anon page starts out !PageDirty; it gets PG_dirty
only when it's added to the swap cache. If vm_swap_full turns on,
a page in the swap cache can get PG_dirty again via try_to_free_swap.
So, are you concerned about already swapped-out pages when MADV_FREE is
called? If so, please look at my patch:

https://lkml.org/lkml/2015/2/25/43

It will zap the swapped-out pages, so that shouldn't be an issue any more.

> 
> Another problem  is that if an anon page is shared by more than one process,
> This happened when fork(), the anon page will be copy on write,
> In this case, we should not clear page dirty,
> This is not correct for other process which don't call MADV_FREE syscall.

You mean we shouldn't inherit MADV_FREE attribute?
Why?

parent:
ptr1 = malloc(len);
        -> allocator calls mmap(len);
memset(ptr1, 'a', len);
free(ptr1);
        -> allocator calss madvise_free(ptr1, len);
..
..
        -> VM discard hinted pages

fork();

child:

ptr2 = malloc(len);
        -> allocator reuses the chunk allocated by the parent
So the child will see zero pages through ptr2, but since it doesn't write
anything there first, garbage or zero pages are equally fine for it.


> 
> Thanks
> 
> 

-- 
Kind regards,
Minchan Kim


* RE: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27  6:44           ` Minchan Kim
@ 2015-02-27  7:50             ` Wang, Yalin
  2015-02-27 13:37               ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Wang, Yalin @ 2015-02-27  7:50 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

> -----Original Message-----
> From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> Sent: Friday, February 27, 2015 2:44 PM
> To: Wang, Yalin
> Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> 
> On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > -----Original Message-----
> > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> Kim
> > > Sent: Friday, February 27, 2015 1:28 PM
> > > To: Wang, Yalin
> > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > >
> > > Hello,
> > >
> > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > the Anonpage mapcount must be 1, so that this page is only used by
> > > > the current process, not shared by other process like fork().
> > > > if not clear page dirty for this anon page, the page will never be
> > > > treated as freeable.
> > >
> > > In case of anonymous page, it has PG_dirty when VM adds it to
> > > swap cache and clear it in clear_page_dirty_for_io. That's why
> > > I added ClearPageDirty if we found it in swapcache.
> > > What case am I missing? It would be better to understand if you
> > > describe specific scenario.
> > >
> > > Thanks.
> > >
> > > >
> > > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > > ---
> > > >  mm/madvise.c | 15 +++++----------
> > > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 6d0fcb8..257925a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > unsigned long addr,
> > > >  			continue;
> > > >
> > > >  		page = vm_normal_page(vma, addr, ptent);
> > > > -		if (!page)
> > > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > > >  			continue;
> > > >
> > > >  		if (PageSwapCache(page)) {
> > > > -			if (!trylock_page(page))
> > > > +			if (!try_to_free_swap(page))
> > > >  				continue;
> > > > -
> > > > -			if (!try_to_free_swap(page)) {
> > > > -				unlock_page(page);
> > > > -				continue;
> > > > -			}
> > > > -
> > > > -			ClearPageDirty(page);
> > > > -			unlock_page(page);
> > > >  		}
> > > >
> > > > +		if (page_mapcount(page) == 1)
> > > > +			ClearPageDirty(page);
> > > > +		unlock_page(page);
> > > >  		/*
> > > >  		 * Some of architecture(ex, PPC) don't update TLB
> > > >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > > > --
> > Yes, for page which is in SwapCache, it is correct,
> > But for anon page which is not in SwapCache, it is always
> > PageDirty(), so we should also clear dirty bit to make it freeable,
> 
> No. Every anon page starts from !PageDirty and it has PG_dirty
> only when it's addeded into swap cache. If vm_swap_full turns on,
> a page in swap cache could have PG_dirty via try_to_free_swap again.

Mmm..
Sometimes you can see an anon page that is PageDirty() but not in the
swap cache. For example, handle_pte_fault()->do_swap_page()->try_to_free_swap():
at that point the page has been deleted from the swap cache and is marked
PageDirty().

> So, Do you have concern about swapped-out pages when MADV_FREE is
> called? If so, please look at my patch.
> 
> https://lkml.org/lkml/2015/2/25/43
> 
> It will zap the swapped out page. So, this is not a issue any more?
> 
> >
> > Another problem  is that if an anon page is shared by more than one
> process,
> > This happened when fork(), the anon page will be copy on write,
> > In this case, we should not clear page dirty,
> > This is not correct for other process which don't call MADV_FREE syscall.
> 
> You mean we shouldn't inherit MADV_FREE attribute?
> Why?

Is it correct behavior if the code looks like this:

Parent:
ptr1 = malloc(len);
memset(ptr1, 'a', len);
fork();
if (I am parent)
	madvise_free(ptr1, len);

child:
sleep(10);
parse_data(ptr1, len);  // child may see zero, not 'a';
			// is that the behavior the programmer wants?

Because the child didn't call madvise_free(), it should see 'a', not a zero page.
Shouldn't it?
Thanks







^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27  7:50             ` Wang, Yalin
@ 2015-02-27 13:37               ` Minchan Kim
  2015-02-28 13:50                 ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-27 13:37 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

On Fri, Feb 27, 2015 at 03:50:29PM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> > Sent: Friday, February 27, 2015 2:44 PM
> > To: Wang, Yalin
> > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > 
> > On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > > -----Original Message-----
> > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> > Kim
> > > > Sent: Friday, February 27, 2015 1:28 PM
> > > > To: Wang, Yalin
> > > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > >
> > > > Hello,
> > > >
> > > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > > the Anonpage mapcount must be 1, so that this page is only used by
> > > > > the current process, not shared by other process like fork().
> > > > > if not clear page dirty for this anon page, the page will never be
> > > > > treated as freeable.
> > > >
> > > > In case of anonymous page, it has PG_dirty when VM adds it to
> > > > swap cache and clear it in clear_page_dirty_for_io. That's why
> > > > I added ClearPageDirty if we found it in swapcache.
> > > > What case am I missing? It would be better to understand if you
> > > > describe specific scenario.
> > > >
> > > > Thanks.
> > > >
> > > > >
> > > > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > > > ---
> > > > >  mm/madvise.c | 15 +++++----------
> > > > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > index 6d0fcb8..257925a 100644
> > > > > --- a/mm/madvise.c
> > > > > +++ b/mm/madvise.c
> > > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > > unsigned long addr,
> > > > >  			continue;
> > > > >
> > > > >  		page = vm_normal_page(vma, addr, ptent);
> > > > > -		if (!page)
> > > > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > > > >  			continue;
> > > > >
> > > > >  		if (PageSwapCache(page)) {
> > > > > -			if (!trylock_page(page))
> > > > > +			if (!try_to_free_swap(page))
> > > > >  				continue;
> > > > > -
> > > > > -			if (!try_to_free_swap(page)) {
> > > > > -				unlock_page(page);
> > > > > -				continue;
> > > > > -			}
> > > > > -
> > > > > -			ClearPageDirty(page);
> > > > > -			unlock_page(page);
> > > > >  		}
> > > > >
> > > > > +		if (page_mapcount(page) == 1)
> > > > > +			ClearPageDirty(page);
> > > > > +		unlock_page(page);
> > > > >  		/*
> > > > >  		 * Some of architecture(ex, PPC) don't update TLB
> > > > >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > > > > --
> > > Yes, for page which is in SwapCache, it is correct,
> > > But for anon page which is not in SwapCache, it is always
> > > PageDirty(), so we should also clear dirty bit to make it freeable,
> > 
> > No. Every anon page starts from !PageDirty and it has PG_dirty
> > only when it's addeded into swap cache. If vm_swap_full turns on,
> > a page in swap cache could have PG_dirty via try_to_free_swap again.
> 
> mmm..
> sometimes you can see an anon page PageDirty(), but it is not in swapcache,
> for example, handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
> at this time, the page is deleted from swapcache and is marked PageDirty(),

That's what I missed. It's clear and would be a simple patch, so
could you send a patch to fix this issue with a detailed description
like the above?

> 
> 
> > So, Do you have concern about swapped-out pages when MADV_FREE is
> > called? If so, please look at my patch.
> > 
> > https://lkml.org/lkml/2015/2/25/43
> > 
> > It will zap the swapped out page. So, this is not a issue any more?
> > 
> > >
> > > Another problem  is that if an anon page is shared by more than one
> > process,
> > > This happened when fork(), the anon page will be copy on write,
> > > In this case, we should not clear page dirty,
> > > This is not correct for other process which don't call MADV_FREE syscall.
> > 
> > You mean we shouldn't inherit MADV_FREE attribute?
> > Why?
> 
> Is it correct behavior if code like this:
> 
> Parent:
> ptr1 = malloc(len);
> memset(ptr1, 'a', len);
> fork();
> if (I am parent)
> 	madvise_free(ptr1, len);
> 
> child:
> sleep(10);
> parse_data(ptr1, len);  // child may see zero, not 'a',
> 			// is it the right behavior that the programer want?
> 
> Because child don't call madvise_free(), so it should see 'a', not zero page.
> Isn't it ?

You're absolutely right. Thanks.
But I doubt your fix is the best one. Most forks exec soon afterward,
so it's not a good idea to make MADV_FREE a no-op just because the
hinted pages were shared when the syscall was called.
How about checking whether the page is shared in the reclaim path?
If it is still shared, we shouldn't discard it.

Thanks.

> Thanks
> 
> 
> 
> 
> 
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27  3:37     ` [RFC] mm: change mm_advise_free to clear page dirty Wang, Yalin
  2015-02-27  5:28       ` Minchan Kim
@ 2015-02-27 21:02       ` Michal Hocko
  2015-02-28  2:11         ` Wang, Yalin
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2015-02-27 21:02 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Minchan Kim',
	Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li

On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> This patch add ClearPageDirty() to clear AnonPage dirty flag,
> the Anonpage mapcount must be 1, so that this page is only used by
> the current process, not shared by other process like fork().
> if not clear page dirty for this anon page, the page will never be
> treated as freeable.

Very well spotted! I hadn't noticed that during the review.

> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> ---
>  mm/madvise.c | 15 +++++----------
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..257925a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			continue;
>  
>  		page = vm_normal_page(vma, addr, ptent);
> -		if (!page)
> +		if (!page || !PageAnon(page) || !trylock_page(page))
>  			continue;

The PageAnon check seems redundant because we do not allow
MADV_FREE on any !anon private mappings AFAIR.

>  
>  		if (PageSwapCache(page)) {
> -			if (!trylock_page(page))
> +			if (!try_to_free_swap(page))
>  				continue;

You need to unlock the page here.

> -
> -			if (!try_to_free_swap(page)) {
> -				unlock_page(page);
> -				continue;
> -			}
> -
> -			ClearPageDirty(page);
> -			unlock_page(page);
>  		}
>  
> +		if (page_mapcount(page) == 1)
> +			ClearPageDirty(page);

Please add a comment about why we need to ClearPageDirty even for
!PageSwapCache pages. Anon pages are usually not marked dirty AFAIR. The
reason seems to be a racing try_to_free_swap, which sets the page that way
(although I do not remember why we are doing that in the first
place...)

> +		unlock_page(page);
>  		/*
>  		 * Some of architecture(ex, PPC) don't update TLB
>  		 * with set_pte_at and tlb_remove_tlb_entry so for
> -- 
> 2.2.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27 21:02       ` Michal Hocko
@ 2015-02-28  2:11         ` Wang, Yalin
  2015-02-28  6:01           ` [RFC V2] " Wang, Yalin
                             ` (2 more replies)
  0 siblings, 3 replies; 42+ messages in thread
From: Wang, Yalin @ 2015-02-28  2:11 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Minchan Kim',
	Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li

> -----Original Message-----
> From: Michal Hocko [mailto:mstsxfx@gmail.com] On Behalf Of Michal Hocko
> Sent: Saturday, February 28, 2015 5:03 AM
> To: Wang, Yalin
> Cc: 'Minchan Kim'; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> 
> On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > the Anonpage mapcount must be 1, so that this page is only used by
> > the current process, not shared by other process like fork().
> > if not clear page dirty for this anon page, the page will never be
> > treated as freeable.
> 
> Very well spotted! I haven't noticed that during the review.
> 
> > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > ---
> >  mm/madvise.c | 15 +++++----------
> >  1 file changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8..257925a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> unsigned long addr,
> >  			continue;
> >
> >  		page = vm_normal_page(vma, addr, ptent);
> > -		if (!page)
> > +		if (!page || !PageAnon(page) || !trylock_page(page))
> >  			continue;
> 
> PageAnon check seems to be redundant because we are not allowing
> MADV_FREE on any !anon private mappings AFAIR.
I only see this check:
/* MADV_FREE works for only anon vma at the moment */
	if (vma->vm_file)
		return -EINVAL;

but for a file-private mapping there are sometimes anon pages too; do we need
to change it to something like this:
	if (vma->vm_flags & VM_SHARED)
		return -EINVAL;
> >
> >  		if (PageSwapCache(page)) {
> > -			if (!trylock_page(page))
> > +			if (!try_to_free_swap(page))
> >  				continue;
> 
> You need to unlock the page here.
Good spot.

> > -
> > -			if (!try_to_free_swap(page)) {
> > -				unlock_page(page);
> > -				continue;
> > -			}
> > -
> > -			ClearPageDirty(page);
> > -			unlock_page(page);
> >  		}
> >
> > +		if (page_mapcount(page) == 1)
> > +			ClearPageDirty(page);
> 
> Please add a comment about why we need to ClearPageDirty even
> !PageSwapCache. Anon pages are usually not marked dirty AFAIR. The
> reason seem to be racing try_to_free_swap which sets the page that way
> (although I do not seem to remember why are we doing that in the first
> place...)
> 
Using page_mapcount to judge whether a page's dirty flag can be cleared
doesn't seem like a very good solution, because we don't know how many
ptes share this page. I am thinking about whether there is a good
solution for shared anon pages.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC V2] mm: change mm_advise_free to clear page dirty
  2015-02-28  2:11         ` Wang, Yalin
@ 2015-02-28  6:01           ` Wang, Yalin
  2015-03-02 12:38             ` Michal Hocko
  2015-02-28 13:55           ` [RFC] " Minchan Kim
  2015-03-02 12:33           ` Michal Hocko
  2 siblings, 1 reply; 42+ messages in thread
From: Wang, Yalin @ 2015-02-28  6:01 UTC (permalink / raw)
  To: 'Michal Hocko'
  Cc: 'Minchan Kim', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li'

This patch adds ClearPageDirty() to clear the anon page dirty flag;
if we do not clear page dirty for an anon page, the page will never be
treated as freeable. We also make sure a shared anon page is not
freeable; we implement this by dirtying every copied anon page pte,
which makes sure the anon page will not become freeable unless
every process sharing the page calls the madvise_free syscall.

Another change is that we also handle file-mapped pages:
we just clear the pte young bit for file mappings. This is useful
because it makes the reclaim path move file pages into the inactive
lru list aggressively.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 mm/madvise.c | 26 +++++++++++++++-----------
 mm/memory.c  | 12 ++++++++++--
 2 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..712756b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -299,30 +299,38 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		page = vm_normal_page(vma, addr, ptent);
 		if (!page)
 			continue;
+		if (!PageAnon(page))
+			goto set_pte;
+		if (!trylock_page(page))
+			continue;
 
 		if (PageSwapCache(page)) {
-			if (!trylock_page(page))
-				continue;
-
 			if (!try_to_free_swap(page)) {
 				unlock_page(page);
 				continue;
 			}
-
-			ClearPageDirty(page);
-			unlock_page(page);
 		}
 
 		/*
+		 * we clear the page dirty flag for an anon page whether or not
+		 * it is in the swap cache; an anon page not in the swap cache
+		 * can also have the dirty flag set, which happens when the page
+		 * is removed from the swap cache by try_to_free_swap()
+		 */
+		ClearPageDirty(page);
+		unlock_page(page);
+		/*
 		 * Some of architecture(ex, PPC) don't update TLB
 		 * with set_pte_at and tlb_remove_tlb_entry so for
 		 * the portability, remap the pte with old|clean
 		 * after pte clearing.
 		 */
+set_pte:
 		ptent = ptep_get_and_clear_full(mm, addr, pte,
 						tlb->fullmm);
 		ptent = pte_mkold(ptent);
-		ptent = pte_mkclean(ptent);
+		if (PageAnon(page))
+			ptent = pte_mkclean(ptent);
 		set_pte_at(mm, addr, pte, ptent);
 		tlb_remove_tlb_entry(tlb, pte, addr);
 	}
@@ -364,10 +372,6 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
 		return -EINVAL;
 
-	/* MADV_FREE works for only anon vma at the moment */
-	if (vma->vm_file)
-		return -EINVAL;
-
 	start = max(vma->vm_start, start_addr);
 	if (start >= vma->vm_end)
 		return -EINVAL;
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..3d949b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,10 +874,18 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
-		if (PageAnon(page))
+		if (PageAnon(page)) {
+			/*
+			 * we dirty the copied pte for an anon page;
+			 * this is useful for madvise_free_pte_range()
+			 * because it prevents a shared anon page from being
+			 * freed by the madvise_free syscall
+			 */
+			pte = pte_mkdirty(pte);
 			rss[MM_ANONPAGES]++;
-		else
+		} else {
 			rss[MM_FILEPAGES]++;
+		}
 	}
 
 out_set_pte:
-- 
2.2.2

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-27 13:37               ` Minchan Kim
@ 2015-02-28 13:50                 ` Minchan Kim
  2015-03-02  1:59                   ` Wang, Yalin
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-28 13:50 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

On Fri, Feb 27, 2015 at 10:37:14PM +0900, Minchan Kim wrote:
> On Fri, Feb 27, 2015 at 03:50:29PM +0800, Wang, Yalin wrote:
> > > -----Original Message-----
> > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> > > Sent: Friday, February 27, 2015 2:44 PM
> > > To: Wang, Yalin
> > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > 
> > > On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > > > -----Original Message-----
> > > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> > > Kim
> > > > > Sent: Friday, February 27, 2015 1:28 PM
> > > > > To: Wang, Yalin
> > > > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > > > the Anonpage mapcount must be 1, so that this page is only used by
> > > > > > the current process, not shared by other process like fork().
> > > > > > if not clear page dirty for this anon page, the page will never be
> > > > > > treated as freeable.
> > > > >
> > > > > In case of anonymous page, it has PG_dirty when VM adds it to
> > > > > swap cache and clear it in clear_page_dirty_for_io. That's why
> > > > > I added ClearPageDirty if we found it in swapcache.
> > > > > What case am I missing? It would be better to understand if you
> > > > > describe specific scenario.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > >
> > > > > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > > > > ---
> > > > > >  mm/madvise.c | 15 +++++----------
> > > > > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > index 6d0fcb8..257925a 100644
> > > > > > --- a/mm/madvise.c
> > > > > > +++ b/mm/madvise.c
> > > > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > > > unsigned long addr,
> > > > > >  			continue;
> > > > > >
> > > > > >  		page = vm_normal_page(vma, addr, ptent);
> > > > > > -		if (!page)
> > > > > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > > > > >  			continue;
> > > > > >
> > > > > >  		if (PageSwapCache(page)) {
> > > > > > -			if (!trylock_page(page))
> > > > > > +			if (!try_to_free_swap(page))
> > > > > >  				continue;
> > > > > > -
> > > > > > -			if (!try_to_free_swap(page)) {
> > > > > > -				unlock_page(page);
> > > > > > -				continue;
> > > > > > -			}
> > > > > > -
> > > > > > -			ClearPageDirty(page);
> > > > > > -			unlock_page(page);
> > > > > >  		}
> > > > > >
> > > > > > +		if (page_mapcount(page) == 1)
> > > > > > +			ClearPageDirty(page);
> > > > > > +		unlock_page(page);
> > > > > >  		/*
> > > > > >  		 * Some of architecture(ex, PPC) don't update TLB
> > > > > >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > > > > > --
> > > > Yes, for page which is in SwapCache, it is correct,
> > > > But for anon page which is not in SwapCache, it is always
> > > > PageDirty(), so we should also clear dirty bit to make it freeable,
> > > 
> > > No. Every anon page starts from !PageDirty and it has PG_dirty
> > > only when it's addeded into swap cache. If vm_swap_full turns on,
> > > a page in swap cache could have PG_dirty via try_to_free_swap again.
> > 
> > mmm..
> > sometimes you can see an anon page PageDirty(), but it is not in swapcache,
> > for example, handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
> > at this time, the page is deleted from swapcache and is marked PageDirty(),
> 
> That's what I missed. It's clear and would be simple patch so
> could you send a patch to fix this issue with detailed description
> like above?
> 
> > 
> > 
> > > So, Do you have concern about swapped-out pages when MADV_FREE is
> > > called? If so, please look at my patch.
> > > 
> > > https://lkml.org/lkml/2015/2/25/43
> > > 
> > > It will zap the swapped out page. So, this is not a issue any more?
> > > 
> > > >
> > > > Another problem  is that if an anon page is shared by more than one
> > > process,
> > > > This happened when fork(), the anon page will be copy on write,
> > > > In this case, we should not clear page dirty,
> > > > This is not correct for other process which don't call MADV_FREE syscall.
> > > 
> > > You mean we shouldn't inherit MADV_FREE attribute?
> > > Why?
> > 
> > Is it correct behavior if code like this:
> > 
> > Parent:
> > ptr1 = malloc(len);
> > memset(ptr1, 'a', len);
> > fork();
> > if (I am parent)
> > 	madvise_free(ptr1, len);
> > 
> > child:
> > sleep(10);
> > parse_data(ptr1, len);  // child may see zero, not 'a',
> > 			// is it the right behavior that the programer want?
> > 
> > Because child don't call madvise_free(), so it should see 'a', not zero page.
> > Isn't it ?
> 
> You're absolutely right. Thanks.
> But I doubt your fix is best. Most of fork will do exec soonish so
> it's not a good idea to make MADV_FREE void even though hinted pages
> are shared when the syscall was called.
> How about checking the page is shared or not in reclaim path?
> If it is still shared, we shouldn't discard it.

I got confused. Looking at copy_one_pte, it copies from src_pte
and does not clear the dirty bit if it's not a shared mapping.
If so, in your example the child's pte has the dirty bit on while the
parent's is clean thanks to madvise_free, so the VM shouldn't discard
the page. No?

> 
> Thanks.
> 
> > Thanks
> > 
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> Kind regards,
> Minchan Kim

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-28  2:11         ` Wang, Yalin
  2015-02-28  6:01           ` [RFC V2] " Wang, Yalin
@ 2015-02-28 13:55           ` Minchan Kim
  2015-03-02  1:53             ` Wang, Yalin
  2015-03-02 12:33           ` Michal Hocko
  2 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-02-28 13:55 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Michal Hocko',
	Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li

On Sat, Feb 28, 2015 at 10:11:13AM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Michal Hocko [mailto:mstsxfx@gmail.com] On Behalf Of Michal Hocko
> > Sent: Saturday, February 28, 2015 5:03 AM
> > To: Wang, Yalin
> > Cc: 'Minchan Kim'; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > 
> > On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > the Anonpage mapcount must be 1, so that this page is only used by
> > > the current process, not shared by other process like fork().
> > > if not clear page dirty for this anon page, the page will never be
> > > treated as freeable.
> > 
> > Very well spotted! I haven't noticed that during the review.
> > 
> > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > ---
> > >  mm/madvise.c | 15 +++++----------
> > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8..257925a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > unsigned long addr,
> > >  			continue;
> > >
> > >  		page = vm_normal_page(vma, addr, ptent);
> > > -		if (!page)
> > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > >  			continue;
> > 
> > PageAnon check seems to be redundant because we are not allowing
> > MADV_FREE on any !anon private mappings AFAIR.
> I only see this check:
> /* MADV_FREE works for only anon vma at the moment */
> 	if (vma->vm_file)
> 		return -EINVAL;
> 
> but for file private map, there are also AnonPage sometimes, do we need change
> to like this:
> 	if (vma->vm_flags & VM_SHARED)
> 		return -EINVAL;

I couldn't understand your point. At this stage, we intentionally
disabled madvise_free on file-mapped areas (AFAIR, some people tried
it a long time ago but it had many issues, so it was dropped).
So how can a file-private mapping reach this code?
Could you elaborate on why we need the PageAnon check
here?


> > >
> > >  		if (PageSwapCache(page)) {
> > > -			if (!trylock_page(page))
> > > +			if (!try_to_free_swap(page))
> > >  				continue;
> > 
> > You need to unlock the page here.
> Good spot.
> 
> > > -
> > > -			if (!try_to_free_swap(page)) {
> > > -				unlock_page(page);
> > > -				continue;
> > > -			}
> > > -
> > > -			ClearPageDirty(page);
> > > -			unlock_page(page);
> > >  		}
> > >
> > > +		if (page_mapcount(page) == 1)
> > > +			ClearPageDirty(page);
> > 
> > Please add a comment about why we need to ClearPageDirty even
> > !PageSwapCache. Anon pages are usually not marked dirty AFAIR. The
> > reason seem to be racing try_to_free_swap which sets the page that way
> > (although I do not seem to remember why are we doing that in the first
> > place...)
> > 
> Use page_mapcount to judge if a page can be clear dirty flag seems
> Not a very good solution, that is because we don't know how many
> ptes are share this page, I am thinking if there is some good solution
> For shared AnonPage.
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-28 13:55           ` [RFC] " Minchan Kim
@ 2015-03-02  1:53             ` Wang, Yalin
  0 siblings, 0 replies; 42+ messages in thread
From: Wang, Yalin @ 2015-03-02  1:53 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: 'Michal Hocko',
	Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li

> -----Original Message-----
> From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> Sent: Saturday, February 28, 2015 9:56 PM
> To: Wang, Yalin
> Cc: 'Michal Hocko'; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> 
> On Sat, Feb 28, 2015 at 10:11:13AM +0800, Wang, Yalin wrote:
> > > -----Original Message-----
> > > From: Michal Hocko [mailto:mstsxfx@gmail.com] On Behalf Of Michal Hocko
> > > Sent: Saturday, February 28, 2015 5:03 AM
> > > To: Wang, Yalin
> > > Cc: 'Minchan Kim'; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > >
> > > On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > the Anonpage mapcount must be 1, so that this page is only used by
> > > > the current process, not shared by other process like fork().
> > > > if not clear page dirty for this anon page, the page will never be
> > > > treated as freeable.
> > >
> > > Very well spotted! I haven't noticed that during the review.
> > >
> > > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > > ---
> > > >  mm/madvise.c | 15 +++++----------
> > > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 6d0fcb8..257925a 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > > unsigned long addr,
> > > >  			continue;
> > > >
> > > >  		page = vm_normal_page(vma, addr, ptent);
> > > > -		if (!page)
> > > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > > >  			continue;
> > >
> > > PageAnon check seems to be redundant because we are not allowing
> > > MADV_FREE on any !anon private mappings AFAIR.
> > I only see this check:
> > /* MADV_FREE works for only anon vma at the moment */
> > 	if (vma->vm_file)
> > 		return -EINVAL;
> >
> > but for file private map, there are also AnonPage sometimes, do we need
> change
> > to like this:
> > 	if (vma->vm_flags & VM_SHARED)
> > 		return -EINVAL;
> 
> I couldn't understand your point. In this stage, we intentionally
> disabled madvise_free on file mapped area(AFAIRC, some guys tried
> it long time ago but it had many issues so dropped).
> So, how can file-private mmaped can reach this code?
> Could you elaborate it more about that why we need PageAnon check
> in here?
> 
I sent a new patch:
[RFC V2] mm: change mm_advise_free to clear page dirty
Please have a look at it.
Thanks for your comments!



^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-28 13:50                 ` Minchan Kim
@ 2015-03-02  1:59                   ` Wang, Yalin
  2015-03-03  0:42                     ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Wang, Yalin @ 2015-03-02  1:59 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

> -----Original Message-----
> From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> Sent: Saturday, February 28, 2015 9:50 PM
> To: Wang, Yalin
> Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> 
> On Fri, Feb 27, 2015 at 10:37:14PM +0900, Minchan Kim wrote:
> > On Fri, Feb 27, 2015 at 03:50:29PM +0800, Wang, Yalin wrote:
> > > > -----Original Message-----
> > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> Kim
> > > > Sent: Friday, February 27, 2015 2:44 PM
> > > > To: Wang, Yalin
> > > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > >
> > > > On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > > > > -----Original Message-----
> > > > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of
> Minchan
> > > > Kim
> > > > > > Sent: Friday, February 27, 2015 1:28 PM
> > > > > > To: Wang, Yalin
> > > > > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org;
> linux-
> > > > > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua
> Li
> > > > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > > > > the Anonpage mapcount must be 1, so that this page is only used
> by
> > > > > > > the current process, not shared by other process like fork().
> > > > > > > if not clear page dirty for this anon page, the page will never
> be
> > > > > > > treated as freeable.
> > > > > >
> > > > > > In case of anonymous page, it has PG_dirty when VM adds it to
> > > > > > swap cache and clear it in clear_page_dirty_for_io. That's why
> > > > > > I added ClearPageDirty if we found it in swapcache.
> > > > > > What case am I missing? It would be better to understand if you
> > > > > > describe specific scenario.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > > > > > ---
> > > > > > >  mm/madvise.c | 15 +++++----------
> > > > > > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > > index 6d0fcb8..257925a 100644
> > > > > > > --- a/mm/madvise.c
> > > > > > > +++ b/mm/madvise.c
> > > > > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t
> *pmd,
> > > > > > unsigned long addr,
> > > > > > >  			continue;
> > > > > > >
> > > > > > >  		page = vm_normal_page(vma, addr, ptent);
> > > > > > > -		if (!page)
> > > > > > > +		if (!page || !PageAnon(page)
> || !trylock_page(page))
> > > > > > >  			continue;
> > > > > > >
> > > > > > >  		if (PageSwapCache(page)) {
> > > > > > > -			if (!trylock_page(page))
> > > > > > > +			if (!try_to_free_swap(page))
> > > > > > >  				continue;
> > > > > > > -
> > > > > > > -			if (!try_to_free_swap(page)) {
> > > > > > > -				unlock_page(page);
> > > > > > > -				continue;
> > > > > > > -			}
> > > > > > > -
> > > > > > > -			ClearPageDirty(page);
> > > > > > > -			unlock_page(page);
> > > > > > >  		}
> > > > > > >
> > > > > > > +		if (page_mapcount(page) == 1)
> > > > > > > +			ClearPageDirty(page);
> > > > > > > +		unlock_page(page);
> > > > > > >  		/*
> > > > > > >  		 * Some of architecture(ex, PPC) don't update TLB
> > > > > > >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > > > > > > --
> > > > > Yes, for page which is in SwapCache, it is correct,
> > > > > But for anon page which is not in SwapCache, it is always
> > > > > PageDirty(), so we should also clear dirty bit to make it freeable,
> > > >
> > > > No. Every anon page starts from !PageDirty and it has PG_dirty
> > > > only when it's added into the swap cache. If vm_swap_full turns on,
> > > > a page in swap cache could have PG_dirty via try_to_free_swap again.
> > >
> > > mmm..
> > > sometimes you can see an anon page PageDirty(), but it is not in
> swapcache,
> > > for example, handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
> > > at this time, the page is deleted from swapcache and is marked
> PageDirty(),
> >
> > That's what I missed. It's clear and would be a simple patch, so
> > could you send a patch to fix this issue with a detailed description
> > like the above?
> >
> > >
> > >
> > > > So, Do you have concern about swapped-out pages when MADV_FREE is
> > > > called? If so, please look at my patch.
> > > >
> > > > https://lkml.org/lkml/2015/2/25/43
> > > >
> > > > It will zap the swapped out page. So, this is not a issue any more?
> > > >
> > > > >
> > > > > Another problem  is that if an anon page is shared by more than one
> > > > process,
> > > > > This happened when fork(), the anon page will be copy on write,
> > > > > In this case, we should not clear page dirty,
> > > > > This is not correct for other process which don't call MADV_FREE
> syscall.
> > > >
> > > > You mean we shouldn't inherit MADV_FREE attribute?
> > > > Why?
> > >
> > > Is it correct behavior if code like this:
> > >
> > > Parent:
> > > ptr1 = malloc(len);
> > > memset(ptr1, 'a', len);
> > > fork();
> > > if (I am parent)
> > > 	madvise_free(ptr1, len);
> > >
> > > child:
> > > sleep(10);
> > > parse_data(ptr1, len);  // child may see zero, not 'a',
> > > 			// is it the right behavior that the programer want?
> > >
> > > Because child don't call madvise_free(), so it should see 'a', not zero
> page.
> > > Isn't it ?
> >
> > You're absolutely right. Thanks.
> > But I doubt your fix is best. Most forks will exec soonish, so
> > it's not a good idea to make MADV_FREE void just because hinted pages
> > were shared when the syscall was called.
> > How about checking whether the page is shared in the reclaim path?
> > If it is still shared, we shouldn't discard it.
> 
> I got confused. Looking at copy_one_pte, it copies from src_pte
> and does not clear the dirty bit if it's not a shared mapping.
> If so, in your example, the child pte has the dirty bit on while the
> parent has the clean bit from madvise_free, so the VM shouldn't
> discard the page. No?
> 
It is not always true that src_pte has the dirty bit set.
For example,
if a page has been swapped out to the swap partition,
the pte becomes a swap entry.
Then, if a read fault happens,
the pte becomes a present page pte without the dirty bit.
Am I right?

Thanks



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-02-28  2:11         ` Wang, Yalin
  2015-02-28  6:01           ` [RFC V2] " Wang, Yalin
  2015-02-28 13:55           ` [RFC] " Minchan Kim
@ 2015-03-02 12:33           ` Michal Hocko
  2 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2015-03-02 12:33 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Minchan Kim',
	Andrew Morton, linux-kernel, linux-mm, Rik van Riel,
	Johannes Weiner, Mel Gorman, Shaohua Li

On Sat 28-02-15 10:11:13, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Michal Hocko [mailto:mstsxfx@gmail.com] On Behalf Of Michal Hocko
> > Sent: Saturday, February 28, 2015 5:03 AM
> > To: Wang, Yalin
> > Cc: 'Minchan Kim'; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > 
> > On Fri 27-02-15 11:37:18, Wang, Yalin wrote:
> > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > the Anonpage mapcount must be 1, so that this page is only used by
> > > the current process, not shared by other process like fork().
> > > if not clear page dirty for this anon page, the page will never be
> > > treated as freeable.
> > 
> > Very well spotted! I hadn't noticed that during the review.
> > 
> > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > ---
> > >  mm/madvise.c | 15 +++++----------
> > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8..257925a 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > unsigned long addr,
> > >  			continue;
> > >
> > >  		page = vm_normal_page(vma, addr, ptent);
> > > -		if (!page)
> > > +		if (!page || !PageAnon(page) || !trylock_page(page))
> > >  			continue;
> > 
> > PageAnon check seems to be redundant because we are not allowing
> > MADV_FREE on any !anon private mappings AFAIR.
> I only see this check:
> /* MADV_FREE works for only anon vma at the moment */
> 	if (vma->vm_file)
> 		return -EINVAL;
> 
> but for a file private mapping, there can sometimes be anon pages too,

AFAIR MADV_FREE was intended only for private anon mappings. What would
be the use case for MADV_FREE on file backed private mappings?

[...]

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC V2] mm: change mm_advise_free to clear page dirty
  2015-02-28  6:01           ` [RFC V2] " Wang, Yalin
@ 2015-03-02 12:38             ` Michal Hocko
  2015-03-03  2:06               ` [RFC V3] " Wang, Yalin
  0 siblings, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2015-03-02 12:38 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Minchan Kim', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li'

On Sat 28-02-15 14:01:46, Wang, Yalin wrote:
> This patch add ClearPageDirty() to clear AnonPage dirty flag,
> if not clear page dirty for this anon page, the page will never be
> treated as freeable. we also make sure the shared AnonPage is not
> freeable, we implement it by dirty all copyed AnonPage pte,
> so that make sure the Anonpage will not become freeable, unless
> all process which shared this page call madvise_free syscall.

I am not able to parse this text.

> Another change is that we also handle file-mapped pages:
> we just clear the pte young bit for file mappings. This is useful
> because it makes the reclaim path move file pages into the inactive
> lru list aggressively.

This doesn't belong to this patch. Whether file private mappings should
allow MADV_FREE is a separate topic and should be discussed independently.

> 
> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> ---
>  mm/madvise.c | 26 +++++++++++++++-----------
>  mm/memory.c  | 12 ++++++++++--
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..712756b 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -299,30 +299,38 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		page = vm_normal_page(vma, addr, ptent);
>  		if (!page)
>  			continue;
> +		if (!PageAnon(page))
> +			goto set_pte;
> +		if (!trylock_page(page))
> +			continue;
>  
>  		if (PageSwapCache(page)) {
> -			if (!trylock_page(page))
> -				continue;
> -
>  			if (!try_to_free_swap(page)) {
>  				unlock_page(page);
>  				continue;
>  			}
> -
> -			ClearPageDirty(page);
> -			unlock_page(page);
>  		}
>  
>  		/*
> +		 * Clear the page dirty flag for an anon page whether or not
> +		 * it is in the swap cache; an anon page outside the swap cache
> +		 * can also have the dirty flag set, which happens when it is
> +		 * removed from the swap cache by try_to_free_swap().
> +		 */
> +		ClearPageDirty(page);
> +		unlock_page(page);
> +		/*
>  		 * Some of architecture(ex, PPC) don't update TLB
>  		 * with set_pte_at and tlb_remove_tlb_entry so for
>  		 * the portability, remap the pte with old|clean
>  		 * after pte clearing.
>  		 */
> +set_pte:
>  		ptent = ptep_get_and_clear_full(mm, addr, pte,
>  						tlb->fullmm);
>  		ptent = pte_mkold(ptent);
> -		ptent = pte_mkclean(ptent);
> +		if (PageAnon(page))
> +			ptent = pte_mkclean(ptent);
>  		set_pte_at(mm, addr, pte, ptent);
>  		tlb_remove_tlb_entry(tlb, pte, addr);
>  	}
> @@ -364,10 +372,6 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  	if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
>  		return -EINVAL;
>  
> -	/* MADV_FREE works for only anon vma at the moment */
> -	if (vma->vm_file)
> -		return -EINVAL;
> -
>  	start = max(vma->vm_start, start_addr);
>  	if (start >= vma->vm_end)
>  		return -EINVAL;
> diff --git a/mm/memory.c b/mm/memory.c
> index 8068893..3d949b3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -874,10 +874,18 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page);
> -		if (PageAnon(page))
> +		if (PageAnon(page)) {
> +			/*
> +			 * Dirty the copied pte for an anon page; this is
> +			 * used by madvise_free_pte_range() and prevents a
> +			 * shared anon page from being freed by the
> +			 * madvise_free syscall.
> +			 */
> +			pte = pte_mkdirty(pte);
>  			rss[MM_ANONPAGES]++;
> -		else
> +		} else {
>  			rss[MM_FILEPAGES]++;
> +		}
>  	}
>  
>  out_set_pte:
> -- 
> 2.2.2
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC] mm: change mm_advise_free to clear page dirty
  2015-03-02  1:59                   ` Wang, Yalin
@ 2015-03-03  0:42                     ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2015-03-03  0:42 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: Michal Hocko, Andrew Morton, linux-kernel, linux-mm,
	Rik van Riel, Johannes Weiner, Mel Gorman, Shaohua Li

On Mon, Mar 02, 2015 at 09:59:20AM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> > Sent: Saturday, February 28, 2015 9:50 PM
> > To: Wang, Yalin
> > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > 
> > On Fri, Feb 27, 2015 at 10:37:14PM +0900, Minchan Kim wrote:
> > > On Fri, Feb 27, 2015 at 03:50:29PM +0800, Wang, Yalin wrote:
> > > > > -----Original Message-----
> > > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> > Kim
> > > > > Sent: Friday, February 27, 2015 2:44 PM
> > > > > To: Wang, Yalin
> > > > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org; linux-
> > > > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua Li
> > > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > > >
> > > > > On Fri, Feb 27, 2015 at 01:48:48PM +0800, Wang, Yalin wrote:
> > > > > > > -----Original Message-----
> > > > > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of
> > Minchan
> > > > > Kim
> > > > > > > Sent: Friday, February 27, 2015 1:28 PM
> > > > > > > To: Wang, Yalin
> > > > > > > Cc: Michal Hocko; Andrew Morton; linux-kernel@vger.kernel.org;
> > linux-
> > > > > > > mm@kvack.org; Rik van Riel; Johannes Weiner; Mel Gorman; Shaohua
> > Li
> > > > > > > Subject: Re: [RFC] mm: change mm_advise_free to clear page dirty
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > On Fri, Feb 27, 2015 at 11:37:18AM +0800, Wang, Yalin wrote:
> > > > > > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > > > > > the Anonpage mapcount must be 1, so that this page is only used
> > by
> > > > > > > > the current process, not shared by other process like fork().
> > > > > > > > if not clear page dirty for this anon page, the page will never
> > be
> > > > > > > > treated as freeable.
> > > > > > >
> > > > > > > In case of anonymous page, it has PG_dirty when VM adds it to
> > > > > > > swap cache and clear it in clear_page_dirty_for_io. That's why
> > > > > > > I added ClearPageDirty if we found it in swapcache.
> > > > > > > What case am I missing? It would be better to understand if you
> > > > > > > describe specific scenario.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > >
> > > > > > > > Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> > > > > > > > ---
> > > > > > > >  mm/madvise.c | 15 +++++----------
> > > > > > > >  1 file changed, 5 insertions(+), 10 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > > > > > index 6d0fcb8..257925a 100644
> > > > > > > > --- a/mm/madvise.c
> > > > > > > > +++ b/mm/madvise.c
> > > > > > > > @@ -297,22 +297,17 @@ static int madvise_free_pte_range(pmd_t
> > *pmd,
> > > > > > > unsigned long addr,
> > > > > > > >  			continue;
> > > > > > > >
> > > > > > > >  		page = vm_normal_page(vma, addr, ptent);
> > > > > > > > -		if (!page)
> > > > > > > > +		if (!page || !PageAnon(page)
> > || !trylock_page(page))
> > > > > > > >  			continue;
> > > > > > > >
> > > > > > > >  		if (PageSwapCache(page)) {
> > > > > > > > -			if (!trylock_page(page))
> > > > > > > > +			if (!try_to_free_swap(page))
> > > > > > > >  				continue;
> > > > > > > > -
> > > > > > > > -			if (!try_to_free_swap(page)) {
> > > > > > > > -				unlock_page(page);
> > > > > > > > -				continue;
> > > > > > > > -			}
> > > > > > > > -
> > > > > > > > -			ClearPageDirty(page);
> > > > > > > > -			unlock_page(page);
> > > > > > > >  		}
> > > > > > > >
> > > > > > > > +		if (page_mapcount(page) == 1)
> > > > > > > > +			ClearPageDirty(page);
> > > > > > > > +		unlock_page(page);
> > > > > > > >  		/*
> > > > > > > >  		 * Some of architecture(ex, PPC) don't update TLB
> > > > > > > >  		 * with set_pte_at and tlb_remove_tlb_entry so for
> > > > > > > > --
> > > > > > Yes, for page which is in SwapCache, it is correct,
> > > > > > But for anon page which is not in SwapCache, it is always
> > > > > > PageDirty(), so we should also clear dirty bit to make it freeable,
> > > > >
> > > > > No. Every anon page starts from !PageDirty and it has PG_dirty
> > > > > only when it's added into the swap cache. If vm_swap_full turns on,
> > > > > a page in swap cache could have PG_dirty via try_to_free_swap again.
> > > >
> > > > mmm..
> > > > sometimes you can see an anon page PageDirty(), but it is not in
> > swapcache,
> > > > for example, handle_pte_fault()-->do_swap_page()-->try_to_free_swap(),
> > > > at this time, the page is deleted from swapcache and is marked
> > PageDirty(),
> > >
> > > That's what I missed. It's clear and would be a simple patch, so
> > > could you send a patch to fix this issue with a detailed description
> > > like the above?
> > >
> > > >
> > > >
> > > > > So, Do you have concern about swapped-out pages when MADV_FREE is
> > > > > called? If so, please look at my patch.
> > > > >
> > > > > https://lkml.org/lkml/2015/2/25/43
> > > > >
> > > > > It will zap the swapped out page. So, this is not a issue any more?
> > > > >
> > > > > >
> > > > > > Another problem  is that if an anon page is shared by more than one
> > > > > process,
> > > > > > This happened when fork(), the anon page will be copy on write,
> > > > > > In this case, we should not clear page dirty,
> > > > > > This is not correct for other process which don't call MADV_FREE
> > syscall.
> > > > >
> > > > > You mean we shouldn't inherit MADV_FREE attribute?
> > > > > Why?
> > > >
> > > > Is it correct behavior if code like this:
> > > >
> > > > Parent:
> > > > ptr1 = malloc(len);
> > > > memset(ptr1, 'a', len);
> > > > fork();
> > > > if (I am parent)
> > > > 	madvise_free(ptr1, len);
> > > >
> > > > child:
> > > > sleep(10);
> > > > parse_data(ptr1, len);  // child may see zero, not 'a',
> > > > 			// is it the right behavior that the programer want?
> > > >
> > > > Because child don't call madvise_free(), so it should see 'a', not zero
> > page.
> > > > Isn't it ?
> > >
> > > You're absolutely right. Thanks.
> > > But I doubt your fix is best. Most forks will exec soonish, so
> > > it's not a good idea to make MADV_FREE void just because hinted pages
> > > were shared when the syscall was called.
> > > How about checking whether the page is shared in the reclaim path?
> > > If it is still shared, we shouldn't discard it.
> > 
> > I got confused. Looking at copy_one_pte, it copies from src_pte
> > and does not clear the dirty bit if it's not a shared mapping.
> > If so, in your example, the child pte has the dirty bit on while the
> > parent has the clean bit from madvise_free, so the VM shouldn't
> > discard the page. No?
> > 
> It is not always true that src_pte has the dirty bit set.
> For example,
> if a page has been swapped out to the swap partition,
> the pte becomes a swap entry.
> Then, if a read fault happens,
> the pte becomes a present page pte without the dirty bit.
> Am I right?

Parent A, Child B

1. Page : swapped-out, A: !pte_present B: !pte_present, A or B does madvise_free

It's not pte_present so madvise_free will be void

2. Page : swapped-in, PageSwapCache A: pte_clean by swapin-read fault
   B: !pte_present, A does madvise_free

madvise_free will do try_to_free_swap, but the other process (ie, B) still
references the swap slot, so it will fail and madvise_free by A will be void.

3. Page: swapped-in, !PageSwapCache A: pte_clean by swapin-read fault
   B: !pte_clean by swapin-read fault, A does madvise_free

Since the page was removed from the swap cache, it has PG_dirty, so we can't
discard the page (ie, we should swap it out, not discard it).

Is there any use case I am missing?

Anyway, relying on PageDirty rather than pte_dirty for madvise_free
causes a lot of confusion. Originally, the PageDirty flag for an anonymous
page was for avoiding unnecessary swapout, so I will cook a patch and send an RFC.

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-02 12:38             ` Michal Hocko
@ 2015-03-03  2:06               ` Wang, Yalin
  0 siblings, 0 replies; 42+ messages in thread
From: Wang, Yalin @ 2015-03-03  2:06 UTC (permalink / raw)
  To: 'Michal Hocko', 'Minchan Kim',
	'Andrew Morton', 'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li'

This patch adds ClearPageDirty() to clear the anon page dirty flag;
if we do not clear page dirty for an anon page, the page will never be
treated as freeable. We also make sure a shared anon page is not
freeable: we implement this by dirtying the pte of every copied anon
page, which guarantees the page cannot become freeable until every
process sharing it has called the madvise_free syscall.

Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
---
 mm/madvise.c | 16 +++++++++-------
 mm/memory.c  | 12 ++++++++++--
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..b61070d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -297,23 +297,25 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 
 		page = vm_normal_page(vma, addr, ptent);
-		if (!page)
+		if (!page || !trylock_page(page))
 			continue;
 
 		if (PageSwapCache(page)) {
-			if (!trylock_page(page))
-				continue;
-
 			if (!try_to_free_swap(page)) {
 				unlock_page(page);
 				continue;
 			}
-
-			ClearPageDirty(page);
-			unlock_page(page);
 		}
 
 		/*
+		 * Clear the page dirty flag for an anon page whether or not
+		 * it is in the swap cache; an anon page outside the swap cache
+		 * can also have the dirty flag set, which happens when it is
+		 * removed from the swap cache by try_to_free_swap().
+		 */
+		ClearPageDirty(page);
+		unlock_page(page);
+		/*
 		 * Some of architecture(ex, PPC) don't update TLB
 		 * with set_pte_at and tlb_remove_tlb_entry so for
 		 * the portability, remap the pte with old|clean
diff --git a/mm/memory.c b/mm/memory.c
index 8068893..3d949b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -874,10 +874,18 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (page) {
 		get_page(page);
 		page_dup_rmap(page);
-		if (PageAnon(page))
+		if (PageAnon(page)) {
+			/*
+			 * Dirty the copied pte for an anon page; this is
+			 * used by madvise_free_pte_range() and prevents a
+			 * shared anon page from being freed by the
+			 * madvise_free syscall.
+			 */
+			pte = pte_mkdirty(pte);
 			rss[MM_ANONPAGES]++;
-		else
+		} else {
 			rss[MM_FILEPAGES]++;
+		}
 	}
 
 out_set_pte:
-- 
2.2.2


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-05 15:35 ` Michal Hocko
@ 2015-03-09  0:57   ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2015-03-09  0:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wang, Yalin, 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

Hello Michal,

On Thu, Mar 05, 2015 at 04:35:05PM +0100, Michal Hocko wrote:
> On Tue 03-03-15 12:25:51, Minchan Kim wrote:
> [...]
> > From 30c6d5b35a3dc7e451041183ce5efd6a6c42bf88 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan@kernel.org>
> > Date: Tue, 3 Mar 2015 10:06:59 +0900
> > Subject: [RFC] mm: make every pte dirty on do_swap_page
> 
> Hi Minchan, could you resend this patch separately. I am afraid that
> this one got so convoluted with originally unrelated issues that
> people might miss it.
> 
> Thanks!

No problem. Thanks for the review.
I will resend it this week, but I'm afraid everybody will be at LSF/MM,
so they will be busy with hard work there. :)


> -- 
> Michal Hocko
> SUSE Labs

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-03  3:25 [RFC V3] " Minchan Kim
  2015-03-03  3:59 ` Wang, Yalin
@ 2015-03-05 15:35 ` Michal Hocko
  2015-03-09  0:57   ` Minchan Kim
  1 sibling, 1 reply; 42+ messages in thread
From: Michal Hocko @ 2015-03-05 15:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Wang, Yalin, 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

On Tue 03-03-15 12:25:51, Minchan Kim wrote:
[...]
> From 30c6d5b35a3dc7e451041183ce5efd6a6c42bf88 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Tue, 3 Mar 2015 10:06:59 +0900
> Subject: [RFC] mm: make every pte dirty on do_swap_page

Hi Minchan, could you resend this patch separately. I am afraid that
this one got so convoluted with originally unrelated issues that
people might miss it.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-03  6:46     ` Wang, Yalin
@ 2015-03-03 13:40       ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2015-03-03 13:40 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Michal Hocko', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

On Tue, Mar 03, 2015 at 02:46:40PM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> > Sent: Tuesday, March 03, 2015 12:15 PM
> > To: Wang, Yalin
> > Cc: 'Michal Hocko'; 'Andrew Morton'; 'linux-kernel@vger.kernel.org';
> > 'linux-mm@kvack.org'; 'Rik van Riel'; 'Johannes Weiner'; 'Mel Gorman';
> > 'Shaohua Li'; Hugh Dickins; Cyrill Gorcunov
> > Subject: Re: [RFC V3] mm: change mm_advise_free to clear page dirty
> > 
> > On Tue, Mar 03, 2015 at 11:59:17AM +0800, Wang, Yalin wrote:
> > > > -----Original Message-----
> > > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> > Kim
> > > > Sent: Tuesday, March 03, 2015 11:26 AM
> > > > To: Wang, Yalin
> > > > Cc: 'Michal Hocko'; 'Andrew Morton'; 'linux-kernel@vger.kernel.org';
> > > > 'linux-mm@kvack.org'; 'Rik van Riel'; 'Johannes Weiner'; 'Mel Gorman';
> > > > 'Shaohua Li'; Hugh Dickins; Cyrill Gorcunov
> > > > Subject: Re: [RFC V3] mm: change mm_advise_free to clear page dirty
> > > >
> > > > Could you separte this patch in this patchset thread?
> > > > It's tackling differnt problem.
> > > >
> > > > As well, I had a question to previous thread about why shared page
> > > > has a problem now but you didn't answer and send a new patchset.
> > > > It makes reviewers/maintainer time waste/confuse. Please, don't
> > > > hurry to send a code. Before that, resolve reviewers's comments.
> > > >
> > > > On Tue, Mar 03, 2015 at 10:06:40AM +0800, Wang, Yalin wrote:
> > > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > > if not clear page dirty for this anon page, the page will never be
> > > > > treated as freeable. We also make sure the shared AnonPage is not
> > > > > freeable, we implement it by dirty all copyed AnonPage pte,
> > > > > so that make sure the Anonpage will not become freeable, unless
> > > > > all process which shared this page call madvise_free syscall.
> > > >
> > > > Please, spend more time to make description clear. I really doubt
> > > > who understand this description without code inspection. :(
> > > > Of course, I'm not a person to write description clear like native
> > > > , either but just I'm sure I spend a more time to write description
> > > > rather than coding, at least. :)
> > > >
> > > I see, I will send another mail for file private map pages.
> > > Sorry for my English expressions.
> > > I think your solution is ok,
> > > Your patch will make sure the anonpage pte will always be dirty.
> > > I add some comments for your patch:
> > >
> > > > ---
> > > >  mm/madvise.c | 1 -
> > > >  mm/memory.c  | 9 +++++++--
> > > >  mm/rmap.c    | 2 +-
> > > >  mm/vmscan.c  | 3 +--
> > > >  4 files changed, 9 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 6d0fcb8..d64200e 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -309,7 +309,6 @@ static int madvise_free_pte_range(pmd_t *pmd,
> > unsigned
> > > > long addr,
> > > >  				continue;
> > > >  			}
> > > >
> > > > -			ClearPageDirty(page);
> > > >  			unlock_page(page);
> > > >  		}
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 8ae52c9..2f45e77 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -2460,9 +2460,14 @@ static int do_swap_page(struct mm_struct *mm,
> > struct
> > > > vm_area_struct *vma,
> > > >
> > > >  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> > > >  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> > > > -	pte = mk_pte(page, vma->vm_page_prot);
> > > > +
> > > > +	/*
> > > > +	 * Every swapped-out page was pte_dirty, so we make the pte dirty again.
> > > > +	 * MADV_FREE relies on it.
> > > > +	 */
> > > > +	pte = mk_pte(pte_mkdirty(page), vma->vm_page_prot);
> > > pte_mkdirty() usage seems wrong here.
> > 
> > Argh, that reveals I didn't even build-test it. My shame.
> > But the RFC tag might mitigate my shame. :)
> > I will fix it when I send a formal version.
> > Thanks for the review.
> > 
> > >
> > > >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> > > > -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > > +		pte = maybe_mkwrite(pte, vma);
> > > >  		flags &= ~FAULT_FLAG_WRITE;
> > > >  		ret |= VM_FAULT_WRITE;
> > > >  		exclusive = 1;
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 47b3ba8..34c1d66 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page,
> > struct
> > > > vm_area_struct *vma,
> > > >
> > > >  		if (flags & TTU_FREE) {
> > > >  			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> > > > -			if (!dirty && !PageDirty(page)) {
> > > > +			if (!dirty) {
> > > >  				/* It's a freeable page by MADV_FREE */
> > > >  				dec_mm_counter(mm, MM_ANONPAGES);
> > > >  				goto discard;
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 671e47e..7f520c9 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -805,8 +805,7 @@ static enum page_references
> > > > page_check_references(struct page *page,
> > > >  		return PAGEREF_KEEP;
> > > >  	}
> > > >
> > > > -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> > > > -			!PageDirty(page))
> > > > +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
> > > >  		*freeable = true;
> > > >
> > > >  	/* Reclaim if clean, defer dirty pages to writeback */
> > > > --
> > > > 1.9.3
> > > Could we remove the SetPageDirty(page) in try_to_free_swap() based
> > > on this patch?
> > > Because your patch makes sure the pte is always dirty,
> > > we don't need SetPageDirty();
> > > the try_to_unmap() path will re-dirty the page during the reclaim path,
> > > won't it?
> > 
> > I don't know what side effects we would have if we removed SetPageDirty.
> > It might regress tmpfs, which can have a page without a pte.
> > I don't want to take such a risk in this patch.
> > If you want it, you could suggest it separately once this patch lands.
> > 
Ok, could you send out your change as a normal patch so that more of the related maintainers can review and comment on it?

NP, but let's wait a few days to see if we are lucky enough that they
grab a time slot to review. :)

Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-03  4:14   ` Minchan Kim
@ 2015-03-03  6:46     ` Wang, Yalin
  2015-03-03 13:40       ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Wang, Yalin @ 2015-03-03  6:46 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: 'Michal Hocko', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

> -----Original Message-----
> From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> Sent: Tuesday, March 03, 2015 12:15 PM
> To: Wang, Yalin
> Cc: 'Michal Hocko'; 'Andrew Morton'; 'linux-kernel@vger.kernel.org';
> 'linux-mm@kvack.org'; 'Rik van Riel'; 'Johannes Weiner'; 'Mel Gorman';
> 'Shaohua Li'; Hugh Dickins; Cyrill Gorcunov
> Subject: Re: [RFC V3] mm: change mm_advise_free to clear page dirty
> 
> On Tue, Mar 03, 2015 at 11:59:17AM +0800, Wang, Yalin wrote:
> > > -----Original Message-----
> > > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan
> Kim
> > > Sent: Tuesday, March 03, 2015 11:26 AM
> > > To: Wang, Yalin
> > > Cc: 'Michal Hocko'; 'Andrew Morton'; 'linux-kernel@vger.kernel.org';
> > > 'linux-mm@kvack.org'; 'Rik van Riel'; 'Johannes Weiner'; 'Mel Gorman';
> > > 'Shaohua Li'; Hugh Dickins; Cyrill Gorcunov
> > > Subject: Re: [RFC V3] mm: change mm_advise_free to clear page dirty
> > >
> > > Could you separte this patch in this patchset thread?
> > > It's tackling differnt problem.
> > >
> > > As well, I had a question to previous thread about why shared page
> > > has a problem now but you didn't answer and send a new patchset.
> > > It makes reviewers/maintainer time waste/confuse. Please, don't
> > > hurry to send a code. Before that, resolve reviewers's comments.
> > >
> > > On Tue, Mar 03, 2015 at 10:06:40AM +0800, Wang, Yalin wrote:
> > > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > > if not clear page dirty for this anon page, the page will never be
> > > > treated as freeable. We also make sure the shared AnonPage is not
> > > > freeable, we implement it by dirty all copyed AnonPage pte,
> > > > so that make sure the Anonpage will not become freeable, unless
> > > > all process which shared this page call madvise_free syscall.
> > >
> > > Please, spend more time to make description clear. I really doubt
> > > who understand this description without code inspection. :(
> > > Of course, I'm not a person to write description clear like native
> > > , either but just I'm sure I spend a more time to write description
> > > rather than coding, at least. :)
> > >
> > I see, I will send another mail for file private map pages.
> > Sorry for my English expressions.
> > I think your solution is ok,
> > Your patch will make sure the anonpage pte will always be dirty.
> > I add some comments for your patch:
> >
> > > ---
> > >  mm/madvise.c | 1 -
> > >  mm/memory.c  | 9 +++++++--
> > >  mm/rmap.c    | 2 +-
> > >  mm/vmscan.c  | 3 +--
> > >  4 files changed, 9 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 6d0fcb8..d64200e 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -309,7 +309,6 @@ static int madvise_free_pte_range(pmd_t *pmd,
> unsigned
> > > long addr,
> > >  				continue;
> > >  			}
> > >
> > > -			ClearPageDirty(page);
> > >  			unlock_page(page);
> > >  		}
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8ae52c9..2f45e77 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -2460,9 +2460,14 @@ static int do_swap_page(struct mm_struct *mm,
> struct
> > > vm_area_struct *vma,
> > >
> > >  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> > >  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> > > -	pte = mk_pte(page, vma->vm_page_prot);
> > > +
> > > +	/*
> > > +	 * Every page swapped-out was pte_dirty so we makes pte dirty again.
> > > +	 * MADV_FREE relys on it.
> > > +	 */
> > > +	pte = mk_pte(pte_mkdirty(page), vma->vm_page_prot);
> > pte_mkdirty() usage seems wrong here.
> 
> Argh, it reveals I didn't test even build. My shame.
> But RFC tag might mitigate my shame. :)
> I will fix it if I send a formal version.
> Thanks for the review.
> 
> >
> > >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> > > -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +		pte = maybe_mkwrite(pte, vma);
> > >  		flags &= ~FAULT_FLAG_WRITE;
> > >  		ret |= VM_FAULT_WRITE;
> > >  		exclusive = 1;
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 47b3ba8..34c1d66 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page,
> struct
> > > vm_area_struct *vma,
> > >
> > >  		if (flags & TTU_FREE) {
> > >  			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> > > -			if (!dirty && !PageDirty(page)) {
> > > +			if (!dirty) {
> > >  				/* It's a freeable page by MADV_FREE */
> > >  				dec_mm_counter(mm, MM_ANONPAGES);
> > >  				goto discard;
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 671e47e..7f520c9 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -805,8 +805,7 @@ static enum page_references
> > > page_check_references(struct page *page,
> > >  		return PAGEREF_KEEP;
> > >  	}
> > >
> > > -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> > > -			!PageDirty(page))
> > > +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
> > >  		*freeable = true;
> > >
> > >  	/* Reclaim if clean, defer dirty pages to writeback */
> > > --
> > > 1.9.3
> > Could we remove SetPageDirty(page); in try_to_free_swap() function based
> on this patch?
> > Because your patch will make sure the pte is always dirty,
> > We don't need setpagedirty(),
> > The try_to_unmap() path will re-dirty the page during reclaim path,
> > Isn't it?
> 
> I dont't know what side-effect we will have if we removes SetPageDirty.
> It might regress on tmpfs which would page without pte.
> I don't want to have such risk in this patch.
> If you want it, you could suggest it separately if this patch lands.
> 
OK. Could you send your change out as a normal patch so that more of the related maintainers can review and comment on it?

Thanks


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-03  3:59 ` Wang, Yalin
@ 2015-03-03  4:14   ` Minchan Kim
  2015-03-03  6:46     ` Wang, Yalin
  0 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2015-03-03  4:14 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Michal Hocko', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

On Tue, Mar 03, 2015 at 11:59:17AM +0800, Wang, Yalin wrote:
> > -----Original Message-----
> > From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> > Sent: Tuesday, March 03, 2015 11:26 AM
> > To: Wang, Yalin
> > Cc: 'Michal Hocko'; 'Andrew Morton'; 'linux-kernel@vger.kernel.org';
> > 'linux-mm@kvack.org'; 'Rik van Riel'; 'Johannes Weiner'; 'Mel Gorman';
> > 'Shaohua Li'; Hugh Dickins; Cyrill Gorcunov
> > Subject: Re: [RFC V3] mm: change mm_advise_free to clear page dirty
> > 
> > Could you separte this patch in this patchset thread?
> > It's tackling differnt problem.
> > 
> > As well, I had a question to previous thread about why shared page
> > has a problem now but you didn't answer and send a new patchset.
> > It makes reviewers/maintainer time waste/confuse. Please, don't
> > hurry to send a code. Before that, resolve reviewers's comments.
> > 
> > On Tue, Mar 03, 2015 at 10:06:40AM +0800, Wang, Yalin wrote:
> > > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > > if not clear page dirty for this anon page, the page will never be
> > > treated as freeable. We also make sure the shared AnonPage is not
> > > freeable, we implement it by dirty all copyed AnonPage pte,
> > > so that make sure the Anonpage will not become freeable, unless
> > > all process which shared this page call madvise_free syscall.
> > 
> > Please, spend more time to make description clear. I really doubt
> > who understand this description without code inspection. :(
> > Of course, I'm not a person to write description clear like native
> > , either but just I'm sure I spend a more time to write description
> > rather than coding, at least. :)
> > 
> I see, I will send another mail for file private map pages.
> Sorry for my English expressions.
> I think your solution is ok,
> Your patch will make sure the anonpage pte will always be dirty.
> I add some comments for your patch:
> 
> > ---
> >  mm/madvise.c | 1 -
> >  mm/memory.c  | 9 +++++++--
> >  mm/rmap.c    | 2 +-
> >  mm/vmscan.c  | 3 +--
> >  4 files changed, 9 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8..d64200e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -309,7 +309,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
> > long addr,
> >  				continue;
> >  			}
> > 
> > -			ClearPageDirty(page);
> >  			unlock_page(page);
> >  		}
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8ae52c9..2f45e77 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2460,9 +2460,14 @@ static int do_swap_page(struct mm_struct *mm, struct
> > vm_area_struct *vma,
> > 
> >  	inc_mm_counter_fast(mm, MM_ANONPAGES);
> >  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> > -	pte = mk_pte(page, vma->vm_page_prot);
> > +
> > +	/*
> > +	 * Every page swapped-out was pte_dirty so we makes pte dirty again.
> > +	 * MADV_FREE relys on it.
> > +	 */
> > +	pte = mk_pte(pte_mkdirty(page), vma->vm_page_prot);
> pte_mkdirty() usage seems wrong here.

Argh, this reveals that I didn't even build-test it. My shame.
But the RFC tag might mitigate my shame. :)
I will fix it when I send a formal version.
Thanks for the review.

> 
> >  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> > -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > +		pte = maybe_mkwrite(pte, vma);
> >  		flags &= ~FAULT_FLAG_WRITE;
> >  		ret |= VM_FAULT_WRITE;
> >  		exclusive = 1;
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 47b3ba8..34c1d66 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct
> > vm_area_struct *vma,
> > 
> >  		if (flags & TTU_FREE) {
> >  			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> > -			if (!dirty && !PageDirty(page)) {
> > +			if (!dirty) {
> >  				/* It's a freeable page by MADV_FREE */
> >  				dec_mm_counter(mm, MM_ANONPAGES);
> >  				goto discard;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 671e47e..7f520c9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -805,8 +805,7 @@ static enum page_references
> > page_check_references(struct page *page,
> >  		return PAGEREF_KEEP;
> >  	}
> > 
> > -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> > -			!PageDirty(page))
> > +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
> >  		*freeable = true;
> > 
> >  	/* Reclaim if clean, defer dirty pages to writeback */
> > --
> > 1.9.3
> Could we remove SetPageDirty(page); in try_to_free_swap() function based on this patch?
> Because your patch will make sure the pte is always dirty,
> We don't need setpagedirty(),
> The try_to_unmap() path will re-dirty the page during reclaim path,
> Isn't it?

I don't know what side effects we would have if we removed SetPageDirty.
It might cause a regression on tmpfs, which can have pages without ptes.
I don't want to take such a risk in this patch.
If you want it, you could propose it separately once this patch lands.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 42+ messages in thread

* RE: [RFC V3] mm: change mm_advise_free to clear page dirty
  2015-03-03  3:25 [RFC V3] " Minchan Kim
@ 2015-03-03  3:59 ` Wang, Yalin
  2015-03-03  4:14   ` Minchan Kim
  2015-03-05 15:35 ` Michal Hocko
  1 sibling, 1 reply; 42+ messages in thread
From: Wang, Yalin @ 2015-03-03  3:59 UTC (permalink / raw)
  To: 'Minchan Kim'
  Cc: 'Michal Hocko', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

> -----Original Message-----
> From: Minchan Kim [mailto:minchan.kim@gmail.com] On Behalf Of Minchan Kim
> Sent: Tuesday, March 03, 2015 11:26 AM
> To: Wang, Yalin
> Cc: 'Michal Hocko'; 'Andrew Morton'; 'linux-kernel@vger.kernel.org';
> 'linux-mm@kvack.org'; 'Rik van Riel'; 'Johannes Weiner'; 'Mel Gorman';
> 'Shaohua Li'; Hugh Dickins; Cyrill Gorcunov
> Subject: Re: [RFC V3] mm: change mm_advise_free to clear page dirty
> 
> Could you separte this patch in this patchset thread?
> It's tackling differnt problem.
> 
> As well, I had a question to previous thread about why shared page
> has a problem now but you didn't answer and send a new patchset.
> It makes reviewers/maintainer time waste/confuse. Please, don't
> hurry to send a code. Before that, resolve reviewers's comments.
> 
> On Tue, Mar 03, 2015 at 10:06:40AM +0800, Wang, Yalin wrote:
> > This patch add ClearPageDirty() to clear AnonPage dirty flag,
> > if not clear page dirty for this anon page, the page will never be
> > treated as freeable. We also make sure the shared AnonPage is not
> > freeable, we implement it by dirty all copyed AnonPage pte,
> > so that make sure the Anonpage will not become freeable, unless
> > all process which shared this page call madvise_free syscall.
> 
> Please, spend more time to make description clear. I really doubt
> who understand this description without code inspection. :(
> Of course, I'm not a person to write description clear like native
> , either but just I'm sure I spend a more time to write description
> rather than coding, at least. :)
> 
I see; I will send another mail about file-backed private mapped pages.
Sorry for my English.
I think your solution is OK:
your patch will make sure the anon page's pte is always dirty.
I have added some comments on your patch:

> ---
>  mm/madvise.c | 1 -
>  mm/memory.c  | 9 +++++++--
>  mm/rmap.c    | 2 +-
>  mm/vmscan.c  | 3 +--
>  4 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..d64200e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -309,7 +309,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned
> long addr,
>  				continue;
>  			}
> 
> -			ClearPageDirty(page);
>  			unlock_page(page);
>  		}
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 8ae52c9..2f45e77 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2460,9 +2460,14 @@ static int do_swap_page(struct mm_struct *mm, struct
> vm_area_struct *vma,
> 
>  	inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +
> +	/*
> +	 * Every page swapped-out was pte_dirty so we makes pte dirty again.
> +	 * MADV_FREE relys on it.
> +	 */
> +	pte = mk_pte(pte_mkdirty(page), vma->vm_page_prot);
The pte_mkdirty() usage seems wrong here: pte_mkdirty() takes a pte_t, not a struct page *, so this won't build. Did you mean pte = pte_mkdirty(mk_pte(page, vma->vm_page_prot)); ?

>  	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +		pte = maybe_mkwrite(pte, vma);
>  		flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
>  		exclusive = 1;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 47b3ba8..34c1d66 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct
> vm_area_struct *vma,
> 
>  		if (flags & TTU_FREE) {
>  			VM_BUG_ON_PAGE(PageSwapCache(page), page);
> -			if (!dirty && !PageDirty(page)) {
> +			if (!dirty) {
>  				/* It's a freeable page by MADV_FREE */
>  				dec_mm_counter(mm, MM_ANONPAGES);
>  				goto discard;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 671e47e..7f520c9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -805,8 +805,7 @@ static enum page_references
> page_check_references(struct page *page,
>  		return PAGEREF_KEEP;
>  	}
> 
> -	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
> -			!PageDirty(page))
> +	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
>  		*freeable = true;
> 
>  	/* Reclaim if clean, defer dirty pages to writeback */
> --
> 1.9.3
Could we remove the SetPageDirty(page) in try_to_free_swap() based on this patch?
Since your patch makes sure the pte is always dirty,
we shouldn't need SetPageDirty();
the try_to_unmap() path will re-dirty the page during reclaim,
won't it?

Thanks








^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC V3] mm: change mm_advise_free to clear page dirty
@ 2015-03-03  3:25 Minchan Kim
  2015-03-03  3:59 ` Wang, Yalin
  2015-03-05 15:35 ` Michal Hocko
  0 siblings, 2 replies; 42+ messages in thread
From: Minchan Kim @ 2015-03-03  3:25 UTC (permalink / raw)
  To: Wang, Yalin
  Cc: 'Michal Hocko', 'Andrew Morton',
	'linux-kernel@vger.kernel.org',
	'linux-mm@kvack.org', 'Rik van Riel',
	'Johannes Weiner', 'Mel Gorman',
	'Shaohua Li',
	Hugh Dickins, Cyrill Gorcunov

Could you separate this patch out of this patchset thread?
It's tackling a different problem.

Also, I asked in the previous thread why a shared page is a
problem, but you didn't answer and sent a new patchset instead.
That wastes reviewers' and maintainers' time and confuses them.
Please don't hurry to send code; resolve the reviewers' comments first.

On Tue, Mar 03, 2015 at 10:06:40AM +0800, Wang, Yalin wrote:
> This patch add ClearPageDirty() to clear AnonPage dirty flag,
> if not clear page dirty for this anon page, the page will never be
> treated as freeable. We also make sure the shared AnonPage is not
> freeable, we implement it by dirty all copyed AnonPage pte,
> so that make sure the Anonpage will not become freeable, unless
> all process which shared this page call madvise_free syscall.

Please spend more time making the description clear. I really doubt
anyone could understand this description without inspecting the code. :(
Of course, I can't write descriptions as clearly as a native speaker
either, but I'm sure that, at least, I spend more time writing the
description than writing the code. :)

> 
> Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
> ---
>  mm/madvise.c | 16 +++++++++-------
>  mm/memory.c  | 12 ++++++++++--
>  2 files changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..b61070d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -297,23 +297,25 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			continue;
>  
>  		page = vm_normal_page(vma, addr, ptent);
> -		if (!page)
> +		if (!page || !trylock_page(page))
>  			continue;
>  
>  		if (PageSwapCache(page)) {
> -			if (!trylock_page(page))
> -				continue;
> -
>  			if (!try_to_free_swap(page)) {
>  				unlock_page(page);
>  				continue;
>  			}
> -
> -			ClearPageDirty(page);
> -			unlock_page(page);
>  		}
>  
>  		/*
> +		 * we clear page dirty flag for AnonPage, no matter if this
> +		 * page is in swapcahce or not, AnonPage not in swapcache also set
> +		 * dirty flag sometimes, this happened when a AnonPage is removed
> +		 * from swapcahce by try_to_free_swap()
> +		 */
> +		ClearPageDirty(page);
> +		unlock_page(page);
> +		/*

Parent:

ptrP = malloc();
*ptrP = 'a';
fork(); -> the child's pte becomes dirty with your patch
..
memory pressure -> so the page is swapped out
..
..
Child: var = *ptrP; assert(var == 'a') -> swap-in happens and the child's pte is clean
Parent: var = *ptrP; assert(var == 'a') -> swap-in happens and the parent's pte is clean
..
..
Parent:
madvise_free -> clears PageDirty
So, both parent and child have pte_clean and !PageDirty, which
makes the page a target for the VM to discard.
..
The VM discards the page under memory pressure.
..
Child: var = *ptrP; assert(var == 'a'); <---- oops.

And blindly clearing PageDirty also causes duplicated swap-outs.

>  		 * Some of architecture(ex, PPC) don't update TLB
>  		 * with set_pte_at and tlb_remove_tlb_entry so for
>  		 * the portability, remap the pte with old|clean
> diff --git a/mm/memory.c b/mm/memory.c
> index 8068893..3d949b3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -874,10 +874,18 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	if (page) {
>  		get_page(page);
>  		page_dup_rmap(page);
> -		if (PageAnon(page))
> +		if (PageAnon(page)) {
> +			/*
> +			 * we dirty the copyed pte for anon page,
> +			 * this is useful for madvise_free_pte_range(),
> +			 * this can prevent shared anon page freed by madvise_free
> +			 * syscall
> +			 */
> +			pte = pte_mkdirty(pte);

This makes every MADV_FREE-hinted page void. IOW, if a process that
called MADV_FREE then calls fork, the VM cannot discard the pages
unless the child frees them or calls madvise_free as well. So, if the
parent calls madvise_free before fork, we couldn't free those pages.
IOW, you are ignoring the example below.

parent:
ptr1 = malloc(len);
        -> allocator calls mmap(len);
memset(ptr1, 'a', len);
free(ptr1);
        -> allocator calls madvise_free(ptr1, len);
fork();
..
..
        -> the VM discards the hinted pages
child:

ptr2 = malloc(len);
        -> the allocator reuses the chunk freed by the parent.
So the child will see zero pages through ptr2, but since it hasn't
written anything yet, either garbage or a zero page is fine for it.

Also, you are adding new instructions to fork, which is a very
frequent syscall, so I'd like to find another way that avoids adding
instructions to such a hot path.

I will send a different patch. Please review it.

So, my suggestion is below. It always makes the pte dirty, so let's
Cc Cyrill to take care of soft-dirty, and Hugh, who is Mr. Swap.

From 30c6d5b35a3dc7e451041183ce5efd6a6c42bf88 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Tue, 3 Mar 2015 10:06:59 +0900
Subject: [RFC] mm: make every pte dirty on do_swap_page

Basically, MADV_FREE relies on the pte dirty bit to decide whether
the VM should discard a page or not. However, a swapped-in page
doesn't have pte_dirty. Instead, the VM checks PageDirty and
PageSwapCache for such a page, because a swapped-in page could still
live in the swap cache, or could have PageDirty set when it is
removed from the swap cache, so MADV_FREE checks those and doesn't
discard.

The problem here is that any anonymous page gets PageDirty when it is
removed from the swap cache, so the VM cannot treat such pages as
freeable even after madvise_free. Look at the example below.

ptr = malloc();
memset(ptr);
..
heavy memory pressure -> swap-out all of pages
..
out of memory pressure
..
var = *ptr; -> swap-in page/remove the page from swapcache. so pte_clean
               but SetPageDirty

madvise_free(ptr);
..
..
heavy memory pressure -> VM cannot discard the page by PageDirty.

PageDirty on an anonymous page exists to avoid duplicated swap-outs.
In other words, if a page was swapped in but still lives in the swap
cache (i.e., !PageDirty), we can save a swap-out if the page is later
selected as a victim by the VM, because the swap device still holds
the previously swapped-out contents of the page.

So, rather than relying on PG_dirty to make madvise_free work,
pte_dirty is more straightforward. Inherently, a swapped-out page was
pte_dirty, so this patch restores the dirtiness when a swap-in fault
happens, and madvise_free no longer relies on PageDirty.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 1 -
 mm/memory.c  | 9 +++++++--
 mm/rmap.c    | 2 +-
 mm/vmscan.c  | 3 +--
 4 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..d64200e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -309,7 +309,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				continue;
 			}
 
-			ClearPageDirty(page);
 			unlock_page(page);
 		}
 
diff --git a/mm/memory.c b/mm/memory.c
index 8ae52c9..2f45e77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2460,9 +2460,14 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	inc_mm_counter_fast(mm, MM_ANONPAGES);
 	dec_mm_counter_fast(mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+
+	/*
+	 * Every page swapped-out was pte_dirty so we makes pte dirty again.
+	 * MADV_FREE relys on it.
+	 */
+	pte = mk_pte(pte_mkdirty(page), vma->vm_page_prot);
 	if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = maybe_mkwrite(pte, vma);
 		flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index 47b3ba8..34c1d66 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1268,7 +1268,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 		if (flags & TTU_FREE) {
 			VM_BUG_ON_PAGE(PageSwapCache(page), page);
-			if (!dirty && !PageDirty(page)) {
+			if (!dirty) {
 				/* It's a freeable page by MADV_FREE */
 				dec_mm_counter(mm, MM_ANONPAGES);
 				goto discard;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 671e47e..7f520c9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -805,8 +805,7 @@ static enum page_references page_check_references(struct page *page,
 		return PAGEREF_KEEP;
 	}
 
-	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
-			!PageDirty(page))
+	if (PageAnon(page) && !pte_dirty && !PageSwapCache(page))
 		*freeable = true;
 
 	/* Reclaim if clean, defer dirty pages to writeback */
-- 
1.9.3

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2015-03-09  0:57 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-24  8:18 [PATCH RFC 1/4] mm: throttle MADV_FREE Minchan Kim
2015-02-24  8:18 ` [PATCH RFC 2/4] mm: change deactivate_page with deactivate_file_page Minchan Kim
2015-02-24  8:18 ` [PATCH RFC 3/4] mm: move lazy free pages to inactive list Minchan Kim
2015-02-24 16:14   ` Michal Hocko
2015-02-25  0:27     ` Minchan Kim
2015-02-25 15:17       ` Michal Hocko
2015-02-24  8:18 ` [PATCH RFC 4/4] mm: support MADV_FREE in swapless system Minchan Kim
2015-02-24 16:51   ` Michal Hocko
2015-02-25  1:41     ` Minchan Kim
2015-02-24 15:43 ` [PATCH RFC 1/4] mm: throttle MADV_FREE Michal Hocko
2015-02-24 22:54   ` Shaohua Li
2015-02-25 14:13     ` Michal Hocko
2015-02-25  0:08   ` Minchan Kim
2015-02-25  7:11     ` Minchan Kim
2015-02-25 15:07       ` Michal Hocko
2015-02-25 18:37       ` Shaohua Li
2015-02-26  0:42         ` Minchan Kim
2015-02-26 19:04           ` Shaohua Li
2015-02-27  3:37     ` [RFC] mm: change mm_advise_free to clear page dirty Wang, Yalin
2015-02-27  5:28       ` Minchan Kim
2015-02-27  5:48         ` Wang, Yalin
2015-02-27  6:44           ` Minchan Kim
2015-02-27  7:50             ` Wang, Yalin
2015-02-27 13:37               ` Minchan Kim
2015-02-28 13:50                 ` Minchan Kim
2015-03-02  1:59                   ` Wang, Yalin
2015-03-03  0:42                     ` Minchan Kim
2015-02-27 21:02       ` Michal Hocko
2015-02-28  2:11         ` Wang, Yalin
2015-02-28  6:01           ` [RFC V2] " Wang, Yalin
2015-03-02 12:38             ` Michal Hocko
2015-03-03  2:06               ` [RFC V3] " Wang, Yalin
2015-02-28 13:55           ` [RFC] " Minchan Kim
2015-03-02  1:53             ` Wang, Yalin
2015-03-02 12:33           ` Michal Hocko
2015-03-03  3:25 [RFC V3] " Minchan Kim
2015-03-03  3:59 ` Wang, Yalin
2015-03-03  4:14   ` Minchan Kim
2015-03-03  6:46     ` Wang, Yalin
2015-03-03 13:40       ` Minchan Kim
2015-03-05 15:35 ` Michal Hocko
2015-03-09  0:57   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).