* [PATCH v6 0/4] fadvise(DONTNEED) support
@ 2011-02-20 14:43 ` Minchan Kim
  0 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-20 14:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Steven Barrett, Ben Gamari, Peter Zijlstra,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Wu Fengguang,
	Johannes Weiner, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki, Minchan Kim

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It is caused by backup workloads (e.g. a nightly rsync).
Such a workload creates use-once pages but touches each page twice,
which promotes the pages to the active list and ends up evicting the
working set.
So application developers want POSIX_FADV_NOREUSE, but neither Linux
nor other OSes support it. (http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel finds a page under writeback
during invalidate_mapping_pages, the page cannot be dropped.
That makes the hint hard to use, because applications always have
to sync the data before calling fadvise(..., POSIX_FADV_DONTNEED) to
make sure the pages are discardable. In the end they cannot benefit
from the kernel's deferred writes and see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)
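
For illustration (not part of the original mail): a minimal userspace sketch
of the workaround described above. The helper name is made up and the exact
sync flags are one possible choice; the point is that the range must be made
clean before POSIX_FADV_DONTNEED has any chance of dropping it, which is
exactly how the application loses the kernel's deferred write-behind.

	#define _GNU_SOURCE		/* for sync_file_range() */
	#include <fcntl.h>
	#include <unistd.h>

	/* Hypothetical helper: write one chunk, then try to drop its cache. */
	static void write_and_drop(int fd, const void *buf, size_t len, off_t off)
	{
		pwrite(fd, buf, len, off);		/* error handling omitted */
		/*
		 * Force the pages clean; without this, DONTNEED is largely a
		 * no-op for dirty/writeback pages -- but doing it also defeats
		 * the kernel's deferred writes.
		 */
		sync_file_range(fd, off, len,
				SYNC_FILE_RANGE_WAIT_BEFORE |
				SYNC_FILE_RANGE_WRITE |
				SYNC_FILE_RANGE_WAIT_AFTER);
		posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
	}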

In fact, invalidation is a strong hint to the reclaimer:
it means we will not use the page any more. So the idea of this series
is to move invalidated-but-not-yet-freed pages onto the inactive list.
That improves reclaim efficiency considerably and helps prevent
eviction of the working set.

My experiment is as follows.

Test environment:
DRAM: 2G, CPU: Intel(R) Core(TM)2 CPU
rsync backup directory size: 16G

rsync version: 3.0.7
rsync patch: Ben's fadvise patch
The stress scenario runs the following jobs in parallel:

1. git clone linux-2.6
2. make all -j4 linux-mmotm
3. rsync src dst

nrns : no-patched rsync + no stress
prns : patched rsync + no stress
nrs  : no-patched rsync + stress
prs  : patched rsync + stress

For profiling, I added some vmstat counters:
pginvalidate : total number of pages moved by invalidate_mapping_pages
pgreclaim : number of those pages moved to the inactive list's tail via PG_reclaim

                        NRNS    PRNS    NRS     PRS 
Elapsed time            36:01.49        37:13.58        01:23:24        01:21:45
nr_vmscan_write         184     1       296     509 
pgactivate              76559   84714   445214  463143
pgdeactivate            19360   40184   74302   91423
pginvalidate            0       2240333 0       1769147
pgreclaim               0       1849651 0       1650796
pgfault                 406208  421860  72485217        70334416
pgmajfault              212     334     5149    3688
pgsteal_dma             0       0       0       0   
pgsteal_normal          2645174 1545116 2521098 1578651
pgsteal_high            5162080 2562269 6074720 3137294
pgsteal_movable         0       0       0       0   
pgscan_kswapd_dma       0       0       0       0   
pgscan_kswapd_normal    2641732 1545374 2499894 1557882
pgscan_kswapd_high      5143794 2567981 5999585 3051150
pgscan_kswapd_movable   0       0       0       0   
pgscan_direct_dma       0       0       0       0   
pgscan_direct_normal    3643    0       21613   21238
pgscan_direct_high      20174   1783    76980   87848
pgscan_direct_movable   0       0       0       0   
pginodesteal            130     1029    3510    24100
slabs_scanned           1421824 1648128 1870720 1880320
kswapd_steal            7785153 4105620 8498332 4608372
kswapd_inodesteal       189432  474052  342835  472503
pageoutrun              100687  52282   145712  70946
allocstall              22      1       149     163 
pgrotated               0       2231408 2932    1765393
unevictable_pgs_scanned 0       0       0       0   

In the stress test (NRS vs PRS), pgsteal_[normal|high] are reduced by 37% and 48%,
and pgscan_kswapd_[normal|high] are reduced by 37% and 49%.
That means that although the VM scans a smaller window, it still reclaims
enough pages to work well and avoids evicting pages unnecessarily.
rsync's elapsed time is reduced by about 1.5 minutes under stress, but I think
rsync's fadvise usage isn't ideal, because without stress (NRNS vs PRNS) the
patched rsync takes about one minute longer.
I suspect that is due to unnecessary fadvise system calls; if rsync's fadvise
calls were smarter, the effect would be much better than it is now.
Major faults (pgmajfault) are reduced by 28%, which is good.
What I don't understand is why inode steal increases.
If anyone knows, please explain it to me.
Anyway, this patch series improves reclaim efficiency very much.

Steven Barrett has already applied this series to his project kernel,
the "Liquorix kernel", and reported the following, along with one problem.
(The problem is solved by [3/3]; see that patch's description.)

" I've been having really good results with your new patch set that
mitigates the problem where a backup utility or something like that
reads each file once and eventually evicting the original working set
out of the page cache.
...
...
 These patches solved some problems on a friend's desktop.
 He said that his wife wanted to send me kisses and hugs because their
computer was so responsive after the patches were applied.
"
So I think this patch series solves a real problem.

 - [1/3] moves an invalidated page that is dirty/writeback and on the
   active list to the head of the inactive list.
 - [2/3] moves a memcg reclaimable page to the tail of the inactive list.
 - [3/3] moves an invalidated page to the tail of the inactive list when
   its writeout completes, so it can be reclaimed as soon as possible.

This patch series is based on mmotm-02-04.

Changelog since v5:
 - Remove the vmstat profiling patch for the final merge

Changelog since v4:
 - Remove the madvise-related patches and the swap.c cleanup patch
   (I will separate the madvise work from this series and repost it after this series is merged)

Minchan Kim (3):
  deactivate invalidated pages
  memcg: move memcg reclaimable page into tail of inactive list
  Reclaim invalidated page ASAP

 include/linux/memcontrol.h |    6 ++
 include/linux/swap.h       |    1 +
 mm/memcontrol.c            |   27 ++++++++++
 mm/page-writeback.c        |   12 ++++-
 mm/swap.c                  |  116 +++++++++++++++++++++++++++++++++++++++++++-
 mm/truncate.c              |   17 +++++--
 6 files changed, 172 insertions(+), 7 deletions(-)


^ permalink raw reply	[flat|nested] 36+ messages in thread


* [PATCH v6 1/3] deactivate invalidated pages
  2011-02-20 14:43 ` Minchan Kim
@ 2011-02-20 14:43   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-20 14:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Steven Barrett, Ben Gamari, Peter Zijlstra,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Wu Fengguang,
	Johannes Weiner, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki, Minchan Kim

Recently, a thrashing problem was reported.
(http://marc.info/?l=rsync&m=128885034930933&w=2)
It is caused by backup workloads (e.g. a nightly rsync).
Such a workload creates use-once pages but touches each page twice,
which promotes the pages to the active list and ends up evicting the
working set.

Some application developers want POSIX_FADV_NOREUSE to be supported.
But other OSes don't support it, either.
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)

As an alternative, application developers use POSIX_FADV_DONTNEED.
But it has a problem: if the kernel finds a page under writeback
during invalidate_mapping_pages, the page cannot be dropped.
That makes the hint hard to use, because applications always have
to sync the data before calling fadvise(..., POSIX_FADV_DONTNEED) to
make sure the pages are discardable. In the end they cannot benefit
from the kernel's deferred writes and see a performance loss.
(http://insights.oetiker.ch/linux/fadvise.html)

In fact, invalidation is a strong hint to the reclaimer:
it means we will not use the page any more. So let's move a page
that is still being written to the head of the inactive list if we
cannot truncate it right now.

Why the head of the LRU in this patch: a dirty/writeback page will be
flushed sooner or later anyway, and this avoids the single-page
writeout from reclaim (pageout), which is less effective than the
flusher's writeout.
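
For illustration (not part of the original mail): a rough sketch of how the
POSIX_FADV_DONTNEED path reaches invalidate_mapping_pages in 2.6-era kernels,
paraphrased from mm/fadvise.c and simplified. With this patch, pages that
invalidate_mapping_pages cannot drop right away are deactivated instead of
staying on the active list.

	/* sys_fadvise64_64(), case POSIX_FADV_DONTNEED (simplified sketch) */
	if (!bdi_write_congested(mapping->backing_dev_info))
		filemap_flush(mapping);		/* kick off async writeback */

	/* Only whole pages can be invalidated. */
	start_index = (offset + (PAGE_CACHE_SIZE - 1)) >> PAGE_CACHE_SHIFT;
	end_index = endbyte >> PAGE_CACHE_SHIFT;

	if (end_index >= start_index)
		invalidate_mapping_pages(mapping, start_index, end_index);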

Originally, I reused Peter's lru_demote patch with some changes, so I
added his Signed-off-by.

Reported-by: Ben Gamari <bgamari.foss@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
Changelog since v5:
 - call deactivate_page out of lock_page - suggested by Balbir

Changelog since v4:
 - Change function comments - suggested by Johannes
 - Change function name - suggested by Johannes
 - Drop only dirty/writeback pages to deactive pagevec - suggested by Johannes
 - Add acked-by

Changelog since v3:
 - Change function comments - suggested by Johannes
 - Change function name - suggested by Johannes
 - add only dirty/writeback pages to deactive pagevec

Changelog since v2:
 - mapped page leaves alone - suggested by Mel
 - pass part related PG_reclaim in next patch.

Changelog since v1:
 - modify description
 - correct typo
 - add some comment


 include/linux/swap.h |    1 +
 mm/swap.c            |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/truncate.c        |   17 ++++++++---
 3 files changed, 91 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4d55932..c335055 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -215,6 +215,7 @@ extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern int lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void swap_setup(void);
 
 extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index c02f936..4aea806 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,7 @@ int page_cluster;
 
 static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
 static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
 
 /*
  * This path almost never happens for VM activity - pages are normally
@@ -347,6 +348,60 @@ void add_page_to_unevictable_list(struct page *page)
 }
 
 /*
+ * If the page can not be invalidated, it is moved to the
+ * inactive list to speed up its reclaim.  It is moved to the
+ * head of the list, rather than the tail, to give the flusher
+ * threads some time to write it out, as this is much more
+ * effective than the single-page writeout from reclaim.
+ */
+static void lru_deactivate(struct page *page, struct zone *zone)
+{
+	int lru, file;
+
+	if (!PageLRU(page) || !PageActive(page))
+		return;
+
+	/* Some processes are using the page */
+	if (page_mapped(page))
+		return;
+
+	file = page_is_file_cache(page);
+	lru = page_lru_base_type(page);
+	del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+	ClearPageActive(page);
+	ClearPageReferenced(page);
+	add_page_to_lru_list(zone, page, lru);
+	__count_vm_event(PGDEACTIVATE);
+
+	update_page_reclaim_stat(zone, page, file, 0);
+}
+
+static void ____pagevec_lru_deactivate(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		lru_deactivate(page, zone);
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
+
+/*
  * Drain pages out of the cpu's pagevecs.
  * Either "cpu" is the current CPU, and preemption has already been
  * disabled; or "cpu" is being hot-unplugged, and is already dead.
@@ -372,6 +427,29 @@ static void drain_cpu_pagevecs(int cpu)
 		pagevec_move_tail(pvec);
 		local_irq_restore(flags);
 	}
+
+	pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+	if (pagevec_count(pvec))
+		____pagevec_lru_deactivate(pvec);
+}
+
+/**
+ * deactivate_page - forcefully deactivate a page
+ * @page: page to deactivate
+ *
+ * This function hints the VM that @page is a good reclaim candidate,
+ * for example if its invalidation fails due to the page being dirty
+ * or under writeback.
+ */
+void deactivate_page(struct page *page)
+{
+	if (likely(get_page_unless_zero(page))) {
+		struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+		if (!pagevec_add(pvec, page))
+			____pagevec_lru_deactivate(pvec);
+		put_cpu_var(lru_deactivate_pvecs);
+	}
 }
 
 void lru_add_drain(void)
diff --git a/mm/truncate.c b/mm/truncate.c
index 4d415b3..35a379b 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -328,11 +328,12 @@ EXPORT_SYMBOL(truncate_inode_pages);
  * pagetables.
  */
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
-				       pgoff_t start, pgoff_t end)
+		pgoff_t start, pgoff_t end)
 {
 	struct pagevec pvec;
 	pgoff_t next = start;
-	unsigned long ret = 0;
+	unsigned long ret;
+	unsigned long count = 0;
 	int i;
 
 	pagevec_init(&pvec, 0);
@@ -359,9 +360,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 			if (lock_failed)
 				continue;
 
-			ret += invalidate_inode_page(page);
-
+			ret = invalidate_inode_page(page);
 			unlock_page(page);
+			/*
+			 * Invalidation is a hint that the page is no longer
+			 * of interest, so try to speed up its reclaim.
+			 */
+			if (!ret)
+				deactivate_page(page);
+			count += ret;
 			if (next > end)
 				break;
 		}
@@ -369,7 +376,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
 		mem_cgroup_uncharge_end();
 		cond_resched();
 	}
-	return ret;
+	return count;
 }
 EXPORT_SYMBOL(invalidate_mapping_pages);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread


* [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list
  2011-02-20 14:43 ` Minchan Kim
@ 2011-02-20 14:43   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-20 14:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Steven Barrett, Ben Gamari, Peter Zijlstra,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Wu Fengguang,
	Johannes Weiner, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki, Minchan Kim

The rotate_reclaimable_page function moves just written out
pages, which the VM wanted to reclaim, to the end of the
inactive list.  That way the VM will find those pages first
next time it needs to free memory.
This patch applies the same rule to memcg.
It helps prevent unnecessary eviction of a memcg's working set pages.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
Changelog since v5:
 - change page_lru_base_type to page_lru - suggested by Kame

Changelog since v4:
 - add acked-by and reviewed-by
 - change description - suggested by Rik

 include/linux/memcontrol.h |    6 ++++++
 mm/memcontrol.c            |   27 +++++++++++++++++++++++++++
 mm/swap.c                  |    3 ++-
 3 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3da48ae..5a5ce70 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
 extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru(struct page *page);
 extern void mem_cgroup_move_lists(struct page *page,
@@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
 	return ;
 }
 
+static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+	return ;
+}
+
 static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
 {
 	return ;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 686f1ce..8a97571 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
 	mem_cgroup_del_lru_list(page, page_lru(page));
 }
 
+/*
+ * Writeback is about to end against a page which has been marked for immediate
+ * reclaim.  If it still appears to be reclaimable, move it to the tail of the
+ * inactive list.
+ */
+void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
+	enum lru_list lru = page_lru(page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+	/*
+	 * Used bit is set without atomic ops but after smp_wmb().
+	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
+	 */
+	smp_rmb();
+	/* unused or root page is not rotated. */
+	if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+		return;
+	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
+	list_move_tail(&pc->lru, &mz->lists[lru]);
+}
+
 void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
diff --git a/mm/swap.c b/mm/swap.c
index 4aea806..1b9e4eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-			int lru = page_lru_base_type(page);
+			enum lru_list lru = page_lru_base_type(page);
 			list_move_tail(&page->lru, &zone->lru[lru].list);
+			mem_cgroup_rotate_reclaimable_page(page);
 			pgmoved++;
 		}
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread


* [PATCH v6 3/3] Reclaim invalidated page ASAP
  2011-02-20 14:43 ` Minchan Kim
@ 2011-02-20 14:43   ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-20 14:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, LKML, Steven Barrett, Ben Gamari, Peter Zijlstra,
	Rik van Riel, Mel Gorman, KOSAKI Motohiro, Wu Fengguang,
	Johannes Weiner, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki, Minchan Kim

invalidate_mapping_pages is a strong hint to the reclaimer:
it means the user doesn't want to use the page any more.
So, to prevent working set page eviction, this patch moves such a
page to the tail of the inactive list by setting PG_reclaim.

Please remember that pages on the inactive list are part of the
working set just like the active list. If we don't move these pages
to the inactive list's tail, pages near the tail of the inactive list
can be evicted even though we have a strong clue about which pages
are useless. That would be totally wrong.

Now PG_readahead and PG_reclaim share the same bit.
Commit fe3cba17 added ClearPageReclaim to clear_page_dirty_for_io to
prevent fast reclaim of the readahead marker page.

In this series, PG_reclaim is used for invalidated pages, too.
If the VM finds that a page is invalidated and dirty, it sets PG_reclaim
so the page is reclaimed as soon as possible. But then, when the dirty
page is written back, clear_page_dirty_for_io clears PG_reclaim
unconditionally, which defeats this series' goal.
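
For illustration (not part of the original mail): why PG_reclaim gives
as-soon-as-possible reclaim, and why losing the bit in clear_page_dirty_for_io
hurts. This is a paraphrased, simplified sketch of the 2.6-era
end_page_writeback() in mm/filemap.c.

	void end_page_writeback(struct page *page)
	{
		/*
		 * PG_reclaim set: rotate the page to the inactive list's
		 * tail so reclaim finds it first.  If the bit was already
		 * cleared by clear_page_dirty_for_io(), this rotation
		 * never happens.
		 */
		if (TestClearPageReclaim(page))
			rotate_reclaimable_page(page);

		if (!test_clear_page_writeback(page))
			BUG();

		smp_mb__after_clear_bit();
		wake_up_page(page, PG_writeback);
	}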

I think it is okay to clear PG_readahead when the page is dirtied rather
than at writeback time, so this patch moves the ClearPageReclaim call
(the bit is shared with PG_readahead) from clear_page_dirty_for_io to
set_page_dirty.
In v4, that clearing in set_page_dirty had a problem, reported by Steven
Barrett, with compound pages: some drivers (e.g. audio) call
set_page_dirty on a compound page which isn't on the LRU, but my patch
did ClearPageReclaim on that compound page. Without
CONFIG_PAGEFLAGS_EXTENDED, that corrupts the PageTail flag.

I think it doesn't affect THP, and my tests pass with THP enabled,
but I Cc'ed Andrea to double-check.

Reported-by: Steven Barrett <damentz@liquorix.net>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/page-writeback.c |   12 +++++++++++-
 mm/swap.c           |   41 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..b437fe6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1211,6 +1211,17 @@ int set_page_dirty(struct page *page)
 
 	if (likely(mapping)) {
 		int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
+		/*
+		 * readahead/lru_deactivate_page could leave
+		 * PG_readahead/PG_reclaim set due to a race with
+		 * end_page_writeback.  For readahead, if the page is
+		 * written, the flag will be reset, so no problem.
+		 * For lru_deactivate_page, if the page is redirtied, the
+		 * flag will be reset, so no problem either; but if the page
+		 * is used by readahead, it may confuse readahead and make
+		 * it restart the size rampup process.  That is trivial, though.
+		 */
+		ClearPageReclaim(page);
 #ifdef CONFIG_BLOCK
 		if (!spd)
 			spd = __set_page_dirty_buffers;
@@ -1266,7 +1277,6 @@ int clear_page_dirty_for_io(struct page *page)
 
 	BUG_ON(!PageLocked(page));
 
-	ClearPageReclaim(page);
 	if (mapping && mapping_cap_account_dirty(mapping)) {
 		/*
 		 * Yes, Virginia, this is indeed insane.
diff --git a/mm/swap.c b/mm/swap.c
index 1b9e4eb..0a33714 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -354,26 +354,61 @@ void add_page_to_unevictable_list(struct page *page)
  * head of the list, rather than the tail, to give the flusher
  * threads some time to write it out, as this is much more
  * effective than the single-page writeout from reclaim.
+ *
+ * If the page isn't mapped and is dirty or under writeback, it can
+ * be reclaimed ASAP after writeback, using PG_reclaim.
+ *
+ * 1. active, mapped page -> none
+ * 2. active, dirty/writeback page -> inactive, head, PG_reclaim
+ * 3. inactive, mapped page -> none
+ * 4. inactive, dirty/writeback page -> inactive, head, PG_reclaim
+ * 5. inactive, clean -> inactive, tail
+ * 6. Others -> none
+ *
+ * In case 4, the page goes to the head of the inactive list because
+ * the VM expects the flusher threads to write it out, which is much
+ * more effective than the single-page writeout from reclaim.
  */
 static void lru_deactivate(struct page *page, struct zone *zone)
 {
 	int lru, file;
+	bool active;
 
-	if (!PageLRU(page) || !PageActive(page))
+	if (!PageLRU(page))
 		return;
 
 	/* Some processes are using the page */
 	if (page_mapped(page))
 		return;
 
+	active = PageActive(page);
+
 	file = page_is_file_cache(page);
 	lru = page_lru_base_type(page);
-	del_page_from_lru_list(zone, page, lru + LRU_ACTIVE);
+	del_page_from_lru_list(zone, page, lru + active);
 	ClearPageActive(page);
 	ClearPageReferenced(page);
 	add_page_to_lru_list(zone, page, lru);
-	__count_vm_event(PGDEACTIVATE);
 
+	if (PageWriteback(page) || PageDirty(page)) {
+		/*
+		 * PG_reclaim can race with end_page_writeback.
+		 * That can confuse readahead, but the race window
+		 * is _really_ small and it's a non-critical problem.
+		 */
+		SetPageReclaim(page);
+	} else {
+		/*
+		 * The page's writeback ended while it sat in the pagevec;
+		 * move the page to the tail of the inactive list.
+		 */
+		list_move_tail(&page->lru, &zone->lru[lru].list);
+		mem_cgroup_rotate_reclaimable_page(page);
+		__count_vm_event(PGROTATED);
+	}
+
+	if (active)
+		__count_vm_event(PGDEACTIVATE);
 	update_page_reclaim_stat(zone, page, file, 0);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread


* Re: [PATCH v6 1/3] deactivate invalidated pages
  2011-02-20 14:43   ` Minchan Kim
@ 2011-02-21  8:29     ` Johannes Weiner
  -1 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2011-02-21  8:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki

On Sun, Feb 20, 2011 at 11:43:36PM +0900, Minchan Kim wrote:
> Recently, a thrashing problem was reported.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It is caused by backup workloads (e.g. a nightly rsync).
> Such a workload creates use-once pages but touches each page twice,
> which promotes the pages to the active list and ends up evicting the
> working set.
>
> Some application developers want POSIX_FADV_NOREUSE to be supported.
> But other OSes don't support it, either.
> (http://marc.info/?l=linux-mm&m=128928979512086&w=2)
>
> As an alternative, application developers use POSIX_FADV_DONTNEED.
> But it has a problem: if the kernel finds a page under writeback
> during invalidate_mapping_pages, the page cannot be dropped.
> That makes the hint hard to use, because applications always have
> to sync the data before calling fadvise(..., POSIX_FADV_DONTNEED) to
> make sure the pages are discardable. In the end they cannot benefit
> from the kernel's deferred writes and see a performance loss.
> (http://insights.oetiker.ch/linux/fadvise.html)
>
> In fact, invalidation is a strong hint to the reclaimer:
> it means we will not use the page any more. So let's move a page
> that is still being written to the head of the inactive list if we
> cannot truncate it right now.
>
> Why the head of the LRU in this patch: a dirty/writeback page will be
> flushed sooner or later anyway, and this avoids the single-page
> writeout from reclaim (pageout), which is less effective than the
> flusher's writeout.
>
> Originally, I reused Peter's lru_demote patch with some changes, so I
> added his Signed-off-by.
> 
> Reported-by: Ben Gamari <bgamari.foss@gmail.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Rik van Riel <riel@redhat.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Nick Piggin <npiggin@kernel.dk>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list
  2011-02-20 14:43   ` Minchan Kim
@ 2011-02-21  8:40     ` Johannes Weiner
  -1 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2011-02-21  8:40 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki

On Sun, Feb 20, 2011 at 11:43:37PM +0900, Minchan Kim wrote:
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
>  	mem_cgroup_del_lru_list(page, page_lru(page));
>  }
>  
> +/*
> + * Writeback is about to end against a page which has been marked for immediate
> + * reclaim.  If it still appears to be reclaimable, move it to the tail of the
> + * inactive list.
> + */
> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> +	struct mem_cgroup_per_zone *mz;
> +	struct page_cgroup *pc;
> +	enum lru_list lru = page_lru(page);
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	pc = lookup_page_cgroup(page);
> +	/*
> +	 * Used bit is set without atomic ops but after smp_wmb().
> +	 * For making pc->mem_cgroup visible, insert smp_rmb() here.
> +	 */
> +	smp_rmb();
> +	/* unused or root page is not rotated. */
> +	if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
> +		return;

The placement of this barrier is confused and has been fixed up in the
meantime in other places.  It has to be between PageCgroupUsed() and
accessing pc->mem_cgroup.  You can look at the other memcg lru
functions for reference.
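
For illustration (not part of the original mail): a minimal sketch of the
ordering Johannes asks for, with smp_rmb() placed between the
PageCgroupUsed() test and the pc->mem_cgroup access. The surrounding
declarations and helpers are assumed to be the same as in the patch above.

	pc = lookup_page_cgroup(page);
	/* unused page is not rotated. */
	if (!PageCgroupUsed(pc))
		return;
	/*
	 * The used bit is set without atomic ops but after smp_wmb();
	 * pair with that here so pc->mem_cgroup is only read after the
	 * used bit has been observed set.
	 */
	smp_rmb();
	/* root page is not rotated. */
	if (mem_cgroup_is_root(pc->mem_cgroup))
		return;
	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
	list_move_tail(&pc->lru, &mz->lists[lru]);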

^ permalink raw reply	[flat|nested] 36+ messages in thread


* Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list
  2011-02-21  8:40     ` Johannes Weiner
@ 2011-02-21 14:07       ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-21 14:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki

On Mon, Feb 21, 2011 at 5:40 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Sun, Feb 20, 2011 at 11:43:37PM +0900, Minchan Kim wrote:
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -813,6 +813,33 @@ void mem_cgroup_del_lru(struct page *page)
>>       mem_cgroup_del_lru_list(page, page_lru(page));
>>  }
>>
>> +/*
>> + * Writeback is about to end against a page which has been marked for immediate
>> + * reclaim.  If it still appears to be reclaimable, move it to the tail of the
>> + * inactive list.
>> + */
>> +void mem_cgroup_rotate_reclaimable_page(struct page *page)
>> +{
>> +     struct mem_cgroup_per_zone *mz;
>> +     struct page_cgroup *pc;
>> +     enum lru_list lru = page_lru(page);
>> +
>> +     if (mem_cgroup_disabled())
>> +             return;
>> +
>> +     pc = lookup_page_cgroup(page);
>> +     /*
>> +      * Used bit is set without atomic ops but after smp_wmb().
>> +      * For making pc->mem_cgroup visible, insert smp_rmb() here.
>> +      */
>> +     smp_rmb();
>> +     /* unused or root page is not rotated. */
>> +     if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
>> +             return;
>
> The placement of this barrier is confused and has been fixed up in the
> meantime in other places.  It has to be between PageCgroupUsed() and
> accessing pc->mem_cgroup.  You can look at the other memcg lru
> functions for reference.

Yes. I saw your patch at the time but forgot about it.
I will resend a fixed version.
Thanks.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list
  2011-02-21  8:40     ` Johannes Weiner
@ 2011-02-21 15:59       ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-21 15:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki

Fixed version.

>From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@gmail.com>
Date: Tue, 22 Feb 2011 00:53:05 +0900
Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list

The rotate_reclaimable_page function moves just written out
pages, which the VM wanted to reclaim, to the end of the
inactive list.  That way the VM will find those pages first
next time it needs to free memory.
This patch applies the same rule to memcg.
It helps prevent unnecessary eviction of memcg working-set pages.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---

Changelog since v6:
 - adjust smp_rmb - suggested by Johannes

 include/linux/memcontrol.h |    6 ++++++
 mm/memcontrol.c            |   26 ++++++++++++++++++++++++++
 mm/swap.c                  |    3 ++-
 3 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3da48ae..5a5ce70 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,7 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 extern void mem_cgroup_add_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru_list(struct page *page, enum lru_list lru);
+extern void mem_cgroup_rotate_reclaimable_page(struct page *page);
 extern void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru);
 extern void mem_cgroup_del_lru(struct page *page);
 extern void mem_cgroup_move_lists(struct page *page,
@@ -215,6 +216,11 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
 	return ;
 }
 
+static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+	return ;
+}
+
 static inline void mem_cgroup_rotate_lru_list(struct page *page, int lru)
 {
 	return ;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 686f1ce..2fc97fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -813,6 +813,32 @@ void mem_cgroup_del_lru(struct page *page)
 	mem_cgroup_del_lru_list(page, page_lru(page));
 }
 
+/*
+ * Writeback is about to end against a page which has been marked for immediate
+ * reclaim.  If it still appears to be reclaimable, move it to the tail of the
+ * inactive list.
+ */
+void mem_cgroup_rotate_reclaimable_page(struct page *page)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct page_cgroup *pc;
+	enum lru_list lru = page_lru(page);
+
+	if (mem_cgroup_disabled())
+		return;
+
+	pc = lookup_page_cgroup(page);
+	/* unused or root page is not rotated. */
+	if (!PageCgroupUsed(pc))
+		return;
+	/* Ensure pc->mem_cgroup is visible after reading PCG_USED. */
+	smp_rmb();
+	if (mem_cgroup_is_root(pc->mem_cgroup))
+		return;
+	mz = page_cgroup_zoneinfo(pc->mem_cgroup, page);
+	list_move_tail(&pc->lru, &mz->lists[lru]);
+}
+
 void mem_cgroup_rotate_lru_list(struct page *page, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz;
diff --git a/mm/swap.c b/mm/swap.c
index 4aea806..1b9e4eb 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,9 @@ static void pagevec_move_tail(struct pagevec *pvec)
 			spin_lock(&zone->lru_lock);
 		}
 		if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-			int lru = page_lru_base_type(page);
+			enum lru_list lru = page_lru_base_type(page);
 			list_move_tail(&page->lru, &zone->lru[lru].list);
+			mem_cgroup_rotate_reclaimable_page(page);
 			pgmoved++;
 		}
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list
  2011-02-21 15:59       ` Minchan Kim
@ 2011-02-21 16:06         ` Johannes Weiner
  -1 siblings, 0 replies; 36+ messages in thread
From: Johannes Weiner @ 2011-02-21 16:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Nick Piggin, Andrea Arcangeli, Balbir Singh,
	KAMEZAWA Hiroyuki

On Tue, Feb 22, 2011 at 12:59:25AM +0900, Minchan Kim wrote:
> Fixed version.
> 
> >From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Tue, 22 Feb 2011 00:53:05 +0900
> Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list
> 
> The rotate_reclaimable_page function moves just written out
> pages, which the VM wanted to reclaim, to the end of the
> inactive list.  That way the VM will find those pages first
> next time it needs to free memory.
> This patch applies the same rule to memcg.
> It helps prevent unnecessary eviction of memcg working-set pages.
> 
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-20 14:43 ` Minchan Kim
@ 2011-02-21 19:07   ` Andrea Arcangeli
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2011-02-21 19:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Johannes Weiner, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki

Hello,

On Sun, Feb 20, 2011 at 11:43:35PM +0900, Minchan Kim wrote:
> Recently, there was a reported problem about thrashing.
> (http://marc.info/?l=rsync&m=128885034930933&w=2)
> It happens by backup workloads(ex, nightly rsync).
> That's because the workload makes just use-once pages
> and touches pages twice. It promotes the page into

"recently" and "thrashing horribly" seem to signal a regression. Ok
to try to keep the backup from messing up the VM working set, but by
no means should running rsync in a loop lead a server into
"trashing horribly" (other than the additional disk I/O, just as
if rsync were using O_DIRECT).

This effort in teaching rsync to tell the VM it's likely an used-once
type of access to the cache is good (tar will need it too), but if
this is a regression like it appears from the words above ("recently"
and "trashing horribly"), I suspect it's much higher priority to fix a
VM regression than to add fadvise support in rsync/tar. Likely if the
system didn't start "trashing horribly", they wouldn't need rsync.

Then fadvise becomes an improvement on top of that.

It'd be nice if it were at least tested whether an older kernel also trashes
horribly after being left idle overnight. If it still trashes
horribly with 2.6.18, ok... ignore this; otherwise we need a real fix.

I'm quite comfortable that older kernels would do perfectly ok with a
loop of rsync overnight while the system was idle. I also got people
asking me privately what to do to keep the backup from causing swapout, which
further makes me believe something regressed recently, as older VM code
would never swap out on such a workload, even if the data is used twice or 3
times in a row. If it swaps out, that's the real bug.

I had questions about limiting the pagecache size to a certain amount.
That works too, but it's again a band-aid like fadvise, and it's a
rather minor issue compared to fixing the VM so that at least you can
tell the kernel "nuke all clean cache first". Being able to tell the
kernel just that (even if some clever VM algorithm thinks swapping is
better and we want to swap by default) will fix it. We still need a
way to make the kernel behave perfectly, with zero swapping, without
fadvise and without limiting the cache. Maybe setting swappiness to 0
does just that; I suggested that and heard nothing back.
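
(For reference, the knob is vm.swappiness, normally set with
"sysctl vm.swappiness=0" or via /etc/sysctl.conf. A hypothetical,
minimal C equivalent, assuming root privileges, would be:)

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/swappiness", "w");

		if (!f) {
			perror("/proc/sys/vm/swappiness");
			return 1;
		}
		fprintf(f, "0\n");
		return fclose(f) ? 1 : 0;
	}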

If you can reproduce I suggest making sure that at least it doesn't
swap anything during the overnight workload as that would signal a
definitive problem.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-21 19:07   ` Andrea Arcangeli
@ 2011-02-21 22:59     ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-21 22:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Johannes Weiner, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki

Hi Andrea,

On Tue, Feb 22, 2011 at 4:07 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hello,
>
> On Sun, Feb 20, 2011 at 11:43:35PM +0900, Minchan Kim wrote:
>> Recently, there was a reported problem about thrashing.
>> (http://marc.info/?l=rsync&m=128885034930933&w=2)
>> It happens by backup workloads(ex, nightly rsync).
>> That's because the workload makes just use-once pages
>> and touches pages twice. It promotes the page into
>
> "recently" and "thrashing horribly" seem to signal a regression. Ok
> that trying to have backup not messing up the VM working set, but by
> any means running rsync in a loop shouldn't lead a server into
> "trashing horribly" (other than for the additional disk I/O, just like
> if rsync would be using O_DIRECT).
>
> This effort in teaching rsync to tell the VM it's likely an used-once
> type of access to the cache is good (tar will need it too), but if
> this is a regression like it appears from the words above ("recently"
> and "trashing horribly"), I suspect it's much higher priority to fix a
> VM regression than to add fadvise support in rsync/tar. Likely if the
> system didn't start "trashing horribly", they wouldn't need rsync.
>
> Then fadvise becomes an improvement on top of that.
>
> It'd be nice if at least it was tested if older kernel wouldn't trash
> horribly after being left inactive overnight. If it still trashes
> horribly with 2.6.18 ok... ignore this, otherwise we need a real fix.

I don't have a reproducible experiment.
I started the series because of Ben's complaint:
http://marc.info/?l=rsync&m=128885034930933&w=2
I don't know whether Ben can test it with an older kernel, since he has recently gone silent.

I am not sure older kernels handled such a workload well.
Rik had already pointed out rsync's two-touch
problem (http://linux-mm.org/PageReplacementRequirements) and solved
part of it by keeping half of the file pages on the active list (see
inactive_file_is_low) in 2.6.28's big reclaim rework.
So I think the problem with such a workload has been there all along.
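
(For reference, the check I mean works roughly like this -- a sketch of
the idea, not the verbatim kernel source: the inactive file list is only
considered "low", and active file pages only get deactivated, while the
active file list is larger than the inactive one, so about half of the
file pages can stay on the active list.)

	/* Sketch of the global inactive-file check from that era. */
	static int inactive_file_is_low_global(struct zone *zone)
	{
		unsigned long active, inactive;

		active = zone_page_state(zone, NR_ACTIVE_FILE);
		inactive = zone_page_state(zone, NR_INACTIVE_FILE);

		return active > inactive;
	}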

>
> I'm quite comfortable that older kernels would do perfectly ok with a
> loop of rsync overnight while the system was idle. I also got people
> asking me privately what to do to avoid the backup to swapout, that
> further make me believe something regressed recently as older VM code
> would never swapout on such a workload, even if you do used twice or 3
> times in a row. If it swapout that's the real bug.

Hmm,,,

>
> I had questions about limiting the pagecache size to a certain amount,
> that works too, but that's again a band aid like fadvise, and it's
> real minor issue compared to fixing the VM so that at least you can
> tell the kernel "nuke all clean cache first", being able to tell the
> kernel just that (even if some VM clever algorithm thinks swapping is
> better and we want to swap by default) will fix it. We still need a
> way to make the kernel behave perfect with zero swapping without
> fadvise and without limiting the cache. Maybe setting swappiness to 0
> just does that, I suggested that and I heard nothing back.

I don't think it's desirable to set swappiness to 0, since that changes
the system-wide reclaim policy just to work around one backup program's
behavior.
And this patch's benefit is not only preventing working-set eviction.
Before this patch, an application had to sync the data for fadvise to
take effect, which is very inefficient because of the slow sync: if the
invalidation met a dirty page, it simply skipped it. But after this
patch, an application can call fadvise without syncing, because we can
move the page to the head or tail of the inactive list.
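
To illustrate the intended usage (a hypothetical sketch of the
application-side pattern, not Ben's actual rsync patch): the application
hints each just-written range away with no fsync() at all, and with this
series the still-dirty pages are moved toward reclaim instead of being
skipped.

	#define _XOPEN_SOURCE 600
	#include <fcntl.h>
	#include <unistd.h>

	/* Copy loop for use-once data: drop each chunk right after writing it. */
	static int copy_use_once(int in_fd, int out_fd)
	{
		char buf[1 << 16];
		off_t off = 0;
		ssize_t n;

		while ((n = read(in_fd, buf, sizeof(buf))) > 0) {
			if (write(out_fd, buf, n) != n)
				return -1;
			/* No fsync(out_fd) first; dirty pages are now deactivated
			 * instead of being ignored by the invalidation. */
			posix_fadvise(out_fd, off, n, POSIX_FADV_DONTNEED);
			off += n;
		}
		return n < 0 ? -1 : 0;
	}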

So I think the patch is valuable enough to merge?

>
> If you can reproduce I suggest making sure that at least it doesn't
> swap anything during the overnight workload as that would signal a
> definitive problem.
>
With swappiness = 0?
As I said, I don't like that solution.

And as I said, I need Ben's help, but I am not sure he can help.
Thanks for the review, Andrea.


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-21 22:59     ` Minchan Kim
@ 2011-02-22 13:28       ` Andrea Arcangeli
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 13:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Johannes Weiner, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki

Hi Minchan,

On Tue, Feb 22, 2011 at 07:59:51AM +0900, Minchan Kim wrote:
> I don't have a reproducible experiment.
> I started the series with Ben's complain.
> http://marc.info/?l=rsync&m=128885034930933&w=2

Yes I've noticed.

> I don't know Ben could test it with older kernel since recently he got silent.

That's my point, we should check if the "thrashing horribly" is really
a "recently" or if it has always happened before with 2.6.18 and previous.

> I am not sure older kernel worked well such as workload.
> That's because Rik had been pointed out rsync's two touch
> problem(http://linux-mm.org/PageReplacementRequirements) and solved
> part of the problem with remaining half of the active file pages(look
> at inactive_file_is_low) on 2.6.28's big change reclaim.
> So I think the problem about such workload have been in there.

It's possible it's an old problem, but frankly I doubt that any
swapping would have ever happened before, no matter how much rsync you
would run in a loop. As said I also got PM from users asking what they
can do to limit the pagecache because their systems are swapping
overnight because of backup loads, that's definitely not ok. Or at
least there must be a tweak to tell the VM "stop doing the swapping
thing with backup". I didn't yet try to reproduce or debug this as I'm
busy with other compaction/THP related bits.

> And this patch's benefit is not only for preventing working set.
> Before this patch, application should sync the data before fadvise to
> take a effect but it's very inefficient by slow sync operation. If the
> invalidation meet dirty page, it can skip the page. But after this
> patch, application could fdavise without sync because we could move
> the page of head/or tail of inactive list.

I agree, the objective of the patch definitely looks good. Before, the
fadvise would be ignored without an fdatasync preceding it, as the pages
were dirty and couldn't be discarded.

> So I think the patch is valuable enough to merge?

The objective looks good, I didn't have time to review all details of
the patches but I'll try to review it later.

> And as I said, I need Ben's help but I am not sure he can.
> Thanks for the review, Andrea.

Thanks for the effort in improving fadvise.

I'd like to try to find the time to check if we've a regression
without fadvise too. It's perfectly ok to use fadvise in rsync but my
point is that the kernel should not lead to trashing of running apps
regardless of fadvise or not, and I think that is higher priority to
fix than the fadvise improvement. But still the fadvise improvement is
very valuable to increase overall performance and hopefully to avoid
wasting most of filesystem cache because of backups (but wasting cache
shouldn't lead to trashing horribly, just more I/O from apps after the
backup run, like after starting the app the first time, trashing
usually means we swapped out or discarded mapped pages too and that's
wrong).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-22 13:28       ` Andrea Arcangeli
@ 2011-02-22 14:26         ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-02-22 14:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Johannes Weiner, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki

On Tue, Feb 22, 2011 at 02:28:04PM +0100, Andrea Arcangeli wrote:
> Hi Minchan,
> 
> On Tue, Feb 22, 2011 at 07:59:51AM +0900, Minchan Kim wrote:
> > I don't have a reproducible experiment.
> > I started the series with Ben's complain.
> > http://marc.info/?l=rsync&m=128885034930933&w=2
> 
> Yes I've noticed.
> 
> > I don't know Ben could test it with older kernel since recently he got silent.
> 
> That's my point, we should check if the "thrashing horribly" is really
> a "recently" or if it has always happened before with 2.6.18 and previous.
> 
> > I am not sure older kernel worked well such as workload.
> > That's because Rik had been pointed out rsync's two touch
> > problem(http://linux-mm.org/PageReplacementRequirements) and solved
> > part of the problem with remaining half of the active file pages(look
> > at inactive_file_is_low) on 2.6.28's big change reclaim.
> > So I think the problem about such workload have been in there.
> 
> It's possible it's an old problem, but frankly I doubt that any
> swapping would have ever happened before, no matter how much rsync you
> would run in a loop. As said I also got PM from users asking what they
> can do to limit the pagecache because their systems are swapping
> overnight because of backup loads, that's definitely not ok. Or at
> least there must be a tweak to tell the VM "stop doing the swapping
> thing with backup". I didn't yet try to reproduce or debug this as I'm
> busy with other compaction/THP related bits.
> 
> > And this patch's benefit is not only for preventing working set.
> > Before this patch, application should sync the data before fadvise to
> > take a effect but it's very inefficient by slow sync operation. If the
> > invalidation meet dirty page, it can skip the page. But after this
> > patch, application could fdavise without sync because we could move
> > the page of head/or tail of inactive list.
> 
> I agree, the objective of the patch definitely looks good. Before the
> fadvise would be ignored without a fdatasync before it as the pages
> were dirty and couldn't be discarded.
> 
> > So I think the patch is valuable enough to merge?
> 
> The objective looks good, I didn't have time to review all details of
> the patches but I'll try to review it later.
> 
> > And as I said, I need Ben's help but I am not sure he can.
> > Thanks for the review, Andrea.
> 
> Thanks for the effort in improving fadvise.
> 
> I'd like to try to find the time to check if we've a regression
> without fadvise too. It's perfectly ok to use fadvise in rsync but my
> point is that the kernel should not lead to trashing of running apps
> regardless of fadvise or not, and I think that is higher priority to
> fix than the fadvise improvement. But still the fadvise improvement is
> very valuable to increase overall performance and hopefully to avoid
> wasting most of filesystem cache because of backups (but wasting cache
> shouldn't lead to trashing horribly, just more I/O from apps after the
> backup run, like after starting the app the first time, trashing
> usually means we swapped out or discarded mapped pages too and that's
> wrong).

I agree with your opinion, but I hope the patch goes into 2.6.38 or 39.
Because even if we find the regression's root cause, how would this series change?
I think it makes no difference before or after.
Of course, if rsync-like applications start to use fadvise aggressively, the problem
could become harder to see, but we still have older kernels and older rsync, so we can
reproduce it and then find the root cause.
What's the problem if the series is merged?
If there is a reasonable one, it's no problem to hold the series back.

I _totally_ agree with your opinion and I want to find the root cause of the regression, too.
But unfortunately, I don't have the time or environment to reproduce it. ;(
I hope clever people like you will have time to find it and report it to linux-mm
in the future.

Ben, could you test your workload on an older 2.6.18 kernel if you see this thread?
It would help us very much.
Thanks.

> 
> Thanks,
> Andrea

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-22 14:26         ` Minchan Kim
@ 2011-02-22 14:46           ` Andrea Arcangeli
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, LKML, Steven Barrett, Ben Gamari,
	Peter Zijlstra, Rik van Riel, Mel Gorman, KOSAKI Motohiro,
	Wu Fengguang, Johannes Weiner, Nick Piggin, Balbir Singh,
	KAMEZAWA Hiroyuki

On Tue, Feb 22, 2011 at 11:26:10PM +0900, Minchan Kim wrote:
> I agree your opinion but I hope the patch is going on 2.6.38 or 39.
> That's because if we find regression's root cause, how could this series be changed?
> I think it's no difference before and after.
> Of course, if rsync like applicatoin start to use fadvise agressively, the problem
> could be buried on toe but we still have a older kernel and older rsync so we can
> reproduce it then we can find the root cause.

No risk to hide it (I do backups with tar ;). I'm also assuming your
modification to rsync isn't going to be on by default (for small
working set, it's ok if rsync holds the cache and doesn't discard it).

> What's the problem if the series is merged?
> If it is reasonable, it's no problem to pend the series.

I've absolutely no problem with the objective of this series. The
objective looks very good and it can increase performance (as shown
by your benchmark: 2 minutes saved from your workload under stress while only
running a few seconds slower without stress).

> I _totally_ agree your opinion and I want to find root cause of the regression, too.
> But unfortunatly, I don't have any time and enviroment to reproduce it. ;(
> I hope clever people like you would have a time to find it and report it to linux-mm
> in future.
> 
> Ben. Could you test your workload on older 2.6.18 kernel if you see the thread?
> It could help us very much.

Exactly, I only wanted to suggest to check what happens with 2.6.18 to
at least know if it's a regression or not. Because if we have a
regression, we need more work on this.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-22 13:28       ` Andrea Arcangeli
@ 2011-02-22 17:03         ` Jeffrey Hundstad
  -1 siblings, 0 replies; 36+ messages in thread
From: Jeffrey Hundstad @ 2011-02-22 17:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, Steven Barrett,
	Ben Gamari, Peter Zijlstra, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Wu Fengguang, Johannes Weiner, Nick Piggin,
	Balbir Singh, KAMEZAWA Hiroyuki

On 02/22/2011 07:28 AM, Andrea Arcangeli wrote:
> Hi Minchan,
> That's my point, we should check if the "thrashing horribly" is really
> a "recently" or if it has always happened before with 2.6.18 and previous.

I would bet that "thrashing" isn't what he meant.  He almost certainly
meant "needless I/O" and nothing to do with swapping.  I've seen similar
reports before.  They almost always get answered with
a "set your swappiness to an appropriate value" type answer, or a "your
userspace app needs to get smarter" type answer.  I think they are
trying to do the latter, but need some help from the kernel, and that's
what they're trying to do here.

-- 
Jeffrey Hundstad


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v6 0/4] fadvise(DONTNEED) support
  2011-02-22 17:03         ` Jeffrey Hundstad
@ 2011-02-22 17:11           ` Andrea Arcangeli
  -1 siblings, 0 replies; 36+ messages in thread
From: Andrea Arcangeli @ 2011-02-22 17:11 UTC (permalink / raw)
  To: Jeffrey Hundstad
  Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, Steven Barrett,
	Ben Gamari, Peter Zijlstra, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Wu Fengguang, Johannes Weiner, Nick Piggin,
	Balbir Singh, KAMEZAWA Hiroyuki

On Tue, Feb 22, 2011 at 11:03:51AM -0600, Jeffrey Hundstad wrote:
> On 02/22/2011 07:28 AM, Andrea Arcangeli wrote:
> > Hi Minchan,
> > That's my point, we should check if the "thrashing horribly" is really
> > a "recently" or if it has always happened before with 2.6.18 and previous.
> 
> I would bet that "thrashing" isn't what he meant.  He almost certainly 
> meant "needless I/O" and nothing to do with swapping.  I've seen this 
> similar report before.  This report is almost always gets answered with 
> a "set your swappiness to an appropriate value" type answer.  Or a "your 

Hmm, but swappiness can't help if "needless I/O in the morning after
backup" is the problem. So it wouldn't be the right suggestion if
"needless I/O" is the problem.

We need a "vmstat 1" during the "trashing horribly"/"needless I/O" in
the morning to be sure.

> userspace app needs to get smarter" type answer.  I think they are 
> trying to do the latter, but need some help from the kernel, and that's 
> what they're trying to do here.

That's a fine thing. But I got reports independent of this thread from
friends asking me how to prevent swapping overnight because of
backups, so I connected those reports to the "trashing
horribly"/"recently" above. I had no problems personally, but my backup runs
once (not twice, so it doesn't activate/reference the pages). I haven't had
time yet to investigate further and see whether things work ok for me even
with an rsync loop activating and referencing everything.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page
  2011-02-21 15:59       ` Minchan Kim
@ 2011-03-28 12:51         ` Eric Dumazet
  -1 siblings, 0 replies; 36+ messages in thread
From: Eric Dumazet @ 2011-03-28 12:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Johannes Weiner, linux-mm, LKML, Steven Barrett,
	Ben Gamari, Peter Zijlstra, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Wu Fengguang, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, KAMEZAWA Hiroyuki

On Tuesday, 22 February 2011 at 00:59 +0900, Minchan Kim wrote:
> Fixed version.
> 
> From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Tue, 22 Feb 2011 00:53:05 +0900
> Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list
> 
> The rotate_reclaimable_page function moves just written out
> pages, which the VM wanted to reclaim, to the end of the
> inactive list.  That way the VM will find those pages first
> next time it needs to free memory.
> This patch applies the same rule to memcg.
> It can help prevent unnecessary eviction of memcg working set pages.
> 
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> ---

Hmm... "inline inline" is an error on some gcc versions

  CC      arch/x86/kernel/asm-offsets.s
In file included from include/linux/swap.h:8,
                 from include/linux/suspend.h:4,
                 from arch/x86/kernel/asm-offsets.c:12:
include/linux/memcontrol.h:220: error: duplicate `inline'
make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1


> +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> +{
> +	return ;
> +}
> +

[PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page proto

commit 3f58a8294333 (move memcg reclaimable page into tail of inactive
list) added the inline keyword twice in its prototype.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..5e9840f5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -216,7 +216,7 @@ static inline void mem_cgroup_del_lru_list(struct page *page, int lru)
 	return ;
 }
 
-static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
+static inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
 {
 	return ;
 }
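
For reference, a minimal standalone reproduction of the failure
(hypothetical test file, not part of the kernel tree; only the
duplicate `inline' diagnostic itself comes from the build log above,
and newer gcc versions accept the duplicated specifier, so a clean
compile here does not prove the header was fine everywhere):

  /* dup-inline.c: hypothetical test file, not from this thread. */
  static inline inline void broken_stub(void)	/* some gcc: duplicate `inline' */
  {
  }

  static inline void fixed_stub(void)		/* the corrected form */
  {
  }

  int main(void)
  {
  	broken_stub();
  	fixed_stub();
  	return 0;
  }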



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page
  2011-03-28 12:51         ` Eric Dumazet
@ 2011-03-28 13:45           ` Minchan Kim
  -1 siblings, 0 replies; 36+ messages in thread
From: Minchan Kim @ 2011-03-28 13:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Johannes Weiner, linux-mm, LKML, Steven Barrett,
	Ben Gamari, Peter Zijlstra, Rik van Riel, Mel Gorman,
	KOSAKI Motohiro, Wu Fengguang, Nick Piggin, Andrea Arcangeli,
	Balbir Singh, KAMEZAWA Hiroyuki

On Mon, Mar 28, 2011 at 02:51:46PM +0200, Eric Dumazet wrote:
> On Tuesday, 22 February 2011 at 00:59 +0900, Minchan Kim wrote:
> > Fixed version.
> > 
> > From be7d31f6e539bbad1ebedf52c6a51a4a80f7976a Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan.kim@gmail.com>
> > Date: Tue, 22 Feb 2011 00:53:05 +0900
> > Subject: [PATCH v7 2/3] memcg: move memcg reclaimable page into tail of inactive list
> > 
> > The rotate_reclaimable_page function moves just written out
> > pages, which the VM wanted to reclaim, to the end of the
> > inactive list.  That way the VM will find those pages first
> > next time it needs to free memory.
> > This patch applies the same rule to memcg.
> > It can help prevent unnecessary eviction of memcg working set pages.
> > 
> > Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > ---
> 
> Hmm... "inline inline" is an error on some gcc versions
> 
>   CC      arch/x86/kernel/asm-offsets.s
> In file included from include/linux/swap.h:8,
>                  from include/linux/suspend.h:4,
>                  from arch/x86/kernel/asm-offsets.c:12:
> include/linux/memcontrol.h:220: error: duplicate `inline'
> make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1
> 
> 
> > +static inline inline void mem_cgroup_rotate_reclaimable_page(struct page *page)
> > +{
> > +	return ;
> > +}
> > +
> 
> [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page proto
> 
> commit 3f58a8294333 (move memcg reclaimable page into tail of inactive
> list) added the inline keyword twice in its prototype.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

It was totally my fault.
Thanks very much. 

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2011-03-28 13:46 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-20 14:43 [PATCH v6 0/4] fadvise(DONTNEED) support Minchan Kim
2011-02-20 14:43 ` [PATCH v6 1/3] deactivate invalidated pages Minchan Kim
2011-02-21  8:29   ` Johannes Weiner
2011-02-20 14:43 ` [PATCH v6 2/3] memcg: move memcg reclaimable page into tail of inactive list Minchan Kim
2011-02-21  8:40   ` Johannes Weiner
2011-02-21 14:07     ` Minchan Kim
2011-02-21 15:59     ` Minchan Kim
2011-02-21 16:06       ` Johannes Weiner
2011-03-28 12:51       ` [PATCH] memcg: fix mem_cgroup_rotate_reclaimable_page Eric Dumazet
2011-03-28 13:45         ` Minchan Kim
2011-02-20 14:43 ` [PATCH v6 3/3] Reclaim invalidated page ASAP Minchan Kim
2011-02-21 19:07 ` [PATCH v6 0/4] fadvise(DONTNEED) support Andrea Arcangeli
2011-02-21 22:59   ` Minchan Kim
2011-02-22 13:28     ` Andrea Arcangeli
2011-02-22 14:26       ` Minchan Kim
2011-02-22 14:46         ` Andrea Arcangeli
2011-02-22 17:03       ` Jeffrey Hundstad
2011-02-22 17:11         ` Andrea Arcangeli
