linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2
@ 2017-02-02 19:19 Johannes Weiner
  2017-02-02 19:19 ` [PATCH 1/7] mm: vmscan: scan dirty pages even in laptop mode Johannes Weiner
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Hi Andrew,

here are some minor updates to the series. It's nothing functional,
just code comments and updates to the changelogs from the mailing list
discussions. Since we don't have a good delta system for changelogs
I'm resending the entire thing as a drop-in replacement for -mm.

These are the changes:

1. mm: vmscan: scan dirty pages even in laptop mode

   Mel tested the entire series, not just one patch. Move his test
   conclusions from 'mm: vmscan: remove old flusher wakeup from direct
   reclaim' into the series header in patch 1. Also, reflect the fact
   that these test results are indeed Mel's, not mine.

2. mm: vmscan: kick flushers when we encounter dirty pages on the LRU

   Mention the trade-off between flush-the-world/flush-the-scanwindow
   type wakeups in the changelog, as per the mailing list discussion.

3. mm: vmscan: move dirty pages out of the way until they're flushed

   Correct the last paragraph in the changelog. We're not activating
   dirty/writeback pages after they have rotated twice; they are being
   activated straight away to get them out of the reclaimer's face.
   This was a vestige from an earlier version of the patch.

4. mm: vmscan: move dirty pages out of the way until they're flushed fix

   Code comment fixlet to explain why we activate dirty/writeback pages.

Thanks!

 include/linux/mm_inline.h        |  7 ++++
 include/linux/mmzone.h           |  2 -
 include/linux/writeback.h        |  2 +-
 include/trace/events/writeback.h |  2 +-
 mm/swap.c                        |  9 +++--
 mm/vmscan.c                      | 77 ++++++++++++++++++--------------------
 6 files changed, 50 insertions(+), 49 deletions(-)


* [PATCH 1/7] mm: vmscan: scan dirty pages even in laptop mode
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-02 19:19 ` [PATCH 2/7] mm: vmscan: kick flushers when we encounter dirty pages on the LRU Johannes Weiner
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Patch series "mm: vmscan: fix kswapd writeback regression".

We noticed a regression on multiple hadoop workloads when moving from 3.10
to 4.0 and 4.6, which involves kswapd getting tangled up in page writeout,
causing direct reclaim herds that also don't make progress.

I tracked it down to the thrash avoidance efforts after 3.10 that make the
kernel better at keeping use-once cache and use-many cache sorted on the
inactive and active list, with more aggressive protection of the active
list as long as there is inactive cache.  Unfortunately, our workload's
use-once cache is mostly from streaming writes.  Waiting for writes to
avoid potential reloads in the future is not a good tradeoff.

These patches do the following:

1. Wake the flushers when kswapd sees a lump of dirty pages. It's
   possible to be below the dirty background limit and still have
   cache velocity push them through the LRU. So start a-flushin'.

2. Let kswapd only write pages that have been rotated twice. This
   makes sure we really tried to get all the clean pages on the
   inactive list before resorting to horrible LRU-order writeback.

3. Move rotating dirty pages off the inactive list. Instead of
   churning or waiting on page writeback, we'll go after clean active
   cache. This might lead to thrashing, but in this state memory
   demand outstrips IO speed anyway, and reads are faster than writes.
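
A condensed sketch of how the pieces fit together (illustration only; the
function below is invented for this summary, and the real checks are spread
across shrink_inactive_list() and shrink_page_list() in the diffs that
follow):

	static void dirty_lru_policy_sketch(struct pglist_data *pgdat,
					    struct page *page,
					    bool all_isolated_unqueued_dirty)
	{
		/* 1. Aging has outrun writeback: wake the flushers for all
		 *    outstanding dirty data and note that the node is dirty. */
		if (all_isolated_unqueued_dirty) {
			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
			set_bit(PGDAT_DIRTY, &pgdat->flags);
		}

		if (!PageDirty(page) && !PageWriteback(page))
			return;

		if (current_is_kswapd() && PageReclaim(page) &&
		    test_bit(PGDAT_DIRTY, &pgdat->flags)) {
			/* 2. The scanner has met this page before and it is
			 *    still not queued for IO: last resort, write it
			 *    out in LRU order (pageout() in the real code). */
		} else {
			/* 3. First encounter: mark it for immediate reclaim
			 *    and park it on the active list until the
			 *    flushers get to it (goto activate_locked). */
			SetPageReclaim(page);
		}
	}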

Mel backported the series to 4.10-rc5 with one minor conflict and ran a
couple of tests on it.  A mixed read/write random workload didn't show
anything interesting.  A write-only database workload didn't show much
difference in performance, but there were slight reductions in IO --
probably in the noise.

simoop did show big differences, although not as big as Mel expected.
This is Chris Mason's workload that simulates the VM activity of hadoop.
Mel won't go through the full details, but over the samples measured
during an hour it reported:

                                         4.10.0-rc5            4.10.0-rc5
                                            vanilla         johannes-v1r1
Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)

The latencies are actually completely horrific in comparison to 4.4
(and 4.10-rc5 is worse than 4.9 according to historical data for
reasons Mel hasn't analysed yet).

Still, 95% of write latency (p95-write) is halved by the series and
allocation latency is way down.  Direct reclaim activity is one fifth
of what it was according to vmstats.  Kswapd activity is higher but
this is not necessarily surprising.  Kswapd efficiency is unchanged at
99% (99% of pages scanned were reclaimed) but direct reclaim efficiency
went from 77% to 99%.

In the vanilla kernel, 627MB of data was written back from reclaim
context.  With the series, no data was written back.  With or without
the patch, pages are being immediately reclaimed after writeback
completes.  However, with the patch, only 1/8th of the pages are
reclaimed like this.

This patch (of 5):

We have an elaborate dirty/writeback throttling mechanism inside the
reclaim scanner, but for that to work the pages have to go through
shrink_page_list() and get counted for what they are.  Otherwise, we mess
up the LRU order and don't match reclaim speed to writeback.

Especially during deactivation, there is never a reason to skip dirty
pages; nothing is even trying to write them out from there.  Don't mess up
the LRU order for nothing; shuffle these pages along.

Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mmzone.h |  2 --
 mm/vmscan.c            | 14 ++------------
 2 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index df992831fde7..338a786a993f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -236,8 +236,6 @@ struct lruvec {
 #define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
 #define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
 
-/* Isolate clean file */
-#define ISOLATE_CLEAN		((__force isolate_mode_t)0x1)
 /* Isolate unmapped file */
 #define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x2)
 /* Isolate for asynchronous migration */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7bb23ff229b6..0d05f7f3b532 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -87,6 +87,7 @@ struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	enum zone_type reclaim_idx;
 
+	/* Writepage batching in laptop mode; RECLAIM_WRITE */
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1373,13 +1374,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	 * wants to isolate pages it will be able to operate on without
 	 * blocking - clean pages for the most part.
 	 *
-	 * ISOLATE_CLEAN means that only clean pages should be isolated. This
-	 * is used by reclaim when it is cannot write to backing storage
-	 *
 	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
 	 * that it is possible to migrate without blocking
 	 */
-	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+	if (mode & ISOLATE_ASYNC_MIGRATE) {
 		/* All the caller can do on PageWriteback is block */
 		if (PageWriteback(page))
 			return ret;
@@ -1387,10 +1385,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		if (PageDirty(page)) {
 			struct address_space *mapping;
 
-			/* ISOLATE_CLEAN means only clean pages */
-			if (mode & ISOLATE_CLEAN)
-				return ret;
-
 			/*
 			 * Only pages without mappings or that have a
 			 * ->migratepage callback are possible to migrate
@@ -1731,8 +1725,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	if (!sc->may_unmap)
 		isolate_mode |= ISOLATE_UNMAPPED;
-	if (!sc->may_writepage)
-		isolate_mode |= ISOLATE_CLEAN;
 
 	spin_lock_irq(&pgdat->lru_lock);
 
@@ -1929,8 +1921,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	if (!sc->may_unmap)
 		isolate_mode |= ISOLATE_UNMAPPED;
-	if (!sc->may_writepage)
-		isolate_mode |= ISOLATE_CLEAN;
 
 	spin_lock_irq(&pgdat->lru_lock);
 
-- 
2.11.0


* [PATCH 2/7] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
  2017-02-02 19:19 ` [PATCH 1/7] mm: vmscan: scan dirty pages even in laptop mode Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-02 19:19 ` [PATCH 3/7] mm: vmscan: kick flushers when we encounter dirty pages on the LRU fix Johannes Weiner
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Memory pressure can put dirty pages at the end of the LRU without anybody
running into dirty limits.  Don't start writing individual pages from
kswapd while the flushers might be asleep.

Unlike the old direct reclaim flusher wakeup (removed in the next
patch) that flushes the number of pages just scanned, this patch wakes
the flushers for all outstanding dirty pages. That seemed to perform
better in a synthetic test that pushes dirty pages to the end of the
LRU and into reclaim, because we know LRU aging outstrips writeback
already, and this way we give younger dirty pages a head start rather than
than wait until reclaim runs into them as well. It also means less
plugging and risk of exhausting the struct request pool from reclaim.
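
As a hedged aside (not part of the patch), the flush-the-world versus
flush-the-scanwindow distinction is just the nr_pages argument of the
4.10-era interface:

	void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);

	/* Old direct-reclaim wakeup, removed in the next patch: write back
	 * roughly as many pages as reclaim just scanned. */
	wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
			       WB_REASON_TRY_TO_FREE_PAGES);

	/* This patch: nr_pages == 0 means "all outstanding dirty pages",
	 * giving younger dirty data a head start on writeback. */
	wakeup_flusher_threads(0, WB_REASON_VMSCAN);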

There is a concern that this will cause temporary files that used to
get dirtied and truncated before writeback to now get written to disk
under memory pressure. If this turns out to be a real problem, we'll
have to revisit this and tame the reclaim flusher wakeups.

Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/writeback.h        |  2 +-
 include/trace/events/writeback.h |  2 +-
 mm/vmscan.c                      | 18 +++++++++++++-----
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5527d910ba3d..a3c0cbd7c888 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -46,7 +46,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_VMSCAN,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 2ccd9ccbf9ef..7bd8783a590f 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -31,7 +31,7 @@
 
 #define WB_WORK_REASON							\
 	EM( WB_REASON_BACKGROUND,		"background")		\
-	EM( WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages")	\
+	EM( WB_REASON_VMSCAN,			"vmscan")		\
 	EM( WB_REASON_SYNC,			"sync")			\
 	EM( WB_REASON_PERIODIC,			"periodic")		\
 	EM( WB_REASON_LAPTOP_TIMER,		"laptop_timer")		\
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d05f7f3b532..56ea8d24041f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1798,12 +1798,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
-		 * implies that flushers are not keeping up. In this case, flag
-		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
-		 * reclaim context.
+		 * implies that flushers are not doing their job. This can
+		 * happen when memory pressure pushes dirty pages to the end
+		 * of the LRU without the dirty limits being breached. It can
+		 * also happen when the proportion of dirty pages grows not
+		 * through writes but through memory pressure reclaiming all
+		 * the clean cache. And in some cases, the flushers simply
+		 * cannot keep up with the allocation rate. Nudge the flusher
+		 * threads in case they are asleep, but also allow kswapd to
+		 * start writing pages during reclaim.
 		 */
-		if (stat.nr_unqueued_dirty == nr_taken)
+		if (stat.nr_unqueued_dirty == nr_taken) {
+			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
 			set_bit(PGDAT_DIRTY, &pgdat->flags);
+		}
 
 		/*
 		 * If kswapd scans pages marked marked for immediate
@@ -2787,7 +2795,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
+						WB_REASON_VMSCAN);
 			sc->may_writepage = 1;
 		}
 	} while (--sc->priority >= 0);
-- 
2.11.0


* [PATCH 3/7] mm: vmscan: kick flushers when we encounter dirty pages on the LRU fix
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
  2017-02-02 19:19 ` [PATCH 1/7] mm: vmscan: scan dirty pages even in laptop mode Johannes Weiner
  2017-02-02 19:19 ` [PATCH 2/7] mm: vmscan: kick flushers when we encounter dirty pages on the LRU Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-02 19:19 ` [PATCH 4/7] mm: vmscan: remove old flusher wakeup from direct reclaim path Johannes Weiner
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Mention dirty expiration as a condition: we need dirty data that is too
recent for periodic flushing and not large enough for waking up limit
flushing.  As per Mel.
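
Spelled out as a condition (illustration only; the sysctl names are real,
the variables here are made up for the sketch), the troublesome dirty data
is the kind no existing mechanism will flush on its own:

	/*
	 * Dirty pages that reclaim can run into before anyone else
	 * writes them:
	 *   - younger than vm.dirty_expire_centisecs, so the periodic
	 *     (vm.dirty_writeback_centisecs) flush leaves them alone, and
	 *   - below the vm.dirty_background_{ratio,bytes} threshold, so
	 *     background writeback never wakes up.
	 */
	bool nobody_flushes_yet = page_dirty_age < dirty_expire_interval &&
				  nr_dirty < dirty_background_thresh;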

Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/vmscan.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56ea8d24041f..83c92b866afe 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1799,14 +1799,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
 		 * implies that flushers are not doing their job. This can
-		 * happen when memory pressure pushes dirty pages to the end
-		 * of the LRU without the dirty limits being breached. It can
-		 * also happen when the proportion of dirty pages grows not
-		 * through writes but through memory pressure reclaiming all
-		 * the clean cache. And in some cases, the flushers simply
-		 * cannot keep up with the allocation rate. Nudge the flusher
-		 * threads in case they are asleep, but also allow kswapd to
-		 * start writing pages during reclaim.
+		 * happen when memory pressure pushes dirty pages to the end of
+		 * the LRU before the dirty limits are breached and the dirty
+		 * data has expired. It can also happen when the proportion of
+		 * dirty pages grows not through writes but through memory
+		 * pressure reclaiming all the clean cache. And in some cases,
+		 * the flushers simply cannot keep up with the allocation
+		 * rate. Nudge the flusher threads in case they are asleep, but
+		 * also allow kswapd to start writing pages during reclaim.
 		 */
 		if (stat.nr_unqueued_dirty == nr_taken) {
 			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
-- 
2.11.0


* [PATCH 4/7] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
                   ` (2 preceding siblings ...)
  2017-02-02 19:19 ` [PATCH 3/7] mm: vmscan: kick flushers when we encounter dirty pages on the LRU fix Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-02 19:19 ` [PATCH 5/7] mm: vmscan: only write dirty pages that the scanner has seen twice Johannes Weiner
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Direct reclaim has been replaced by kswapd reclaim in pretty much all
common memory pressure situations, so this code most likely doesn't
accomplish the described effect anymore.  The previous patch wakes up
flushers for all reclaimers when we encounter dirty pages at the tail end
of the LRU.  Remove the crufty old direct reclaim invocation.

Link: http://lkml.kernel.org/r/20170123181641.23938-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/vmscan.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 83c92b866afe..ce2ee8331414 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2757,8 +2757,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					  struct scan_control *sc)
 {
 	int initial_priority = sc->priority;
-	unsigned long total_scanned = 0;
-	unsigned long writeback_threshold;
 retry:
 	delayacct_freepages_start();
 
@@ -2771,7 +2769,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		shrink_zones(zonelist, sc);
 
-		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			break;
 
@@ -2784,20 +2781,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (sc->priority < DEF_PRIORITY - 2)
 			sc->may_writepage = 1;
-
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
-		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_VMSCAN);
-			sc->may_writepage = 1;
-		}
 	} while (--sc->priority >= 0);
 
 	delayacct_freepages_end();
-- 
2.11.0


* [PATCH 5/7] mm: vmscan: only write dirty pages that the scanner has seen twice
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
                   ` (3 preceding siblings ...)
  2017-02-02 19:19 ` [PATCH 4/7] mm: vmscan: remove old flusher wakeup from direct reclaim path Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-02 19:19 ` [PATCH 6/7] mm: vmscan: move dirty pages out of the way until they're flushed Johannes Weiner
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Dirty pages can easily reach the end of the LRU while there are still
clean pages to reclaim around.  Don't let kswapd write them back just
because there are a lot of them.  It costs more CPU to find the clean
pages, but that's almost certainly better than to disrupt writeback from
the flushers with LRU-order single-page writes from reclaim.  And the
flushers have been woken up by that point, so we spend IO capacity on
flushing and CPU capacity on finding the clean cache.

Only start writing dirty pages if they have cycled around the LRU twice
now and STILL haven't been queued on the IO device.  It's possible that
the dirty pages are so sparsely distributed across different bdis, inodes,
and memory cgroups that the flushers take forever to get to the ones we want
reclaimed.  Once we see them twice on the LRU, we know that's the quicker
way to find them, so do LRU writeback.

Link: http://lkml.kernel.org/r/20170123181641.23938-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 mm/vmscan.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ce2ee8331414..92e56cadceae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1153,13 +1153,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		if (PageDirty(page)) {
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but only writeback
-			 * if many dirty pages have been encountered.
+			 * Only kswapd can writeback filesystem pages
+			 * to avoid risk of stack overflow. But avoid
+			 * injecting inefficient single-page IO into
+			 * flusher writeback as much as possible: only
+			 * write pages when we've encountered many
+			 * dirty pages, and when we've already scanned
+			 * the rest of the LRU for clean pages and see
+			 * the same dirty pages again (PageReclaim).
 			 */
 			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
+			    (!current_is_kswapd() || !PageReclaim(page) ||
+			     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
-- 
2.11.0


* [PATCH 6/7] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
                   ` (4 preceding siblings ...)
  2017-02-02 19:19 ` [PATCH 5/7] mm: vmscan: only write dirty pages that the scanner has seen twice Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-03  7:42   ` Hillf Danton
  2017-02-02 19:19 ` [PATCH 7/7] mm: vmscan: move dirty pages out of the way until they're flushed fix Johannes Weiner
  2017-02-02 22:49 ` [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Andrew Morton
  7 siblings, 1 reply; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

We noticed a performance regression when moving hadoop workloads from 3.10
kernels to 4.0 and 4.6.  This is accompanied by increased pageout activity
initiated by kswapd as well as frequent bursts of allocation stalls and
direct reclaim scans.  Even lowering the dirty ratios to the equivalent of
less than 1% of memory would not eliminate the issue, suggesting that
dirty pages concentrate where the scanner is looking.

This can be traced back to recent efforts of thrash avoidance.  Where 3.10
would not detect refaulting pages and continuously supply clean cache to
the inactive list, a thrashing workload on 4.0+ will detect and activate
refaulting pages right away, distilling used-once pages on the inactive
list much more effectively.  This is by design, and it makes sense for
clean cache.  But for the most part our workload's cache faults are
refaults and its use-once cache is from streaming writes.  We end up with
most of the inactive list dirty, and we don't go after the active cache as
long as we have use-once pages around.

But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off.  Even if the refaults happen, reads are faster
than writes.  Before getting bogged down on writeback, reclaim should
first look at *all* cache in the system, even active cache.

To accomplish this, activate pages that are dirty or under writeback
when they reach the end of the inactive LRU.  The pages are marked for
immediate reclaim, meaning they'll get moved back to the inactive LRU
tail as soon as they're written back and become reclaimable.  But in
the meantime, by reducing the inactive list to only immediately
reclaimable pages, we allow the scanner to deactivate and refill the
inactive list with clean cache from the active list tail to guarantee
forward progress.
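
The "moved back to the inactive LRU tail as soon as they're written back"
part relies on the existing PG_reclaim rotation at writeback completion,
which this diff does not touch; roughly, paraphrased from that era's
mm/filemap.c for illustration:

	void end_page_writeback(struct page *page)
	{
		/*
		 * Pages activated by this patch still carry PG_reclaim, so
		 * once the IO finishes they are rotated straight back to
		 * the tail of the inactive list and reclaimed on the next
		 * scan (see the pagevec_move_tail_fn() hunk below).
		 */
		if (PageReclaim(page)) {
			ClearPageReclaim(page);
			rotate_reclaimable_page(page);
		}

		if (!test_clear_page_writeback(page))
			BUG();

		smp_mb__after_atomic();
		wake_up_page(page, PG_writeback);
	}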

Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm_inline.h | 7 +++++++
 mm/swap.c                 | 9 +++++----
 mm/vmscan.c               | 6 +++---
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 41d376e7116d..e030a68ead7e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -50,6 +50,13 @@ static __always_inline void add_page_to_lru_list(struct page *page,
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
+static __always_inline void add_page_to_lru_list_tail(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
+{
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	list_add_tail(&page->lru, &lruvec->lists[lru]);
+}
+
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
diff --git a/mm/swap.c b/mm/swap.c
index aabf2e90fe32..c4910f14f957 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,9 +209,10 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
 {
 	int *pgmoved = arg;
 
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+	if (PageLRU(page) && !PageUnevictable(page)) {
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		ClearPageActive(page);
+		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
 		(*pgmoved)++;
 	}
 }
@@ -235,7 +236,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
  */
 void rotate_reclaimable_page(struct page *page)
 {
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	if (!PageLocked(page) && !PageDirty(page) &&
 	    !PageUnevictable(page) && PageLRU(page)) {
 		struct pagevec *pvec;
 		unsigned long flags;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92e56cadceae..70103f411247 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			    PageReclaim(page) &&
 			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 2 above */
 			} else if (sane_reclaim(sc) ||
@@ -1081,7 +1081,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 3 above */
 			} else {
@@ -1174,7 +1174,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
-				goto keep_locked;
+				goto activate_locked;
 			}
 
 			if (references == PAGEREF_RECLAIM_CLEAN)
-- 
2.11.0


* [PATCH 7/7] mm: vmscan: move dirty pages out of the way until they're flushed fix
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
                   ` (5 preceding siblings ...)
  2017-02-02 19:19 ` [PATCH 6/7] mm: vmscan: move dirty pages out of the way until they're flushed Johannes Weiner
@ 2017-02-02 19:19 ` Johannes Weiner
  2017-02-02 22:49 ` [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Andrew Morton
  7 siblings, 0 replies; 11+ messages in thread
From: Johannes Weiner @ 2017-02-02 19:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

Mention the trade-off between waiting for writeback and potentially
causing hot cache refaults in the code where we make this decision and
activate writeback pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70103f411247..ae3d982216b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1056,6 +1056,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 *    throttling so we could easily OOM just because too many
 		 *    pages are in writeback and there is nothing else to
 		 *    reclaim. Wait for the writeback to complete.
+		 *
+		 * In cases 1) and 2) we activate the pages to get them out of
+		 * the way while we continue scanning for clean pages on the
+		 * inactive list and refilling from the active list. The
+		 * observation here is that waiting for disk writes is more
+		 * expensive than potentially causing reloads down the line.
+		 * Since they're marked for immediate reclaim, they won't put
+		 * memory pressure on the cache working set any longer than it
+		 * takes to write them to disk.
 		 */
 		if (PageWriteback(page)) {
 			/* Case 1 above */
-- 
2.11.0


* Re: [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2
  2017-02-02 19:19 [PATCH 0/7] mm: vmscan: fix kswapd writeback regression v2 Johannes Weiner
                   ` (6 preceding siblings ...)
  2017-02-02 19:19 ` [PATCH 7/7] mm: vmscan: move dirty pages out of the way until they're flushed fix Johannes Weiner
@ 2017-02-02 22:49 ` Andrew Morton
  7 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2017-02-02 22:49 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Michal Hocko, Minchan Kim, Rik van Riel,
	Hillf Danton, linux-mm, linux-kernel

On Thu,  2 Feb 2017 14:19:50 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> here are some minor updates to the series. It's nothing functional,
> just code comments and updates to the changelogs from the mailing list
> discussions. Since we don't have a good delta system for changelogs
> I'm resending the entire thing as a drop-in replacement for -mm.

Thanks, I updated the changelogs in place.


* Re: [PATCH 6/7] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-02-02 19:19 ` [PATCH 6/7] mm: vmscan: move dirty pages out of the way until they're flushed Johannes Weiner
@ 2017-02-03  7:42   ` Hillf Danton
  2017-02-03 15:15     ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Hillf Danton @ 2017-02-03  7:42 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Mel Gorman', 'Michal Hocko',
	'Minchan Kim', 'Rik van Riel',
	linux-mm, linux-kernel


On February 03, 2017 3:20 AM Johannes Weiner wrote: 
> @@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			    PageReclaim(page) &&
>  			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>  				nr_immediate++;
> -				goto keep_locked;
> +				goto activate_locked;

Off topic but relevant IMHO: I can't find where PGDAT_WRITEBACK is cleared by grepping:

$ grep -nr PGDAT_WRITEBACK  linux-4.9/mm
linux-4.9/mm/vmscan.c:1019:	test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
linux-4.9/mm/vmscan.c:1777:	set_bit(PGDAT_WRITEBACK, &pgdat->flags);

It was removed in commit 1d82de618dd
("mm, vmscan: make kswapd reclaim in terms of nodes").

Is it currently maintained somewhere else, Mel and John?

thanks
Hillf


* Re: [PATCH 6/7] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-02-03  7:42   ` Hillf Danton
@ 2017-02-03 15:15     ` Michal Hocko
  0 siblings, 0 replies; 11+ messages in thread
From: Michal Hocko @ 2017-02-03 15:15 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Johannes Weiner', 'Andrew Morton',
	'Mel Gorman', 'Minchan Kim',
	'Rik van Riel',
	linux-mm, linux-kernel

On Fri 03-02-17 15:42:55, Hillf Danton wrote:
> 
> On February 03, 2017 3:20 AM Johannes Weiner wrote: 
> > @@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			    PageReclaim(page) &&
> >  			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
> >  				nr_immediate++;
> > -				goto keep_locked;
> > +				goto activate_locked;
> 
> Off topic but relevant IMHO: I can't find where PGDAT_WRITEBACK is cleared by grepping:
> 
> $ grep -nr PGDAT_WRITEBACK  linux-4.9/mm
> linux-4.9/mm/vmscan.c:1019:	test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
> linux-4.9/mm/vmscan.c:1777:	set_bit(PGDAT_WRITEBACK, &pgdat->flags);

I would just get rid of this flag.

-- 
Michal Hocko
SUSE Labs
