* [PATCH 0/5] mm: vmscan: fix kswapd writeback regression
From: Johannes Weiner @ 2017-01-23 18:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel, kernel-team

We noticed a regression on multiple hadoop workloads when moving from
3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page
writeout, causing direct reclaim herds that also don't make progress.

I tracked it down to the thrash avoidance efforts after 3.10 that make
the kernel better at keeping use-once cache and use-many cache sorted
on the inactive and active list, with more aggressive protection of
the active list as long as there is inactive cache. Unfortunately, our
workload's use-once cache is mostly from streaming writes. Waiting for
writes to avoid potential reloads in the future is not a good tradeoff.

These patches do the following:

1. Wake the flushers when kswapd sees a lump of dirty pages. It's
   possible to be below the dirty background limit and still have
   cache velocity push them through the LRU. So start a-flushin'.

2. Let kswapd only write pages that have been rotated twice. This
   makes sure we really tried to get all the clean pages on the
   inactive list before resorting to horrible LRU-order writeback.

3. Move rotating dirty pages off the inactive list. Instead of
   churning or waiting on page writeback, we'll go after clean active
   cache. This might lead to thrashing, but in this state memory
   demand outstrips IO speed anyway, and reads are faster than writes.

More details in the individual changelogs.
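
To make the combined effect easier to follow, here is a minimal
userspace sketch of the new behavior. It is illustrative only: the
names merely mirror the kernel's page flags and node flag, and all
locking, counters and the non-kswapd cases are left out.

#include <stdbool.h>
#include <stdio.h>

struct page { bool dirty; bool reclaim; };

static bool node_dirty;        /* stands in for the PGDAT_DIRTY node flag */

/* Changes 1+2: when a whole scanned batch is dirty and unqueued,
 * nudge the flusher threads instead of waiting on kswapd writeback. */
static void account_batch(int nr_unqueued_dirty, int nr_taken)
{
        if (nr_unqueued_dirty == nr_taken) {
                puts("wakeup_flusher_threads(WB_REASON_VMSCAN)");
                node_dirty = true;
        }
}

/* Changes 2+3: per-page decision for a dirty file page seen by kswapd. */
static const char *kswapd_sees_dirty_page(struct page *page)
{
        if (!page->dirty)
                return "clean: reclaim it";
        if (!page->reclaim || !node_dirty) {
                page->reclaim = true;   /* remember we met it once */
                return "activate out of the way; let the flushers work";
        }
        return "pageout(): rotated twice and still not queued for IO";
}

int main(void)
{
        struct page p = { .dirty = true };

        account_batch(32, 32);                  /* whole batch was dirty */
        puts(kswapd_sees_dirty_page(&p));       /* first pass */
        puts(kswapd_sees_dirty_page(&p));       /* second pass */
        return 0;
}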

 include/linux/mm_inline.h        |  7 ++++
 include/linux/mmzone.h           |  2 --
 include/linux/writeback.h        |  2 +-
 include/trace/events/writeback.h |  2 +-
 mm/swap.c                        |  9 ++---
 mm/vmscan.c                      | 68 +++++++++++++++-----------------------
 6 files changed, 41 insertions(+), 49 deletions(-)

* [PATCH 1/5] mm: vmscan: scan dirty pages even in laptop mode
From: Johannes Weiner @ 2017-01-23 18:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel, kernel-team

We have an elaborate dirty/writeback throttling mechanism inside the
reclaim scanner, but for that to work the pages have to go through
shrink_page_list() and get counted for what they are. Otherwise, we
mess up the LRU order and don't match reclaim speed to writeback.

Especially during deactivation, there is never a reason to skip dirty
pages; nothing is even trying to write them out from there. Don't mess
up the LRU order for nothing, shuffle these pages along.
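
As a toy illustration of the accounting argument (plain userspace C,
not the kernel code; the counter names are only borrowed for
readability), skipping dirty pages at isolation time keeps them out
of the counts that drive the throttling:

#include <stdbool.h>
#include <stdio.h>

struct page { bool dirty; };

static void scan_batch(const struct page *lru, int n, bool skip_dirty)
{
        int nr_taken = 0, nr_unqueued_dirty = 0;

        for (int i = 0; i < n; i++) {
                if (skip_dirty && lru[i].dirty)
                        continue;               /* old ISOLATE_CLEAN behaviour */
                nr_taken++;
                if (lru[i].dirty)
                        nr_unqueued_dirty++;    /* what shrink_page_list() sees */
        }
        printf("taken=%d dirty=%d -> throttling %s\n", nr_taken,
               nr_unqueued_dirty,
               nr_taken && nr_unqueued_dirty == nr_taken ?
               "kicks in" : "never sees the dirty pages");
}

int main(void)
{
        const struct page lru[4] = { {true}, {true}, {true}, {true} };

        scan_batch(lru, 4, true);       /* with ISOLATE_CLEAN */
        scan_batch(lru, 4, false);      /* after this patch */
        return 0;
}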

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  2 --
 mm/vmscan.c            | 14 ++------------
 2 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index df992831fde7..338a786a993f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -236,8 +236,6 @@ struct lruvec {
 #define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
 #define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
 
-/* Isolate clean file */
-#define ISOLATE_CLEAN		((__force isolate_mode_t)0x1)
 /* Isolate unmapped file */
 #define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x2)
 /* Isolate for asynchronous migration */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7bb23ff229b6..0d05f7f3b532 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -87,6 +87,7 @@ struct scan_control {
 	/* The highest zone to isolate pages for reclaim from */
 	enum zone_type reclaim_idx;
 
+	/* Writepage batching in laptop mode; RECLAIM_WRITE */
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1373,13 +1374,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 	 * wants to isolate pages it will be able to operate on without
 	 * blocking - clean pages for the most part.
 	 *
-	 * ISOLATE_CLEAN means that only clean pages should be isolated. This
-	 * is used by reclaim when it is cannot write to backing storage
-	 *
 	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
 	 * that it is possible to migrate without blocking
 	 */
-	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+	if (mode & ISOLATE_ASYNC_MIGRATE) {
 		/* All the caller can do on PageWriteback is block */
 		if (PageWriteback(page))
 			return ret;
@@ -1387,10 +1385,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
 		if (PageDirty(page)) {
 			struct address_space *mapping;
 
-			/* ISOLATE_CLEAN means only clean pages */
-			if (mode & ISOLATE_CLEAN)
-				return ret;
-
 			/*
 			 * Only pages without mappings or that have a
 			 * ->migratepage callback are possible to migrate
@@ -1731,8 +1725,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	if (!sc->may_unmap)
 		isolate_mode |= ISOLATE_UNMAPPED;
-	if (!sc->may_writepage)
-		isolate_mode |= ISOLATE_CLEAN;
 
 	spin_lock_irq(&pgdat->lru_lock);
 
@@ -1929,8 +1921,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	if (!sc->may_unmap)
 		isolate_mode |= ISOLATE_UNMAPPED;
-	if (!sc->may_writepage)
-		isolate_mode |= ISOLATE_CLEAN;
 
 	spin_lock_irq(&pgdat->lru_lock);
 
-- 
2.11.0

* [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
From: Johannes Weiner @ 2017-01-23 18:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel, kernel-team

Memory pressure can put dirty pages at the end of the LRU without
anybody running into dirty limits. Don't start writing individual
pages from kswapd while the flushers might be asleep.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/writeback.h        |  2 +-
 include/trace/events/writeback.h |  2 +-
 mm/vmscan.c                      | 18 +++++++++++++-----
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5527d910ba3d..a3c0cbd7c888 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -46,7 +46,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_VMSCAN,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 2ccd9ccbf9ef..7bd8783a590f 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -31,7 +31,7 @@
 
 #define WB_WORK_REASON							\
 	EM( WB_REASON_BACKGROUND,		"background")		\
-	EM( WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages")	\
+	EM( WB_REASON_VMSCAN,			"vmscan")		\
 	EM( WB_REASON_SYNC,			"sync")			\
 	EM( WB_REASON_PERIODIC,			"periodic")		\
 	EM( WB_REASON_LAPTOP_TIMER,		"laptop_timer")		\
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0d05f7f3b532..56ea8d24041f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1798,12 +1798,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
-		 * implies that flushers are not keeping up. In this case, flag
-		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
-		 * reclaim context.
+		 * implies that flushers are not doing their job. This can
+		 * happen when memory pressure pushes dirty pages to the end
+		 * of the LRU without the dirty limits being breached. It can
+		 * also happen when the proportion of dirty pages grows not
+		 * through writes but through memory pressure reclaiming all
+		 * the clean cache. And in some cases, the flushers simply
+		 * cannot keep up with the allocation rate. Nudge the flusher
+		 * threads in case they are asleep, but also allow kswapd to
+		 * start writing pages during reclaim.
 		 */
-		if (stat.nr_unqueued_dirty == nr_taken)
+		if (stat.nr_unqueued_dirty == nr_taken) {
+			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
 			set_bit(PGDAT_DIRTY, &pgdat->flags);
+		}
 
 		/*
 		 * If kswapd scans pages marked marked for immediate
@@ -2787,7 +2795,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
+						WB_REASON_VMSCAN);
 			sc->may_writepage = 1;
 		}
 	} while (--sc->priority >= 0);
-- 
2.11.0

* [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
From: Johannes Weiner @ 2017-01-23 18:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel, kernel-team

Direct reclaim has been replaced by kswapd reclaim in pretty much all
common memory pressure situations, so this code most likely doesn't
accomplish the described effect anymore. The previous patch wakes up
flushers for all reclaimers when we encounter dirty pages at the tail
end of the LRU. Remove the crufty old direct reclaim invocation.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56ea8d24041f..915fc658de41 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2757,8 +2757,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					  struct scan_control *sc)
 {
 	int initial_priority = sc->priority;
-	unsigned long total_scanned = 0;
-	unsigned long writeback_threshold;
 retry:
 	delayacct_freepages_start();
 
@@ -2771,7 +2769,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		shrink_zones(zonelist, sc);
 
-		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			break;
 
@@ -2784,20 +2781,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (sc->priority < DEF_PRIORITY - 2)
 			sc->may_writepage = 1;
-
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
-		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_VMSCAN);
-			sc->may_writepage = 1;
-		}
 	} while (--sc->priority >= 0);
 
 	delayacct_freepages_end();
-- 
2.11.0

* [PATCH 4/5] mm: vmscan: only write dirty pages that the scanner has seen twice
From: Johannes Weiner @ 2017-01-23 18:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel, kernel-team

Dirty pages can easily reach the end of the LRU while there are still
clean pages to reclaim around. Don't let kswapd write them back just
because there are a lot of them. It costs more CPU to find the clean
pages, but that's almost certainly better than to disrupt writeback
from the flushers with LRU-order single-page writes from reclaim. And
the flushers have been woken up by that point, so we spend IO capacity
on flushing and CPU capacity on finding the clean cache.

Only start writing dirty pages if they have cycled around the LRU
twice now and STILL haven't been queued on the IO device. It's
possible that the dirty pages are so sparsely distributed across
different bdis, inodes, memory cgroups, that the flushers take forever
to get to the ones we want reclaimed. Once we see them twice on the
LRU, we know that's the quicker way to find them, so do LRU writeback.
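
Stated as a standalone predicate, the rule this patch arrives at
looks roughly like the sketch below. The helper name and arguments
are hypothetical; the real check lives inside shrink_page_list().

#include <stdbool.h>
#include <stdio.h>

/* For a dirty file-cache page scanned by kswapd: write it back from
 * reclaim only if the scanner already met it once (PageReclaim) and
 * the node is flagged PGDAT_DIRTY. */
static bool kswapd_writes_file_page(bool seen_before, bool node_flagged_dirty)
{
        return seen_before && node_flagged_dirty;
}

int main(void)
{
        printf("first pass:  %d\n", kswapd_writes_file_page(false, true));
        printf("second pass: %d\n", kswapd_writes_file_page(true, true));
        return 0;
}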

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 915fc658de41..df0fe0cc438e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1153,13 +1153,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		if (PageDirty(page)) {
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but only writeback
-			 * if many dirty pages have been encountered.
+			 * Only kswapd can writeback filesystem pages
+			 * to avoid risk of stack overflow. But avoid
+			 * injecting inefficient single-page IO into
+			 * flusher writeback as much as possible: only
+			 * write pages when we've encountered many
+			 * dirty pages, and when we've already scanned
+			 * the rest of the LRU for clean pages and see
+			 * the same dirty pages again (PageReclaim).
 			 */
 			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
+			    (!current_is_kswapd() || !PageReclaim(page) ||
+			     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
-- 
2.11.0

* [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
From: Johannes Weiner @ 2017-01-23 18:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel, kernel-team

We noticed a performance regression when moving hadoop workloads from
3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
activity initiated by kswapd as well as frequent bursts of allocation
stalls and direct reclaim scans. Even lowering the dirty ratios to the
equivalent of less than 1% of memory would not eliminate the issue,
suggesting that dirty pages concentrate where the scanner is looking.

This can be traced back to recent efforts of thrash avoidance. Where
3.10 would not detect refaulting pages and continuously supply clean
cache to the inactive list, a thrashing workload on 4.0+ will detect
and activate refaulting pages right away, distilling used-once pages
on the inactive list much more effectively. This is by design, and it
makes sense for clean cache. But for the most part our workload's
cache faults are refaults and its use-once cache is from streaming
writes. We end up with most of the inactive list dirty, and we don't
go after the active cache as long as we have use-once pages around.

But waiting for writes to avoid reclaiming clean cache that *might*
refault is a bad trade-off. Even if the refaults happen, reads are
faster than writes. Before getting bogged down on writeback, reclaim
should first look at *all* cache in the system, even active cache.

To accomplish this, activate pages that have been dirty or under
writeback for two inactive LRU cycles. We know at this point that
there are not enough clean inactive pages left to satisfy memory
demand in the system. The pages are marked for immediate reclaim,
meaning they'll get moved back to the inactive LRU tail as soon as
they're written back and become reclaimable. But in the meantime, by
reducing the inactive list to only immediately reclaimable pages, we
allow the scanner to deactivate and refill the inactive list with
clean cache from the active list tail to guarantee forward progress.
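
A toy userspace model of that lifecycle (the flag and helper names
only mirror the kernel's; this is not kernel code) might look like
this:

#include <stdbool.h>
#include <stdio.h>

enum which_lru { LRU_INACTIVE, LRU_ACTIVE };

struct page {
        enum which_lru lru;
        bool dirty, writeback, reclaim, active;
};

/* shrink_page_list() giving up on a page that rotated while dirty. */
static void park_on_active_list(struct page *p)
{
        p->reclaim = true;              /* SetPageReclaim()     */
        p->active = true;               /* goto activate_locked */
        p->lru = LRU_ACTIVE;
}

/* What rotate_reclaimable_page() does once writeback finishes:
 * PG_active no longer disqualifies the page from being rotated. */
static void writeback_finished(struct page *p)
{
        p->dirty = p->writeback = false;
        if (p->reclaim) {
                p->active = false;      /* ClearPageActive()            */
                p->lru = LRU_INACTIVE;  /* add_page_to_lru_list_tail()  */
                p->reclaim = false;
        }
}

int main(void)
{
        struct page p = { .lru = LRU_INACTIVE, .dirty = true };

        park_on_active_list(&p);
        printf("parked: %s, reclaim=%d\n",
               p.lru == LRU_ACTIVE ? "active list" : "inactive list",
               p.reclaim);

        p.writeback = true;             /* the flushers pick it up */
        writeback_finished(&p);
        printf("done:   %s, active=%d\n",
               p.lru == LRU_ACTIVE ? "active list" : "inactive list",
               p.active);
        return 0;
}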

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mm_inline.h | 7 +++++++
 mm/swap.c                 | 9 +++++----
 mm/vmscan.c               | 6 +++---
 3 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 41d376e7116d..e030a68ead7e 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -50,6 +50,13 @@ static __always_inline void add_page_to_lru_list(struct page *page,
 	list_add(&page->lru, &lruvec->lists[lru]);
 }
 
+static __always_inline void add_page_to_lru_list_tail(struct page *page,
+				struct lruvec *lruvec, enum lru_list lru)
+{
+	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
+	list_add_tail(&page->lru, &lruvec->lists[lru]);
+}
+
 static __always_inline void del_page_from_lru_list(struct page *page,
 				struct lruvec *lruvec, enum lru_list lru)
 {
diff --git a/mm/swap.c b/mm/swap.c
index aabf2e90fe32..c4910f14f957 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -209,9 +209,10 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
 {
 	int *pgmoved = arg;
 
-	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
-		enum lru_list lru = page_lru_base_type(page);
-		list_move_tail(&page->lru, &lruvec->lists[lru]);
+	if (PageLRU(page) && !PageUnevictable(page)) {
+		del_page_from_lru_list(page, lruvec, page_lru(page));
+		ClearPageActive(page);
+		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
 		(*pgmoved)++;
 	}
 }
@@ -235,7 +236,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
  */
 void rotate_reclaimable_page(struct page *page)
 {
-	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
+	if (!PageLocked(page) && !PageDirty(page) &&
 	    !PageUnevictable(page) && PageLRU(page)) {
 		struct pagevec *pvec;
 		unsigned long flags;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df0fe0cc438e..947ab6f4db10 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			    PageReclaim(page) &&
 			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
 				nr_immediate++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 2 above */
 			} else if (sane_reclaim(sc) ||
@@ -1081,7 +1081,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 */
 				SetPageReclaim(page);
 				nr_writeback++;
-				goto keep_locked;
+				goto activate_locked;
 
 			/* Case 3 above */
 			} else {
@@ -1174,7 +1174,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
 				SetPageReclaim(page);
 
-				goto keep_locked;
+				goto activate_locked;
 			}
 
 			if (references == PAGEREF_RECLAIM_CLEAN)
-- 
2.11.0

* Re: [PATCH 1/5] mm: vmscan: scan dirty pages even in laptop mode
From: Minchan Kim @ 2017-01-26  1:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:37PM -0500, Johannes Weiner wrote:
> We have an elaborate dirty/writeback throttling mechanism inside the
> reclaim scanner, but for that to work the pages have to go through
> shrink_page_list() and get counted for what they are. Otherwise, we
> mess up the LRU order and don't match reclaim speed to writeback.
> 
> Especially during deactivation, there is never a reason to skip dirty
> pages; nothing is even trying to write them out from there. Don't mess
> up the LRU order for nothing, shuffle these pages along.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

* Re: [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
From: Minchan Kim @ 2017-01-26  1:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:38PM -0500, Johannes Weiner wrote:
> Memory pressure can put dirty pages at the end of the LRU without
> anybody running into dirty limits. Don't start writing individual
> pages from kswapd while the flushers might be asleep.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
From: Minchan Kim @ 2017-01-26  1:38 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> Direct reclaim has been replaced by kswapd reclaim in pretty much all
> common memory pressure situations, so this code most likely doesn't
> accomplish the described effect anymore. The previous patch wakes up
> flushers for all reclaimers when we encounter dirty pages at the tail
> end of the LRU. Remove the crufty old direct reclaim invocation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

* Re: [PATCH 4/5] mm: vmscan: only write dirty pages that the scanner has seen twice
From: Minchan Kim @ 2017-01-26  1:42 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:40PM -0500, Johannes Weiner wrote:
> Dirty pages can easily reach the end of the LRU while there are still
> clean pages to reclaim around. Don't let kswapd write them back just
> because there are a lot of them. It costs more CPU to find the clean
> pages, but that's almost certainly better than to disrupt writeback
> from the flushers with LRU-order single-page writes from reclaim. And
> the flushers have been woken up by that point, so we spend IO capacity
> on flushing and CPU capacity on finding the clean cache.
> 
> Only start writing dirty pages if they have cycled around the LRU
> twice now and STILL haven't been queued on the IO device. It's
> possible that the dirty pages are so sparsely distributed across
> different bdis, inodes, memory cgroups, that the flushers take forever
> to get to the ones we want reclaimed. Once we see them twice on the
> LRU, we know that's the quicker way to find them, so do LRU writeback.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

* Re: [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
From: Minchan Kim @ 2017-01-26  1:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 
> This can be traced back to recent efforts of thrash avoidance. Where
> 3.10 would not detect refaulting pages and continuously supply clean
> cache to the inactive list, a thrashing workload on 4.0+ will detect
> and activate refaulting pages right away, distilling used-once pages
> on the inactive list much more effectively. This is by design, and it
> makes sense for clean cache. But for the most part our workload's
> cache faults are refaults and its use-once cache is from streaming
> writes. We end up with most of the inactive list dirty, and we don't
> go after the active cache as long as we have use-once pages around.
> 
> But waiting for writes to avoid reclaiming clean cache that *might*
> refault is a bad trade-off. Even if the refaults happen, reads are
> faster than writes. Before getting bogged down on writeback, reclaim
> should first look at *all* cache in the system, even active cache.
> 
> To accomplish this, activate pages that have been dirty or under
> writeback for two inactive LRU cycles. We know at this point that
> there are not enough clean inactive pages left to satisfy memory
> demand in the system. The pages are marked for immediate reclaim,
> meaning they'll get moved back to the inactive LRU tail as soon as
> they're written back and become reclaimable. But in the meantime, by
> reducing the inactive list to only immediately reclaimable pages, we
> allow the scanner to deactivate and refill the inactive list with
> clean cache from the active list tail to guarantee forward progress.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

Every patch looks reasonable to me.

* Re: [PATCH 0/5] mm: vmscan: fix kswapd writeback regression
From: Hillf Danton @ 2017-01-26  5:44 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Mel Gorman', linux-mm, linux-kernel, kernel-team


On January 24, 2017 2:17 AM Johannes Weiner wrote:
> 
> We noticed a regression on multiple hadoop workloads when moving from
> 3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page
> writeout, causing direct reclaim herds that also don't make progress.
> 
> I tracked it down to the thrash avoidance efforts after 3.10 that make
> the kernel better at keeping use-once cache and use-many cache sorted
> on the inactive and active list, with more aggressive protection of
> the active list as long as there is inactive cache. Unfortunately, our
> workload's use-once cache is mostly from streaming writes. Waiting for
> writes to avoid potential reloads in the future is not a good tradeoff.
> 
> These patches do the following:
> 
> 1. Wake the flushers when kswapd sees a lump of dirty pages. It's
>    possible to be below the dirty background limit and still have
>    cache velocity push them through the LRU. So start a-flushin'.
> 
> 2. Let kswapd only write pages that have been rotated twice. This
>    makes sure we really tried to get all the clean pages on the
>    inactive list before resorting to horrible LRU-order writeback.
> 
> 3. Move rotating dirty pages off the inactive list. Instead of
>    churning or waiting on page writeback, we'll go after clean active
>    cache. This might lead to thrashing, but in this state memory
>    demand outstrips IO speed anyway, and reads are faster than writes.
> 
> More details in the individual changelogs.
> 
>  include/linux/mm_inline.h        |  7 ++++
>  include/linux/mmzone.h           |  2 --
>  include/linux/writeback.h        |  2 +-
>  include/trace/events/writeback.h |  2 +-
>  mm/swap.c                        |  9 ++---
>  mm/vmscan.c                      | 68 +++++++++++++++-----------------------
>  6 files changed, 41 insertions(+), 49 deletions(-)
> 
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

* Re: [PATCH 1/5] mm: vmscan: scan dirty pages even in laptop mode
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26  9:52     ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26  9:52 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:37PM -0500, Johannes Weiner wrote:
> We have an elaborate dirty/writeback throttling mechanism inside the
> reclaim scanner, but for that to work the pages have to go through
> shrink_page_list() and get counted for what they are. Otherwise, we
> mess up the LRU order and don't match reclaim speed to writeback.
> 
> Especially during deactivation, there is never a reason to skip dirty
> pages; nothing is even trying to write them out from there. Don't mess
> up the LRU order for nothing, shuffle these pages along.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26  9:57     ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26  9:57 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:38PM -0500, Johannes Weiner wrote:
> Memory pressure can put dirty pages at the end of the LRU without
> anybody running into dirty limits. Don't start writing individual
> pages from kswapd while the flushers might be asleep.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I don't understand the motivation for changing the wb_reason name. Maybe
it was easier to eyeball while reading ftrace output. The comment about the
flusher not doing its job could also be explained by something as simple as
the writes taking place and clean pages being reclaimed before dirty_expire
was reached. Not impossible if there was a light writer combined with a
heavy reader or a large number of anonymous faults.

Anyway;

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 10:05     ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26 10:05 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> Direct reclaim has been replaced by kswapd reclaim in pretty much all
> common memory pressure situations, so this code most likely doesn't
> accomplish the described effect anymore. The previous patch wakes up
> flushers for all reclaimers when we encounter dirty pages at the tail
> end of the LRU. Remove the crufty old direct reclaim invocation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

In general I like this. My first worry was that if kswapd is blocked
writing pages it won't reach the wakeup_flusher_threads call, but the
previous patch handles that.

Now though, it occurs to me that with the last patch we always write out
the world when waking the flusher threads. This may not be a great idea.
Consider for example a heavy writer of short-lived tmp files. In such a
case, it is possible for the files to be truncated before they ever hit the
disk. However, if there are multiple "write out the world" calls, those
files may now be hitting the disk. Furthermore, multiple kswapd and direct
reclaimers could all be asked to write out the world, and each request
unplugs.

Is it possible to maintain the property of writing back pages in
proportion to the number of pages scanned, or have you already determined
that it's not necessary?
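
(For concreteness, the kind of thing I mean is keeping a scan-proportional
nr_pages hint instead of passing 0 -- a rough sketch only; the helper name
and the factor of two are made up, and where it would be called from is
left open:)

	/*
	 * Sketch: ask the flushers for an amount of writeback proportional
	 * to what reclaim just scanned, instead of writing out the world.
	 */
	static void wake_flushers_scaled(unsigned long nr_scanned)
	{
		long nr_pages = 2 * nr_scanned;

		/* nr_pages == 0 means "write out everything" */
		if (nr_pages)
			wakeup_flusher_threads(nr_pages, WB_REASON_VMSCAN);
	}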

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 4/5] mm: vmscan: only write dirty pages that the scanner has seen twice
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 10:08     ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26 10:08 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:40PM -0500, Johannes Weiner wrote:
> Dirty pages can easily reach the end of the LRU while there are still
> clean pages to reclaim around. Don't let kswapd write them back just
> because there are a lot of them. It costs more CPU to find the clean
> pages, but that's almost certainly better than to disrupt writeback
> from the flushers with LRU-order single-page writes from reclaim. And
> the flushers have been woken up by that point, so we spend IO capacity
> on flushing and CPU capacity on finding the clean cache.
> 
> Only start writing dirty pages if they have cycled around the LRU
> twice now and STILL haven't been queued on the IO device. It's
> possible that the dirty pages are so sparsely distributed across
> different bdis, inodes, memory cgroups, that the flushers take forever
> to get to the ones we want reclaimed. Once we see them twice on the
> LRU, we know that's the quicker way to find them, so do LRU writeback.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 10:19     ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26 10:19 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 

Note that some of this is also impacted by
bbddabe2e436aa7869b3ac5248df5c14ddde0cbf because it can have the effect
of dirty pages reaching the end of the LRU sooner if they are being
written. It's not impossible that hadoop is rewriting the same files,
hitting the end of the LRU due to no reads and then throwing reclaim
into a hole.

I've seen a few cases where random write-only workloads regressed, and it
came down to whether the random number generator was selecting the same
pages. With that commit, the LRU was effectively LIFO.

Similarly, I'd seen a case where a database whose working set was larger
than the shared memory area regressed because the spill-over from the
database buffer to RAM was not being preserved, since it was all writes.
That said, the same patch prevents the database being swapped, so it's not
all bad, but there have been consequences.

I don't have a problem with the patch, although I would prefer to have
seen more data for the series. However, I'm not entirely convinced that
thrash detection was the only problem. I think not activating pages on
write was a contributing factor, although this patch looks better than
considering a revert of bbddabe2e436aa7869b3ac5248df5c14ddde0cbf.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/5] mm: vmscan: scan dirty pages even in laptop mode
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 13:13     ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-26 13:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon 23-01-17 13:16:37, Johannes Weiner wrote:
> We have an elaborate dirty/writeback throttling mechanism inside the
> reclaim scanner, but for that to work the pages have to go through
> shrink_page_list() and get counted for what they are. Otherwise, we
> mess up the LRU order and don't match reclaim speed to writeback.
> 
> Especially during deactivation, there is never a reason to skip dirty
> pages; nothing is even trying to write them out from there. Don't mess
> up the LRU order for nothing, shuffle these pages along.

absolutely agreed.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mmzone.h |  2 --
>  mm/vmscan.c            | 14 ++------------
>  2 files changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index df992831fde7..338a786a993f 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -236,8 +236,6 @@ struct lruvec {
>  #define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
>  #define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
>  
> -/* Isolate clean file */
> -#define ISOLATE_CLEAN		((__force isolate_mode_t)0x1)
>  /* Isolate unmapped file */
>  #define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x2)
>  /* Isolate for asynchronous migration */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7bb23ff229b6..0d05f7f3b532 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -87,6 +87,7 @@ struct scan_control {
>  	/* The highest zone to isolate pages for reclaim from */
>  	enum zone_type reclaim_idx;
>  
> +	/* Writepage batching in laptop mode; RECLAIM_WRITE */
>  	unsigned int may_writepage:1;
>  
>  	/* Can mapped pages be reclaimed? */
> @@ -1373,13 +1374,10 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  	 * wants to isolate pages it will be able to operate on without
>  	 * blocking - clean pages for the most part.
>  	 *
> -	 * ISOLATE_CLEAN means that only clean pages should be isolated. This
> -	 * is used by reclaim when it is cannot write to backing storage
> -	 *
>  	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants to pages
>  	 * that it is possible to migrate without blocking
>  	 */
> -	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
> +	if (mode & ISOLATE_ASYNC_MIGRATE) {
>  		/* All the caller can do on PageWriteback is block */
>  		if (PageWriteback(page))
>  			return ret;
> @@ -1387,10 +1385,6 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
>  		if (PageDirty(page)) {
>  			struct address_space *mapping;
>  
> -			/* ISOLATE_CLEAN means only clean pages */
> -			if (mode & ISOLATE_CLEAN)
> -				return ret;
> -
>  			/*
>  			 * Only pages without mappings or that have a
>  			 * ->migratepage callback are possible to migrate
> @@ -1731,8 +1725,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  	if (!sc->may_unmap)
>  		isolate_mode |= ISOLATE_UNMAPPED;
> -	if (!sc->may_writepage)
> -		isolate_mode |= ISOLATE_CLEAN;
>  
>  	spin_lock_irq(&pgdat->lru_lock);
>  
> @@ -1929,8 +1921,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  
>  	if (!sc->may_unmap)
>  		isolate_mode |= ISOLATE_UNMAPPED;
> -	if (!sc->may_writepage)
> -		isolate_mode |= ISOLATE_CLEAN;
>  
>  	spin_lock_irq(&pgdat->lru_lock);
>  
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 13:16     ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-26 13:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon 23-01-17 13:16:38, Johannes Weiner wrote:
> Memory pressure can put dirty pages at the end of the LRU without
> anybody running into dirty limits. Don't start writing individual
> pages from kswapd while the flushers might be asleep.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/writeback.h        |  2 +-
>  include/trace/events/writeback.h |  2 +-
>  mm/vmscan.c                      | 18 +++++++++++++-----
>  3 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 5527d910ba3d..a3c0cbd7c888 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -46,7 +46,7 @@ enum writeback_sync_modes {
>   */
>  enum wb_reason {
>  	WB_REASON_BACKGROUND,
> -	WB_REASON_TRY_TO_FREE_PAGES,
> +	WB_REASON_VMSCAN,
>  	WB_REASON_SYNC,
>  	WB_REASON_PERIODIC,
>  	WB_REASON_LAPTOP_TIMER,
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 2ccd9ccbf9ef..7bd8783a590f 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -31,7 +31,7 @@
>  
>  #define WB_WORK_REASON							\
>  	EM( WB_REASON_BACKGROUND,		"background")		\
> -	EM( WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages")	\
> +	EM( WB_REASON_VMSCAN,			"vmscan")		\
>  	EM( WB_REASON_SYNC,			"sync")			\
>  	EM( WB_REASON_PERIODIC,			"periodic")		\
>  	EM( WB_REASON_LAPTOP_TIMER,		"laptop_timer")		\
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0d05f7f3b532..56ea8d24041f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1798,12 +1798,20 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  		/*
>  		 * If dirty pages are scanned that are not queued for IO, it
> -		 * implies that flushers are not keeping up. In this case, flag
> -		 * the pgdat PGDAT_DIRTY and kswapd will start writing pages from
> -		 * reclaim context.
> +		 * implies that flushers are not doing their job. This can
> +		 * happen when memory pressure pushes dirty pages to the end
> +		 * of the LRU without the dirty limits being breached. It can
> +		 * also happen when the proportion of dirty pages grows not
> +		 * through writes but through memory pressure reclaiming all
> +		 * the clean cache. And in some cases, the flushers simply
> +		 * cannot keep up with the allocation rate. Nudge the flusher
> +		 * threads in case they are asleep, but also allow kswapd to
> +		 * start writing pages during reclaim.
>  		 */
> -		if (stat.nr_unqueued_dirty == nr_taken)
> +		if (stat.nr_unqueued_dirty == nr_taken) {
> +			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
>  			set_bit(PGDAT_DIRTY, &pgdat->flags);
> +		}
>  
>  		/*
>  		 * If kswapd scans pages marked marked for immediate
> @@ -2787,7 +2795,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
>  		if (total_scanned > writeback_threshold) {
>  			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
> -						WB_REASON_TRY_TO_FREE_PAGES);
> +						WB_REASON_VMSCAN);
>  			sc->may_writepage = 1;
>  		}
>  	} while (--sc->priority >= 0);
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 13:21     ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-26 13:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon 23-01-17 13:16:39, Johannes Weiner wrote:
> Direct reclaim has been replaced by kswapd reclaim in pretty much all
> common memory pressure situations, so this code most likely doesn't
> accomplish the described effect anymore. The previous patch wakes up
> flushers for all reclaimers when we encounter dirty pages at the tail
> end of the LRU. Remove the crufty old direct reclaim invocation.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 17 -----------------
>  1 file changed, 17 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 56ea8d24041f..915fc658de41 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2757,8 +2757,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  					  struct scan_control *sc)
>  {
>  	int initial_priority = sc->priority;
> -	unsigned long total_scanned = 0;
> -	unsigned long writeback_threshold;
>  retry:
>  	delayacct_freepages_start();
>  
> @@ -2771,7 +2769,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		sc->nr_scanned = 0;
>  		shrink_zones(zonelist, sc);
>  
> -		total_scanned += sc->nr_scanned;
>  		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
>  			break;
>  
> @@ -2784,20 +2781,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		 */
>  		if (sc->priority < DEF_PRIORITY - 2)
>  			sc->may_writepage = 1;
> -
> -		/*
> -		 * Try to write back as many pages as we just scanned.  This
> -		 * tends to cause slow streaming writers to write data to the
> -		 * disk smoothly, at the dirtying rate, which is nice.   But
> -		 * that's undesirable in laptop mode, where we *want* lumpy
> -		 * writeout.  So in laptop mode, write out the whole world.
> -		 */
> -		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
> -		if (total_scanned > writeback_threshold) {
> -			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
> -						WB_REASON_VMSCAN);
> -			sc->may_writepage = 1;
> -		}
>  	} while (--sc->priority >= 0);
>  
>  	delayacct_freepages_end();
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 4/5] mm: vmscan: only write dirty pages that the scanner has seen twice
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 13:29     ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-26 13:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon 23-01-17 13:16:40, Johannes Weiner wrote:
> Dirty pages can easily reach the end of the LRU while there are still
> clean pages to reclaim around. Don't let kswapd write them back just
> because there are a lot of them. It costs more CPU to find the clean
> pages, but that's almost certainly better than to disrupt writeback
> from the flushers with LRU-order single-page writes from reclaim. And
> the flushers have been woken up by that point, so we spend IO capacity
> on flushing and CPU capacity on finding the clean cache.
> 
> Only start writing dirty pages if they have cycled around the LRU
> twice now and STILL haven't been queued on the IO device. It's
> possible that the dirty pages are so sparsely distributed across
> different bdis, inodes, memory cgroups, that the flushers take forever
> to get to the ones we want reclaimed. Once we see them twice on the
> LRU, we know that's the quicker way to find them, so do LRU writeback.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmscan.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 915fc658de41..df0fe0cc438e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1153,13 +1153,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  		if (PageDirty(page)) {
>  			/*
> -			 * Only kswapd can writeback filesystem pages to
> -			 * avoid risk of stack overflow but only writeback
> -			 * if many dirty pages have been encountered.
> +			 * Only kswapd can writeback filesystem pages
> +			 * to avoid risk of stack overflow. But avoid
> +			 * injecting inefficient single-page IO into
> +			 * flusher writeback as much as possible: only
> +			 * write pages when we've encountered many
> +			 * dirty pages, and when we've already scanned
> +			 * the rest of the LRU for clean pages and see
> +			 * the same dirty pages again (PageReclaim).
>  			 */
>  			if (page_is_file_cache(page) &&
> -					(!current_is_kswapd() ||
> -					 !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
> +			    (!current_is_kswapd() || !PageReclaim(page) ||
> +			     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
>  				/*
>  				 * Immediately reclaim when written back.
>  				 * Similar in principal to deactivate_page()
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-01-23 18:16   ` Johannes Weiner
@ 2017-01-26 13:52     ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-26 13:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon 23-01-17 13:16:41, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 
> This can be traced back to recent efforts of thrash avoidance. Where
> 3.10 would not detect refaulting pages and continuously supply clean
> cache to the inactive list, a thrashing workload on 4.0+ will detect
> and activate refaulting pages right away, distilling used-once pages
> on the inactive list much more effectively. This is by design, and it
> makes sense for clean cache. But for the most part our workload's
> cache faults are refaults and its use-once cache is from streaming
> writes. We end up with most of the inactive list dirty, and we don't
> go after the active cache as long as we have use-once pages around.
> 
> But waiting for writes to avoid reclaiming clean cache that *might*
> refault is a bad trade-off. Even if the refaults happen, reads are
> faster than writes. Before getting bogged down on writeback, reclaim
> should first look at *all* cache in the system, even active cache.
> 
> To accomplish this, activate pages that have been dirty or under
> writeback for two inactive LRU cycles. We know at this point that
> there are not enough clean inactive pages left to satisfy memory
> demand in the system. The pages are marked for immediate reclaim,
> meaning they'll get moved back to the inactive LRU tail as soon as
> they're written back and become reclaimable. But in the meantime, by
> reducing the inactive list to only immediately reclaimable pages, we
> allow the scanner to deactivate and refill the inactive list with
> clean cache from the active list tail to guarantee forward progress.

I was worried that the inactive list could shrink too low and lead to a
premature OOM declaration, but should_reclaim_retry should cope with this
because it considers NR_ZONE_WRITE_PENDING, which includes both dirty and
writeback pages.
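
(Roughly, the idea is something like the following sketch -- illustrative
only, not the exact should_reclaim_retry() code; the helper name and the
factor of two are made up:)

	/*
	 * Sketch: before declaring OOM, treat dirty and writeback pages
	 * as still-reclaimable and stall on IO instead of giving up.
	 */
	static bool wait_for_write_pending(struct zone *zone,
					   unsigned long reclaimable)
	{
		unsigned long write_pending =
			zone_page_state_snapshot(zone, NR_ZONE_WRITE_PENDING);

		/* most of what is left is waiting on IO: wait, don't OOM */
		if (2 * write_pending > reclaimable) {
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;
		}
		return false;
	}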

That being said, the patch makes sense to me.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm_inline.h | 7 +++++++
>  mm/swap.c                 | 9 +++++----
>  mm/vmscan.c               | 6 +++---
>  3 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 41d376e7116d..e030a68ead7e 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -50,6 +50,13 @@ static __always_inline void add_page_to_lru_list(struct page *page,
>  	list_add(&page->lru, &lruvec->lists[lru]);
>  }
>  
> +static __always_inline void add_page_to_lru_list_tail(struct page *page,
> +				struct lruvec *lruvec, enum lru_list lru)
> +{
> +	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
> +	list_add_tail(&page->lru, &lruvec->lists[lru]);
> +}
> +
>  static __always_inline void del_page_from_lru_list(struct page *page,
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
> diff --git a/mm/swap.c b/mm/swap.c
> index aabf2e90fe32..c4910f14f957 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,9 +209,10 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
>  {
>  	int *pgmoved = arg;
>  
> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> -		enum lru_list lru = page_lru_base_type(page);
> -		list_move_tail(&page->lru, &lruvec->lists[lru]);
> +	if (PageLRU(page) && !PageUnevictable(page)) {
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
> +		ClearPageActive(page);
> +		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
>  		(*pgmoved)++;
>  	}
>  }
> @@ -235,7 +236,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
>   */
>  void rotate_reclaimable_page(struct page *page)
>  {
> -	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> +	if (!PageLocked(page) && !PageDirty(page) &&
>  	    !PageUnevictable(page) && PageLRU(page)) {
>  		struct pagevec *pvec;
>  		unsigned long flags;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df0fe0cc438e..947ab6f4db10 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			    PageReclaim(page) &&
>  			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>  				nr_immediate++;
> -				goto keep_locked;
> +				goto activate_locked;
>  
>  			/* Case 2 above */
>  			} else if (sane_reclaim(sc) ||
> @@ -1081,7 +1081,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				 */
>  				SetPageReclaim(page);
>  				nr_writeback++;
> -				goto keep_locked;
> +				goto activate_locked;
>  
>  			/* Case 3 above */
>  			} else {
> @@ -1174,7 +1174,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
>  				SetPageReclaim(page);
>  
> -				goto keep_locked;
> +				goto activate_locked;
>  			}
>  
>  			if (references == PAGEREF_RECLAIM_CLEAN)
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
@ 2017-01-26 13:52     ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-26 13:52 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Mon 23-01-17 13:16:41, Johannes Weiner wrote:
> We noticed a performance regression when moving hadoop workloads from
> 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> activity initiated by kswapd as well as frequent bursts of allocation
> stalls and direct reclaim scans. Even lowering the dirty ratios to the
> equivalent of less than 1% of memory would not eliminate the issue,
> suggesting that dirty pages concentrate where the scanner is looking.
> 
> This can be traced back to recent efforts of thrash avoidance. Where
> 3.10 would not detect refaulting pages and continuously supply clean
> cache to the inactive list, a thrashing workload on 4.0+ will detect
> and activate refaulting pages right away, distilling used-once pages
> on the inactive list much more effectively. This is by design, and it
> makes sense for clean cache. But for the most part our workload's
> cache faults are refaults and its use-once cache is from streaming
> writes. We end up with most of the inactive list dirty, and we don't
> go after the active cache as long as we have use-once pages around.
> 
> But waiting for writes to avoid reclaiming clean cache that *might*
> refault is a bad trade-off. Even if the refaults happen, reads are
> faster than writes. Before getting bogged down on writeback, reclaim
> should first look at *all* cache in the system, even active cache.
> 
> To accomplish this, activate pages that have been dirty or under
> writeback for two inactive LRU cycles. We know at this point that
> there are not enough clean inactive pages left to satisfy memory
> demand in the system. The pages are marked for immediate reclaim,
> meaning they'll get moved back to the inactive LRU tail as soon as
> they're written back and become reclaimable. But in the meantime, by
> reducing the inactive list to only immediately reclaimable pages, we
> allow the scanner to deactivate and refill the inactive list with
> clean cache from the active list tail to guarantee forward progress.

I was worried that the inactive list can shrink too low and that could
lead to pre-mature OOM declaration but should_reclaim_retry should cope
with this because it considers NR_ZONE_WRITE_PENDING which includes both
dirty and writeback pages.

That being said the patch makes sense to me

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm_inline.h | 7 +++++++
>  mm/swap.c                 | 9 +++++----
>  mm/vmscan.c               | 6 +++---
>  3 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 41d376e7116d..e030a68ead7e 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -50,6 +50,13 @@ static __always_inline void add_page_to_lru_list(struct page *page,
>  	list_add(&page->lru, &lruvec->lists[lru]);
>  }
>  
> +static __always_inline void add_page_to_lru_list_tail(struct page *page,
> +				struct lruvec *lruvec, enum lru_list lru)
> +{
> +	update_lru_size(lruvec, lru, page_zonenum(page), hpage_nr_pages(page));
> +	list_add_tail(&page->lru, &lruvec->lists[lru]);
> +}
> +
>  static __always_inline void del_page_from_lru_list(struct page *page,
>  				struct lruvec *lruvec, enum lru_list lru)
>  {
> diff --git a/mm/swap.c b/mm/swap.c
> index aabf2e90fe32..c4910f14f957 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -209,9 +209,10 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
>  {
>  	int *pgmoved = arg;
>  
> -	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
> -		enum lru_list lru = page_lru_base_type(page);
> -		list_move_tail(&page->lru, &lruvec->lists[lru]);
> +	if (PageLRU(page) && !PageUnevictable(page)) {
> +		del_page_from_lru_list(page, lruvec, page_lru(page));
> +		ClearPageActive(page);
> +		add_page_to_lru_list_tail(page, lruvec, page_lru(page));
>  		(*pgmoved)++;
>  	}
>  }
> @@ -235,7 +236,7 @@ static void pagevec_move_tail(struct pagevec *pvec)
>   */
>  void rotate_reclaimable_page(struct page *page)
>  {
> -	if (!PageLocked(page) && !PageDirty(page) && !PageActive(page) &&
> +	if (!PageLocked(page) && !PageDirty(page) &&
>  	    !PageUnevictable(page) && PageLRU(page)) {
>  		struct pagevec *pvec;
>  		unsigned long flags;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index df0fe0cc438e..947ab6f4db10 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1063,7 +1063,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			    PageReclaim(page) &&
>  			    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
>  				nr_immediate++;
> -				goto keep_locked;
> +				goto activate_locked;
>  
>  			/* Case 2 above */
>  			} else if (sane_reclaim(sc) ||
> @@ -1081,7 +1081,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				 */
>  				SetPageReclaim(page);
>  				nr_writeback++;
> -				goto keep_locked;
> +				goto activate_locked;
>  
>  			/* Case 3 above */
>  			} else {
> @@ -1174,7 +1174,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  				inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
>  				SetPageReclaim(page);
>  
> -				goto keep_locked;
> +				goto activate_locked;
>  			}
>  
>  			if (references == PAGEREF_RECLAIM_CLEAN)
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
  2017-01-26  9:57     ` Mel Gorman
@ 2017-01-26 17:47       ` Johannes Weiner
  -1 siblings, 0 replies; 60+ messages in thread
From: Johannes Weiner @ 2017-01-26 17:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Jan 26, 2017 at 09:57:45AM +0000, Mel Gorman wrote:
> On Mon, Jan 23, 2017 at 01:16:38PM -0500, Johannes Weiner wrote:
> > Memory pressure can put dirty pages at the end of the LRU without
> > anybody running into dirty limits. Don't start writing individual
> > pages from kswapd while the flushers might be asleep.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> I don't understand the motivation for changing the wb_reason name. Maybe
> it was easier to eyeball while reading ftrace output. The comment about the
> flushers not doing their job could also describe something as simple as the
> writes taking place and clean pages being reclaimed before dirty_expire was
> reached. Not impossible if there was a light writer combined with a heavy
> reader or a large number of anonymous faults.

The name change was only because try_to_free_pages() wasn't the only
function doing this flusher wakeup anymore. I associate that name with
direct reclaim rather than reclaim in general, so I figured this makes
more sense. No strong feelings either way, but I doubt this will break
anything in userspace.
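
For reference, the rename boils down to a one-liner in
include/linux/writeback.h plus the matching tracepoint string; roughly
(sketch, surrounding enum members elided):

 enum wb_reason {
 	...
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_VMSCAN,
 	...
 };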

The comment on dirty expiration is a good point. Let's add this to the
list of reasons why reclaim might run into dirty data. Fixlet below.

> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks!

---

From 44c4289ab85c0af66cb06de6d1bb72a5c67fd755 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 26 Jan 2017 12:41:39 -0500
Subject: [PATCH] mm: vmscan: kick flushers when we encounter dirty pages on
 the LRU fix

Mention dirty expiration as a condition: we need dirty data that is
too recent for periodic flushing and not large enough for waking up
limit flushing. As per Mel.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 56ea8d24041f..ccd4bf952cb3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1799,15 +1799,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		/*
 		 * If dirty pages are scanned that are not queued for IO, it
 		 * implies that flushers are not doing their job. This can
-		 * happen when memory pressure pushes dirty pages to the end
-		 * of the LRU without the dirty limits being breached. It can
-		 * also happen when the proportion of dirty pages grows not
-		 * through writes but through memory pressure reclaiming all
-		 * the clean cache. And in some cases, the flushers simply
-		 * cannot keep up with the allocation rate. Nudge the flusher
-		 * threads in case they are asleep, but also allow kswapd to
-		 * start writing pages during reclaim.
+		 * happen when memory pressure pushes dirty pages to the end of
+		 * the LRU before the dirty limits are breached and the dirty
+		 * data has expired. It can also happen when the proportion of
+		 * dirty pages grows not through writes but through memory
+		 * pressure reclaiming all the clean cache. And in some cases,
+		 * the flushers simply cannot keep up with the allocation
+		 * rate. Nudge the flusher threads in case they are asleep, but
+		 * also allow kswapd to start writing pages during reclaim.
 		 */
 		if (stat.nr_unqueued_dirty == nr_taken) {
 			wakeup_flusher_threads(0, WB_REASON_VMSCAN);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU
  2017-01-26 17:47       ` Johannes Weiner
@ 2017-01-26 18:47         ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26 18:47 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Jan 26, 2017 at 12:47:39PM -0500, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 09:57:45AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:38PM -0500, Johannes Weiner wrote:
> > > Memory pressure can put dirty pages at the end of the LRU without
> > > anybody running into dirty limits. Don't start writing individual
> > > pages from kswapd while the flushers might be asleep.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > I don't understand the motivation for checking the wb_reason name. Maybe
> > it was easier to eyeball while reading ftraces. The comment about the
> > flusher not doing its job could also be as simple as the writes took
> > place and clean pages were reclaimed before dirty_expire was reached.
> > Not impossible if there was a light writer combined with a heavy reader
> > or a large number of anonymous faults.
> 
> The name change was only because try_to_free_pages() wasn't the only
> function doing this flusher wakeup anymore.

Ah, ok. I was thinking of it in terms of "we are trying to free pages"
and not the specific name of the direct reclaim function.

> I associate that name with
> direct reclaim rather than reclaim in general, so I figured this makes
> more sense. No strong feelings either way, but I doubt this will break
> anything in userspace.
> 

Doubtful, maybe some tracing analysis scripts but they routinely have
to adapt.

> The comment on dirty expiration is a good point. Let's add this to the
> list of reasons why reclaim might run into dirty data. Fixlet below.
> 

Looks good.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-01-26 10:05     ` Mel Gorman
@ 2017-01-26 18:50       ` Johannes Weiner
  -1 siblings, 0 replies; 60+ messages in thread
From: Johannes Weiner @ 2017-01-26 18:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > common memory pressure situations, so this code most likely doesn't
> > accomplish the described effect anymore. The previous patch wakes up
> > flushers for all reclaimers when we encounter dirty pages at the tail
> > end of the LRU. Remove the crufty old direct reclaim invocation.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> In general I like this. I worried first that if kswapd is blocked
> writing pages that it won't reach the wakeup_flusher_threads but the
> previous patch handles it.
> 
> Now though, it occurs to me with the last patch that we always writeout
> the world when flushing threads. This may not be a great idea. Consider
> for example if there is a heavy writer of short-lived tmp files. In such a
> case, it is possible for the files to be truncated before they even hit the
> disk. However, if there are multiple "writeout the world" calls, these may
> now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> could all be requested to writeout the world and each request unplugs.
> 
> Is it possible to maintain the property of writing back pages relative
> to the numbers of pages scanned or have you determined already that it's
> not necessary?

That's what I started out with - waking the flushers for nr_taken. I
was using a silly test case that wrote < dirty background limit and
then allocated a burst of anon memory. When the dirty data is linear,
the bigger IO requests are beneficial. They don't exhaust struct
request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
only 32), and they require less frequent plugging.
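
To make the two options concrete against the shrink_inactive_list() hunk
from patch 2 (sketch only, not a literal diff):

	/* what the series does: nr_pages == 0 means "write back everything" */
	if (stat.nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(0, WB_REASON_VMSCAN);

	/* what I started out with: ask only for about as much as we scanned */
	if (stat.nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(nr_taken, WB_REASON_VMSCAN);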

Force-flushing temporary files under memory pressure is a concern -
although the most recently dirtied files would get queued last, giving
them still some time to get truncated - but I'm wary about splitting
the flush requests too aggressively when we DO sustain throngs of
dirty pages hitting the reclaim scanners.

I didn't test this with the real workload that gave us problems yet,
though, because deploying enough machines to get a good sample size
takes 1-2 days and to run through the full load spectrum another 4-5.
So it's harder to fine-tune these patches.

But this is a legit concern. I'll try to find out what happens when we
reduce the wakeups to nr_taken.

Given the problem these patches address, though, would you be okay
with keeping this patch in -mm? We're too far into 4.10 to merge it
upstream now, and I should have data on more precise wakeups before
the next merge window.

Thanks

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-01-26 10:19     ` Mel Gorman
@ 2017-01-26 20:07       ` Johannes Weiner
  -1 siblings, 0 replies; 60+ messages in thread
From: Johannes Weiner @ 2017-01-26 20:07 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Jan 26, 2017 at 10:19:16AM +0000, Mel Gorman wrote:
> On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> > We noticed a performance regression when moving hadoop workloads from
> > 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> > activity initiated by kswapd as well as frequent bursts of allocation
> > stalls and direct reclaim scans. Even lowering the dirty ratios to the
> > equivalent of less than 1% of memory would not eliminate the issue,
> > suggesting that dirty pages concentrate where the scanner is looking.
> 
> Note that some of this is also impacted by
> bbddabe2e436aa7869b3ac5248df5c14ddde0cbf because it can have the effect
> of dirty pages reaching the end of the LRU sooner if they are being
> written. It's not impossible that hadoop is rewriting the same files,
> hitting the end of the LRU due to no reads and then throwing reclaim
> into a hole.
> 
> I've seen a few cases where random write only workloads regressed and it
> was based on whether the random number generator was selecting the same
> pages. With that commit, the LRU was effectively LIFO.
> 
> Similarly, I'd seen a case where a database whose working set was
> larger than the shared memory area regressed because the spill-over from
> the database buffer to RAM was not being preserved because it was all
> writes. That said, the same patch prevents the database being swapped so
> it's not all bad but there have been consequences.
> 
> I don't have a problem with the patch although would prefer to have seen
> more data for the series. However, I'm not entirely convinced that
> thrash detection was the only problem. I think not activating pages on
> write was a contributing factor although this patch looks better than
> considering reverting bbddabe2e436aa7869b3ac5248df5c14ddde0cbf.

We didn't backport this commit into our 4.6 kernel, so it couldn't
have been a factor in our particular testing. But I will fully agree
with you that this change probably exacerbates the problem.

Another example is the recent shrinking of the inactive list:
59dc76b0d4df ("mm: vmscan: reduce size of inactive file list"). That
one we did in fact backport, after which the problem we were already
debugging got worse. That was a good hint where the problem was:

Every time we got better at keeping the clean hot cache separated out
on the active list, we increased the concentration of dirty pages on
the inactive list. Whether this is workingset.c activating refaulting
pages, whether that's not activating writeback cache, or whether that
is shrinking the inactive list size, they all worked toward exposing
the same deficiency in the reclaim-writeback model: that waiting for
writes is worse than potentially causing reads. That flaw has always
been there - since we had wait_on_page_writeback() in the reclaim
scanner and the split between inactive and active cache. It was just
historically much harder to trigger problems like this in practice.

That's why this is a regression over a period of kernel development
and cannot really be pinpointed to a specific commit.

This patch, by straight-up putting dirty/writeback pages at the head
of the combined page cache double LRU regardless of access frequency,
is making an explicit update to the reclaim-writeback model to codify
the trade-off between writes and potential refaults. Any alternative
(implementation differences aside of course) would require regressing
use-once separation to previous levels in some form.
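
For the record, the mechanics this relies on are the PG_reclaim rotation
that this patch extends; condensed, the path looks roughly like this
(sketch, not literal kernel code):

	/* shrink_page_list(): page is dirty or under writeback */
	SetPageReclaim(page);		/* hand it back once it is clean */
	goto activate_locked;		/* park it on the active list for now */

	/* end_page_writeback(): IO has completed */
	if (PageReclaim(page)) {
		ClearPageReclaim(page);
		rotate_reclaimable_page(page);
	}

	/* pagevec_move_tail_fn(): back to the inactive tail, where the
	 * scanner will see it first on the next pass */
	del_page_from_lru_list(page, lruvec, page_lru(page));
	ClearPageActive(page);
	add_page_to_lru_list_tail(page, lruvec, page_lru(page));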

The lack of data is not great, agreed as well. The thing I can say is
that for the hadoop workloads - and this is a whole spectrum of jobs
running on hundreds of machines in a test group over several days -
this patch series restores average job completions, allocation stalls,
amount of kswapd-initiated IO, sys% and iowait% to 3.10 levels - with
a high confidence, and no obvious metric that could have regressed.

Is there something specific that you would like to see tested? Aside
from trying that load with more civilized flusher wakeups in kswapd?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-01-26 18:50       ` Johannes Weiner
@ 2017-01-26 20:45         ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26 20:45 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Jan 26, 2017 at 01:50:27PM -0500, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > > common memory pressure situations, so this code most likely doesn't
> > > accomplish the described effect anymore. The previous patch wakes up
> > > flushers for all reclaimers when we encounter dirty pages at the tail
> > > end of the LRU. Remove the crufty old direct reclaim invocation.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > In general I like this. I worried first that if kswapd is blocked
> > writing pages that it won't reach the wakeup_flusher_threads but the
> > previous patch handles it.
> > 
> > Now though, it occurs to me with the last patch that we always writeout
> > the world when flushing threads. This may not be a great idea. Consider
> > for example if there is a heavy writer of short-lived tmp files. In such a
> > case, it is possible for the files to be truncated before they even hit the
> > disk. However, if there are multiple "writeout the world" calls, these may
> > now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> > could all be requested to writeout the world and each request unplugs.
> > 
> > Is it possible to maintain the property of writing back pages relative
> > to the numbers of pages scanned or have you determined already that it's
> > not necessary?
> 
> That's what I started out with - waking the flushers for nr_taken. I
> was using a silly test case that wrote < dirty background limit and
> then allocated a burst of anon memory. When the dirty data is linear,
> the bigger IO requests are beneficial. They don't exhaust struct
> request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
> only 32), and they require less frequent plugging.
> 

Understood.

> Force-flushing temporary files under memory pressure is a concern -
> although the most recently dirtied files would get queued last, giving
> them still some time to get truncated - but I'm wary about splitting
> the flush requests too aggressively when we DO sustain throngs of
> dirty pages hitting the reclaim scanners.
> 

That's fair enough. It's rare to see a case where a tmp file being
written instead of truncated in RAM causes problems. The only one that
really springs to mind is dbench3 whose "performance" often relied on
whether the files were truncated before writeback.

> I didn't test this with the real workload that gave us problems yet,
> though, because deploying enough machines to get a good sample size
> takes 1-2 days and to run through the full load spectrum another 4-5.
> So it's harder to fine-tune these patches.
> 
> But this is a legit concern. I'll try to find out what happens when we
> reduce the wakeups to nr_taken.
> 
> Given the problem these patches address, though, would you be okay
> with keeping this patch in -mm? We're too far into 4.10 to merge it
> upstream now, and I should have data on more precise wakeups before
> the next merge window.
> 

Yeah, that's fine. My concern is mostly theoretical but it's something
to watch out for in future regression reports. It should be relatively
easy to spot -- workload generates lots of short-lived tmp files for
whatever reason and reports that write IO is higher causing the system
to stall other IO requests.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed
  2017-01-26 20:07       ` Johannes Weiner
@ 2017-01-26 20:58         ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-26 20:58 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu, Jan 26, 2017 at 03:07:45PM -0500, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 10:19:16AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:41PM -0500, Johannes Weiner wrote:
> > > We noticed a performance regression when moving hadoop workloads from
> > > 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout
> > > activity initiated by kswapd as well as frequent bursts of allocation
> > > stalls and direct reclaim scans. Even lowering the dirty ratios to the
> > > equivalent of less than 1% of memory would not eliminate the issue,
> > > suggesting that dirty pages concentrate where the scanner is looking.
> > 
> > Note that some of this is also impacted by
> > bbddabe2e436aa7869b3ac5248df5c14ddde0cbf because it can have the effect
> > of dirty pages reaching the end of the LRU sooner if they are being
> > written. It's not impossible that hadoop is rewriting the same files,
> > hitting the end of the LRU due to no reads and then throwing reclaim
> > into a hole.
> > 
> > I've seen a few cases where random write only workloads regressed and it
> > was based on whether the random number generator was selecting the same
> > pages. With that commit, the LRU was effectively LIFO.
> > 
> > Similarly, I'd seen a case where a database whose working set was
> > larger than the shared memory area regressed because the spill-over from
> > the database buffer to RAM was not being preserved because it was all
> > writes. That said, the same patch prevents the database being swapped so
> > it's not all bad but there have been consequences.
> > 
> > I don't have a problem with the patch although would prefer to have seen
> > more data for the series. However, I'm not entirely convinced that
> > thrash detection was the only problem. I think not activating pages on
> > write was a contributing factor although this patch looks better than
> > considering reverting bbddabe2e436aa7869b3ac5248df5c14ddde0cbf.
> 
> We didn't backport this commit into our 4.6 kernel, so it couldn't
> have been a factor in our particular testing. But I will fully agree
> with you that this change probably exacerbates the problem.
> 

Ah, ok. I was not aware the patch couldn't have been part of what you
were seeing.

> Another example is the recent shrinking of the inactive list:
> 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list"). That
> one we did in fact backport, after which the problem we were already
> debugging got worse. That was a good hint where the problem was:
> 
> Every time we got better at keeping the clean hot cache separated out
> on the active list, we increased the concentration of dirty pages on
> the inactive list.

Somewhat ironic because the improved separation increases the
chances of kswapd writing out pages and direct reclaimers stalling on
wait_iff_congested.

> Whether this is workingset.c activating refaulting
> pages, whether that's not activating writeback cache, or whether that
> is shrinking the inactive list size, they all worked toward exposing
> the same deficiency in the reclaim-writeback model: that waiting for
> writes is worse than potentially causing reads. That flaw has always
> been there - since we had wait_on_page_writeback() in the reclaim
> scanner and the split between inactive and active cache. It was just
> historically much harder to trigger problems like this in practice.
> 
> That's why this is a regression over a period of kernel development
> and cannot really be pinpointed to a specific commit.
> 

Understood.

> This patch, by straight-up putting dirty/writeback pages at the head
> of the combined page cache double LRU regardless of access frequency,
> is making an explicit update to the reclaim-writeback model to codify
> the trade-off between writes and potential refaults. Any alternative
> (implementation differences aside of course) would require regressing
> use-once separation to previous levels in some form.
> 
> The lack of data is not great, agreed as well. The thing I can say is
> that for the hadoop workloads - and this is a whole spectrum of jobs
> running on hundreds of machines in a test group over several days -
> this patch series restores average job completions, allocation stalls,
> amount of kswapd-initiated IO, sys% and iowait% to 3.10 levels - with
> a high confidence, and no obvious metric that could have regressed.
> 

That's fair enough. It's rarely the case that a regression in a complex
workload has a single root cause. If it was, bisections would always work.

> Is there something specific that you would like to see tested? Aside
> from trying that load with more civilized flusher wakeups in kswapd?

Nothing specific that I'll force on you. At some point I'll shove Chris's
simoop workload through it, as it allegedly has similar properties to what
you're seeing. I only got around to examining it last week to see how it
behaved. It was very obvious that between 4.4 and 4.9 it started writing
heavily from reclaim context. However, it had also stopped swapping, which
points towards the grab_cache_page_write() commit. Kswapd scan rates had
also doubled. Detailed examination of the stall stats showed extremely long
stalls. I expect these patches to have an impact and would be surprised
if they didn't.

Similarly, any random read/write workload that is write intensive might
also be interesting although that might just hit the dirty balancing limits
if not tuned properly.

A write-only sysbench would also be interesting. That is also a workload
that between 4.4 and 4.9 had regressed severely. Partly this was dirty
pages getting to the tail of the LRU and the other part was the random
number generator reusing some pages that the activations preserved. I
think your patches would at least mitigate the first problem.

If you have the chance to do any of them, it would be nice, but the
patches make enough sense from plain review. If I thought they were
shakier, I would make more of a fuss.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-01-26 18:50       ` Johannes Weiner
@ 2017-01-27 12:01         ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2017-01-27 12:01 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Andrew Morton, linux-mm, linux-kernel, kernel-team

On Thu 26-01-17 13:50:27, Johannes Weiner wrote:
> On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > > common memory pressure situations, so this code most likely doesn't
> > > accomplish the described effect anymore. The previous patch wakes up
> > > flushers for all reclaimers when we encounter dirty pages at the tail
> > > end of the LRU. Remove the crufty old direct reclaim invocation.
> > > 
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > In general I like this. I worried first that if kswapd is blocked
> > writing pages that it won't reach the wakeup_flusher_threads but the
> > previous patch handles it.
> > 
> > Now though, it occurs to me with the last patch that we always writeout
> > the world when flushing threads. This may not be a great idea. Consider
> > for example if there is a heavy writer of short-lived tmp files. In such a
> > case, it is possible for the files to be truncated before they even hit the
> > disk. However, if there are multiple "writeout the world" calls, these may
> > now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> > could all be requested to writeout the world and each request unplugs.
> > 
> > Is it possible to maintain the property of writing back pages relative
> > to the numbers of pages scanned or have you determined already that it's
> > not necessary?
> 
> That's what I started out with - waking the flushers for nr_taken. I
> was using a silly test case that wrote < dirty background limit and
> then allocated a burst of anon memory. When the dirty data is linear,
> the bigger IO requests are beneficial. They don't exhaust struct
> request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
> only 32), and they require less frequent plugging.
> 
> Force-flushing temporary files under memory pressure is a concern -
> although the most recently dirtied files would get queued last, giving
> them still some time to get truncated - but I'm wary about splitting
> the flush requests too aggressively when we DO sustain throngs of
> dirty pages hitting the reclaim scanners.

I think the above would be helpful in the changelog for future
reference.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path
  2017-01-27 12:01         ` Michal Hocko
@ 2017-01-27 14:27           ` Mel Gorman
  -1 siblings, 0 replies; 60+ messages in thread
From: Mel Gorman @ 2017-01-27 14:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel, kernel-team

On Fri, Jan 27, 2017 at 01:01:01PM +0100, Michal Hocko wrote:
> On Thu 26-01-17 13:50:27, Johannes Weiner wrote:
> > On Thu, Jan 26, 2017 at 10:05:09AM +0000, Mel Gorman wrote:
> > > On Mon, Jan 23, 2017 at 01:16:39PM -0500, Johannes Weiner wrote:
> > > > Direct reclaim has been replaced by kswapd reclaim in pretty much all
> > > > common memory pressure situations, so this code most likely doesn't
> > > > accomplish the described effect anymore. The previous patch wakes up
> > > > flushers for all reclaimers when we encounter dirty pages at the tail
> > > > end of the LRU. Remove the crufty old direct reclaim invocation.
> > > > 
> > > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > 
> > > In general I like this. I worried first that if kswapd is blocked
> > > writing pages that it won't reach the wakeup_flusher_threads but the
> > > previous patch handles it.
> > > 
> > > Now though, it occurs to me with the last patch that we always writeout
> > > the world when flushing threads. This may not be a great idea. Consider
> > > for example if there is a heavy writer of short-lived tmp files. In such a
> > > case, it is possible for the files to be truncated before they even hit the
> > > disk. However, if there are multiple "writeout the world" calls, these may
> > > now be hitting the disk. Furthermore, multiple kswapd and direct reclaimers
> > > could all be requested to writeout the world and each request unplugs.
> > > 
> > > Is it possible to maintain the property of writing back pages relative
> > > to the numbers of pages scanned or have you determined already that it's
> > > not necessary?
> > 
> > That's what I started out with - waking the flushers for nr_taken. I
> > was using a silly test case that wrote < dirty background limit and
> > then allocated a burst of anon memory. When the dirty data is linear,
> > the bigger IO requests are beneficial. They don't exhaust struct
> > request (like kswapd 4k IO routinely does, and SWAP_CLUSTER_MAX is
> > only 32), and they require less frequent plugging.
> > 
> > Force-flushing temporary files under memory pressure is a concern -
> > although the most recently dirtied files would get queued last, giving
> > them still some time to get truncated - but I'm wary about splitting
> > the flush requests too aggressively when we DO sustain throngs of
> > dirty pages hitting the reclaim scanners.
> 
> I think the above would be helpful in the changelog for future
> reference.
> 

Agreed. I backported the series to 4.10-rc5 with one minor conflict and
ran a couple of tests on it. A mixed random read/write workload didn't show
anything interesting. A write-only database workload didn't show much
difference in performance but there were slight reductions in IO --
probably in the noise.

simoop did show big differences, although not as big as I expected. This
is Chris Mason's workload that simulates the VM activity of hadoop. I
won't go through the full details, but over the samples measured during
an hour it reported:

                                         4.10.0-rc5            4.10.0-rc5
                                            vanilla         johannes-v1r1
Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)

The latencies are actually completely horrific in comparison to 4.4 (and
4.10-rc5 is worse than 4.9 according to historical data for reasons I
haven't analysed yet).

Still, p95 write latency (p95-Write) drops by roughly 99% with the series
and allocation latency is way down. Direct reclaim activity is one fifth of
what it was according to vmstats. Kswapd activity is higher, but this is not
necessarily surprising. Kswapd efficiency is unchanged at 99% (99% of pages
scanned were reclaimed) but direct reclaim efficiency went from 77% to 99%.
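
For reference, "efficiency" here is simply pages reclaimed as a fraction of
pages scanned. A minimal userspace sketch that pulls the cumulative figures
out of /proc/vmstat -- counter names assumed to be the post-4.8
pgscan_*/pgsteal_* ones, so treat it as illustrative only:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val;
	unsigned long long scan_kswapd = 0, steal_kswapd = 0;
	unsigned long long scan_direct = 0, steal_direct = 0;

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* /proc/vmstat is "name value" per line */
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "pgscan_kswapd"))
			scan_kswapd = val;
		else if (!strcmp(name, "pgsteal_kswapd"))
			steal_kswapd = val;
		else if (!strcmp(name, "pgscan_direct"))
			scan_direct = val;
		else if (!strcmp(name, "pgsteal_direct"))
			steal_direct = val;
	}
	fclose(f);

	if (scan_kswapd)
		printf("kswapd efficiency: %.1f%%\n",
		       100.0 * steal_kswapd / scan_kswapd);
	if (scan_direct)
		printf("direct efficiency: %.1f%%\n",
		       100.0 * steal_direct / scan_direct);
	return 0;
}

Sampling it before and after a run and working on the deltas gives per-run
numbers comparable to the ones above.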

In the vanilla kernel, 627MB of data was written back from reclaim
context. With the series, no data was written back. With or without the
patch, pages are being immediately reclaimed after writeback completes.
However, with the patch, only 1/8th of the pages are reclaimed like
this.

I expect you've done plenty of internal analysis but FWIW, I can confirm
that for some basic tests that exercise this area, on one machine, it's
looking good and roughly matches my expectations.
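
To make the tradeoff in the quoted discussion concrete, here is a rough
sketch -- not the actual patch, with the trigger condition and reason code
as placeholders -- of the two flusher wakeup strategies, assuming the
4.10-era prototype void wakeup_flusher_threads(long nr_pages, enum wb_reason
reason), where nr_pages == 0 means "clean roughly everything":

#include <linux/writeback.h>
#include <linux/swap.h>		/* SWAP_CLUSTER_MAX */

static void kick_flushers(unsigned long nr_taken,
			  unsigned long nr_unqueued_dirty)
{
	/* Only bother once the tail of the inactive LRU is all dirty. */
	if (nr_unqueued_dirty < nr_taken)
		return;

	/*
	 * Write out the world: nr_pages == 0 asks the flushers to clean
	 * roughly all dirty pages, which builds large, efficient requests
	 * for linear writers but may also push short-lived tmp files to
	 * disk.
	 */
	wakeup_flusher_threads(0, WB_REASON_TRY_TO_FREE_PAGES);

	/*
	 * Alternative that was tried first: flush proportionally to the
	 * pages just isolated (at most SWAP_CLUSTER_MAX == 32), which is
	 * gentler on tmp files but issues many small requests:
	 *
	 *	wakeup_flusher_threads(nr_taken, WB_REASON_TRY_TO_FREE_PAGES);
	 */
}

Per the discussion above, the series as posted takes the first,
write-out-the-world form.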

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2017-01-27 14:44 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-23 18:16 [PATCH 0/5] mm: vmscan: fix kswapd writeback regression Johannes Weiner
2017-01-23 18:16 ` Johannes Weiner
2017-01-23 18:16 ` [PATCH 1/5] mm: vmscan: scan dirty pages even in laptop mode Johannes Weiner
2017-01-23 18:16   ` Johannes Weiner
2017-01-26  1:27   ` Minchan Kim
2017-01-26  1:27     ` Minchan Kim
2017-01-26  9:52   ` Mel Gorman
2017-01-26  9:52     ` Mel Gorman
2017-01-26 13:13   ` Michal Hocko
2017-01-26 13:13     ` Michal Hocko
2017-01-23 18:16 ` [PATCH 2/5] mm: vmscan: kick flushers when we encounter dirty pages on the LRU Johannes Weiner
2017-01-23 18:16   ` Johannes Weiner
2017-01-26  1:35   ` Minchan Kim
2017-01-26  1:35     ` Minchan Kim
2017-01-26  9:57   ` Mel Gorman
2017-01-26  9:57     ` Mel Gorman
2017-01-26 17:47     ` Johannes Weiner
2017-01-26 17:47       ` Johannes Weiner
2017-01-26 18:47       ` Mel Gorman
2017-01-26 18:47         ` Mel Gorman
2017-01-26 13:16   ` Michal Hocko
2017-01-26 13:16     ` Michal Hocko
2017-01-23 18:16 ` [PATCH 3/5] mm: vmscan: remove old flusher wakeup from direct reclaim path Johannes Weiner
2017-01-23 18:16   ` Johannes Weiner
2017-01-26  1:38   ` Minchan Kim
2017-01-26  1:38     ` Minchan Kim
2017-01-26 10:05   ` Mel Gorman
2017-01-26 10:05     ` Mel Gorman
2017-01-26 18:50     ` Johannes Weiner
2017-01-26 18:50       ` Johannes Weiner
2017-01-26 20:45       ` Mel Gorman
2017-01-26 20:45         ` Mel Gorman
2017-01-27 12:01       ` Michal Hocko
2017-01-27 12:01         ` Michal Hocko
2017-01-27 14:27         ` Mel Gorman
2017-01-27 14:27           ` Mel Gorman
2017-01-26 13:21   ` Michal Hocko
2017-01-26 13:21     ` Michal Hocko
2017-01-23 18:16 ` [PATCH 4/5] mm: vmscan: only write dirty pages that the scanner has seen twice Johannes Weiner
2017-01-23 18:16   ` Johannes Weiner
2017-01-26  1:42   ` Minchan Kim
2017-01-26  1:42     ` Minchan Kim
2017-01-26 10:08   ` Mel Gorman
2017-01-26 10:08     ` Mel Gorman
2017-01-26 13:29   ` Michal Hocko
2017-01-26 13:29     ` Michal Hocko
2017-01-23 18:16 ` [PATCH 5/5] mm: vmscan: move dirty pages out of the way until they're flushed Johannes Weiner
2017-01-23 18:16   ` Johannes Weiner
2017-01-26  1:47   ` Minchan Kim
2017-01-26  1:47     ` Minchan Kim
2017-01-26 10:19   ` Mel Gorman
2017-01-26 10:19     ` Mel Gorman
2017-01-26 20:07     ` Johannes Weiner
2017-01-26 20:07       ` Johannes Weiner
2017-01-26 20:58       ` Mel Gorman
2017-01-26 20:58         ` Mel Gorman
2017-01-26 13:52   ` Michal Hocko
2017-01-26 13:52     ` Michal Hocko
2017-01-26  5:44 ` [PATCH 0/5] mm: vmscan: fix kswapd writeback regression Hillf Danton
2017-01-26  5:44   ` Hillf Danton
