* [PATCH 0/9] mm: kswapd spinning on unreclaimable nodes - fixes and cleanups
@ 2017-02-28 21:39 ` Johannes Weiner
  0 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

Hi,

Jia reported a scenario in which the kswapd of a node indefinitely
spins at 100% CPU usage. We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node
(or whether to back off and sleep) is based on the number of pages
scanned in proportion to the number of reclaimable pages. In Jia's and
our scenarios, however, there are no reclaimable pages left in the
node, so the condition for backing off is never met. Kswapd busyloops
in an attempt to restore the watermarks while having nothing to work
with.

This series reworks the definition of an unreclaimable node to be
based not on scanning but on whether kswapd is able to actually
reclaim pages within MAX_RECLAIM_RETRIES (16) consecutive runs. This
is the same criterion the page allocator uses for giving up on direct
reclaim and invoking the OOM killer. If it cannot free any pages,
kswapd will go to sleep and leave further attempts to direct reclaim
invocations, which will either make progress and re-enable kswapd, or
invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
and directly related to #5, but not in itself essential to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

Thanks

 include/linux/mmzone.h |   3 +-
 mm/internal.h          |   7 +-
 mm/migrate.c           |   3 -
 mm/page_alloc.c        |  39 +++--------
 mm/vmscan.c            | 169 ++++++++++++++++++-----------------------------
 mm/vmstat.c            |  24 ++-----
 6 files changed, 89 insertions(+), 156 deletions(-)

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:39   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance. Up until
1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
kswapd had such a mechanism. It considered zones whose theoretically
reclaimable pages it had reclaimed six times over as unreclaimable and
backed away from them. This guard was erroneously removed as the patch
changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met. Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off. If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node. This is the same number of shots
direct reclaim takes before declaring OOM. Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.
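The failure accounting described above can be sketched as a small
userspace model (all names and types here are illustrative stand-ins,
not the kernel's actual code):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

struct node_state {
	int kswapd_failures;	/* consecutive 'reclaimed == 0' runs */
};

/* Called after each balancing run with the number of pages freed.
 * Any progress, including from a later direct reclaim run, resets
 * the counter and thereby revives a dormant kswapd. */
static void account_reclaim_run(struct node_state *node,
				unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		node->kswapd_failures = 0;
	else
		node->kswapd_failures++;
}

/* Should kswapd stay asleep and leave the node to direct reclaim? */
static bool node_is_hopeless(const struct node_state *node)
{
	return node->kswapd_failures >= MAX_RECLAIM_RETRIES;
}
```

In the kernel patch below, the same three pieces appear as the
counter increment in balance_pgdat(), the reset in shrink_node(), and
the early returns in prepare_kswapd_sleep() and wakeup_kswapd().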

v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)

Reported-by: Jia He <hejianet@gmail.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  2 ++
 mm/internal.h          |  6 ++++++
 mm/page_alloc.c        |  9 ++-------
 mm/vmscan.c            | 27 ++++++++++++++++++++-------
 mm/vmstat.c            |  2 +-
 5 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e02b3750fe0..d2c50ab6ae40 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -630,6 +630,8 @@ typedef struct pglist_data {
 	int kswapd_order;
 	enum zone_type kswapd_classzone_idx;
 
+	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
+
 #ifdef CONFIG_COMPACTION
 	int kcompactd_max_order;
 	enum zone_type kcompactd_classzone_idx;
diff --git a/mm/internal.h b/mm/internal.h
index ccfc2a2969f4..aae93e3fd984 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
 /*
+ * Maximum number of reclaim retries without progress before the OOM
+ * killer is considered the only way forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 614cd0397ce3..f50e36e7b024 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 }
 
 /*
- * Maximum number of reclaim retries without any progress before OOM killer
- * is consider as the only way to move forward.
- */
-#define MAX_RECLAIM_RETRIES 16
-
-/*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
  * The reclaim feedback represented by did_some_progress (any progress during
@@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			node_page_state(pgdat, NR_PAGES_SCANNED),
-			!pgdat_reclaimable(pgdat) ? "yes" : "no");
+			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
+				"yes" : "no");
 	}
 
 	for_each_populated_zone(zone) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..407b27831ff7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 					 sc->nr_scanned - nr_scanned, sc));
 
+	/*
+	 * Kswapd gives up on balancing particular nodes after too
+	 * many failures to reclaim anything from them and goes to
+	 * sleep. On reclaim progress, reset the failure counter. A
+	 * successful direct reclaim run will revive a dormant kswapd.
+	 */
+	if (reclaimable)
+		pgdat->kswapd_failures = 0;
+
 	return reclaimable;
 }
 
@@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
-			if (sc->priority != DEF_PRIORITY &&
-			    !pgdat_reclaimable(zone->zone_pgdat))
-				continue;	/* Let kswapd poll it */
-
 			/*
 			 * If we already have plenty of memory free for
 			 * compaction in this zone, don't free any more.
@@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return true;
+
 	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;
 
@@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
+	if (!sc.nr_reclaimed)
+		pgdat->kswapd_failures++;
+
 out:
 	/*
 	 * Return the order kswapd stopped reclaiming at as
@@ -3515,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
+	/* Hopeless node, leave it to direct reclaim */
+	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
+		return;
+
 	/* Only wake kswapd if all zones are unbalanced */
 	for (z = 0; z <= classzone_idx; z++) {
 		zone = pgdat->node_zones + z;
@@ -3785,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
 	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
 		return NODE_RECLAIM_FULL;
 
-	if (!pgdat_reclaimable(pgdat))
-		return NODE_RECLAIM_FULL;
-
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 69f9aff39a2e..ff16cdc15df2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1422,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n  node_unreclaimable:  %u"
 		   "\n  start_pfn:           %lu"
 		   "\n  node_inactive_ratio: %u",
-		   !pgdat_reclaimable(zone->zone_pgdat),
+		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
 		   zone->zone_start_pfn,
 		   zone->zone_pgdat->inactive_ratio);
 	seq_putc(m, '\n');
-- 
2.11.1


* [PATCH 2/9] mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
all free pages in each zone falls below half the min watermark. During
the summation, we want to exclude zones that don't have any
reclaimable pages. Checking the same pgdat over and over again doesn't
make sense.
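The corrected summation can be sketched like this (simplified
stand-in types, not the kernel's pfmemalloc_watermark_ok()): only
zones that are managed and have reclaimable pages contribute their min
watermark to the reserve, and reclaimers throttle while free pages sit
at or below half of it.

```c
#include <assert.h>
#include <stdbool.h>

struct fake_zone {
	unsigned long managed_pages;
	unsigned long reclaimable_pages;
	unsigned long min_wmark;
	unsigned long free_pages;
};

static bool pfmemalloc_watermark_ok(const struct fake_zone *zones,
				    int nr_zones)
{
	unsigned long pfmemalloc_reserve = 0;
	unsigned long free_pages = 0;

	for (int i = 0; i < nr_zones; i++) {
		if (!zones[i].managed_pages)		/* unmanaged: skip */
			continue;
		if (!zones[i].reclaimable_pages)	/* per-zone check, not per-pgdat */
			continue;
		pfmemalloc_reserve += zones[i].min_wmark;
		free_pages += zones[i].free_pages;
	}
	if (!pfmemalloc_reserve)
		return true;	/* no relevant zones, do not throttle */

	return free_pages > pfmemalloc_reserve / 2;
}
```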

Fixes: 599d0c954f91 ("mm, vmscan: move LRU lists to node")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 407b27831ff7..f006140f58c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2838,8 +2838,10 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
 	for (i = 0; i <= ZONE_NORMAL; i++) {
 		zone = &pgdat->node_zones[i];
-		if (!managed_zone(zone) ||
-		    pgdat_reclaimable_pages(pgdat) == 0)
+		if (!managed_zone(zone))
+			continue;
+
+		if (!zone_reclaimable_pages(zone))
 			continue;
 
 		pfmemalloc_reserve += min_wmark_pages(zone);
-- 
2.11.1


* [PATCH 3/9] mm: remove seemingly spurious reclaimability check from laptop_mode gating
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
allowed laptop_mode=1 to start writing not just when the priority
drops to DEF_PRIORITY - 2 but also when the node is unreclaimable.
That appears to be a spurious change in that patch: I doubt the series
was tested with laptop_mode, and the particular change is not
mentioned in the changelog. Remove it; the commit is still recent.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f006140f58c6..911957b66622 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3288,7 +3288,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
 		 */
-		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
+		if (sc.priority < DEF_PRIORITY - 2)
 			sc.may_writepage = 1;
 
 		/* Call soft limit reclaim before calling shrink_node. */
-- 
2.11.1


* [PATCH 4/9] mm: remove unnecessary reclaimability check from NUMA balancing target
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

NUMA balancing already checks the watermarks of the target node to
decide whether it's a suitable balancing target. Whether the node is
reclaimable or not is irrelevant when we don't intend to reclaim.
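The watermark test that remains after this patch can be sketched as
follows (simplified stand-in types and names, not the kernel's actual
migrate_balanced_pgdat()): a node is a usable balancing target if some
zone can absorb the migrated pages while staying above its high
watermark.

```c
#include <assert.h>
#include <stdbool.h>

struct target_zone {
	unsigned long free_pages;
	unsigned long high_wmark;
};

static bool node_is_balanced_target(const struct target_zone *zones,
				    int nr_zones,
				    unsigned long nr_migrate_pages)
{
	/* Walk zones from highest to lowest, as the kernel does. */
	for (int z = nr_zones - 1; z >= 0; z--)
		if (zones[z].free_pages >
		    zones[z].high_wmark + nr_migrate_pages)
			return true;	/* room for the incoming pages */
	return false;
}
```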

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/migrate.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 2c63ac06791b..45a18be27b1a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1718,9 +1718,6 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 {
 	int z;
 
-	if (!pgdat_reclaimable(pgdat))
-		return false;
-
 	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
 		struct zone *zone = pgdat->node_zones + z;
 
-- 
2.11.1


* [PATCH 5/9] mm: don't avoid high-priority reclaim on unreclaimable nodes
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
to avoid high reclaim priorities for kswapd by forcing it to scan a
minimum amount of pages when lru_pages >> priority yielded nothing.

b95a2f2d486d ("mm: vmscan: convert global reclaim to per-memcg LRU
lists"), due to switching global reclaim to a round-robin scheme over
all cgroups, had to restrict this forceful behavior to unreclaimable
zones in order to prevent massive overreclaim with many cgroups.

The latter patch effectively neutered the behavior for all but
situations of extreme memory pressure. But in those situations we
might as well drop the reclaimers to lower priority levels. Remove
the check.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 19 +++++--------------
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 911957b66622..46b6223fe7f3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2129,22 +2129,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	int pass;
 
 	/*
-	 * If the zone or memcg is small, nr[l] can be 0.  This
-	 * results in no scanning on this priority and a potential
-	 * priority drop.  Global direct reclaim can go to the next
-	 * zone and tends to have no problems. Global kswapd is for
-	 * zone balancing and it needs to scan a minimum amount. When
+	 * If the zone or memcg is small, nr[l] can be 0. When
 	 * reclaiming for a memcg, a priority drop can cause high
-	 * latencies, so it's better to scan a minimum amount there as
-	 * well.
+	 * latencies, so it's better to scan a minimum amount. When a
+	 * cgroup has already been deleted, scrape out the remaining
+	 * cache forcefully to get rid of the lingering state.
 	 */
-	if (current_is_kswapd()) {
-		if (!pgdat_reclaimable(pgdat))
-			force_scan = true;
-		if (!mem_cgroup_online(memcg))
-			force_scan = true;
-	}
-	if (!global_reclaim(sc))
+	if (!global_reclaim(sc) || !mem_cgroup_online(memcg))
 		force_scan = true;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
-- 
2.11.1


* [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
to avoid high reclaim priorities for memcg by forcing it to scan a
minimum number of pages when lru_pages >> priority yielded nothing.
This was done at a time when reclaim decisions like dirty throttling
were tied to the priority level.

Nowadays, the only meaningful thing still tied to priority dropping
below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
allowed to write. But that is from an era where direct reclaim was
still allowed to call ->writepage, and kswapd nowadays avoids writes
until it's scanned every clean page in the system. Potential changes
to how quickly sc->may_writepage could trigger are of little concern.

Remove the force_scan stuff, as well as the ugly multi-pass target
calculation that it necessitated.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 94 ++++++++++++++++++++++++-------------------------------------
 1 file changed, 37 insertions(+), 57 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 46b6223fe7f3..8cff6e2cd02c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2122,21 +2122,8 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	unsigned long anon_prio, file_prio;
 	enum scan_balance scan_balance;
 	unsigned long anon, file;
-	bool force_scan = false;
 	unsigned long ap, fp;
 	enum lru_list lru;
-	bool some_scanned;
-	int pass;
-
-	/*
-	 * If the zone or memcg is small, nr[l] can be 0. When
-	 * reclaiming for a memcg, a priority drop can cause high
-	 * latencies, so it's better to scan a minimum amount. When a
-	 * cgroup has already been deleted, scrape out the remaining
-	 * cache forcefully to get rid of the lingering state.
-	 */
-	if (!global_reclaim(sc) || !mem_cgroup_online(memcg))
-		force_scan = true;
 
 	/* If we have no swap space, do not bother scanning anon pages. */
 	if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
@@ -2267,55 +2254,48 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	fraction[1] = fp;
 	denominator = ap + fp + 1;
 out:
-	some_scanned = false;
-	/* Only use force_scan on second pass. */
-	for (pass = 0; !some_scanned && pass < 2; pass++) {
-		*lru_pages = 0;
-		for_each_evictable_lru(lru) {
-			int file = is_file_lru(lru);
-			unsigned long size;
-			unsigned long scan;
-
-			size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
-			scan = size >> sc->priority;
-
-			if (!scan && pass && force_scan)
-				scan = min(size, SWAP_CLUSTER_MAX);
-
-			switch (scan_balance) {
-			case SCAN_EQUAL:
-				/* Scan lists relative to size */
-				break;
-			case SCAN_FRACT:
-				/*
-				 * Scan types proportional to swappiness and
-				 * their relative recent reclaim efficiency.
-				 */
-				scan = div64_u64(scan * fraction[file],
-							denominator);
-				break;
-			case SCAN_FILE:
-			case SCAN_ANON:
-				/* Scan one type exclusively */
-				if ((scan_balance == SCAN_FILE) != file) {
-					size = 0;
-					scan = 0;
-				}
-				break;
-			default:
-				/* Look ma, no brain */
-				BUG();
-			}
+	*lru_pages = 0;
+	for_each_evictable_lru(lru) {
+		int file = is_file_lru(lru);
+		unsigned long size;
+		unsigned long scan;
 
-			*lru_pages += size;
-			nr[lru] = scan;
+		size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
+		scan = size >> sc->priority;
+		/*
+		 * If the cgroup's already been deleted, make sure to
+		 * scrape out the remaining cache.
+		 */
+		if (!scan && !mem_cgroup_online(memcg))
+			scan = min(size, SWAP_CLUSTER_MAX);
 
+		switch (scan_balance) {
+		case SCAN_EQUAL:
+			/* Scan lists relative to size */
+			break;
+		case SCAN_FRACT:
 			/*
-			 * Skip the second pass and don't force_scan,
-			 * if we found something to scan.
+			 * Scan types proportional to swappiness and
+			 * their relative recent reclaim efficiency.
 			 */
-			some_scanned |= !!scan;
+			scan = div64_u64(scan * fraction[file],
+					 denominator);
+			break;
+		case SCAN_FILE:
+		case SCAN_ANON:
+			/* Scan one type exclusively */
+			if ((scan_balance == SCAN_FILE) != file) {
+				size = 0;
+				scan = 0;
+			}
+			break;
+		default:
+			/* Look ma, no brain */
+			BUG();
 		}
+
+		*lru_pages += size;
+		nr[lru] = scan;
 	}
 }
 
-- 
2.11.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 7/9] mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

NR_PAGES_SCANNED counts the number of pages scanned since the last page
free event in the allocator. This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them. In that role, it has been replaced in the preceding
patches by a different mechanism.

Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well. It is, however, unlikely that anyone
outside the kernel is using this counter in any meaningful way.

Remove the counter and the unused pgdat_reclaimable().

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/mmzone.h |  1 -
 mm/internal.h          |  1 -
 mm/page_alloc.c        | 15 +++------------
 mm/vmscan.c            |  9 ---------
 mm/vmstat.c            | 22 +++-------------------
 5 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d2c50ab6ae40..04e0969966f6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -149,7 +149,6 @@ enum node_stat_item {
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
-	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
 	WORKINGSET_NODERECLAIM,
diff --git a/mm/internal.h b/mm/internal.h
index aae93e3fd984..c583ce1b32b9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -91,7 +91,6 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern bool pgdat_reclaimable(struct pglist_data *pgdat);
 
 /*
  * in mm/rmap.c:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f50e36e7b024..9ac639864bed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1088,15 +1088,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
-	unsigned long nr_scanned, flags;
+	unsigned long flags;
 	bool isolated_pageblocks;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	isolated_pageblocks = has_isolate_pageblock(zone);
-	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
-	if (nr_scanned)
-		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
-
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -1148,13 +1144,10 @@ static void free_one_page(struct zone *zone,
 				unsigned int order,
 				int migratetype)
 {
-	unsigned long nr_scanned, flags;
+	unsigned long flags;
+
 	spin_lock_irqsave(&zone->lock, flags);
 	__count_vm_events(PGFREE, 1 << order);
-	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
-	if (nr_scanned)
-		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
-
 	if (unlikely(has_isolate_pageblock(zone) ||
 		is_migrate_isolate(migratetype))) {
 		migratetype = get_pfnblock_migratetype(page, pfn);
@@ -4497,7 +4490,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 #endif
 			" writeback_tmp:%lukB"
 			" unstable:%lukB"
-			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -4520,7 +4512,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_SHMEM)),
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
-			node_page_state(pgdat, NR_PAGES_SCANNED),
 			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
 				"yes" : "no");
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8cff6e2cd02c..35b791a8922b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -229,12 +229,6 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 	return nr;
 }
 
-bool pgdat_reclaimable(struct pglist_data *pgdat)
-{
-	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
-		pgdat_reclaimable_pages(pgdat) * 6;
-}
-
 /**
  * lruvec_lru_size -  Returns the number of pages on the given LRU list.
  * @lruvec: lru vector
@@ -1749,7 +1743,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
-		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 		if (current_is_kswapd())
 			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
 		else
@@ -1952,8 +1945,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
-	if (global_reclaim(sc))
-		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
 	__count_vm_events(PGREFILL, nr_scanned);
 
 	spin_unlock_irq(&pgdat->lru_lock);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ff16cdc15df2..eface7467ea5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
 	"nr_unevictable",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-	"nr_pages_scanned",
 	"workingset_refault",
 	"workingset_activate",
 	"workingset_nodereclaim",
@@ -1375,7 +1374,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
-		   "\n   node_scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu"
 		   "\n        managed  %lu",
@@ -1383,7 +1381,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
 		   zone->spanned_pages,
 		   zone->present_pages,
 		   zone->managed_pages);
@@ -1584,22 +1581,9 @@ int vmstat_refresh(struct ctl_table *table, int write,
 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
 		val = atomic_long_read(&vm_zone_stat[i]);
 		if (val < 0) {
-			switch (i) {
-			case NR_PAGES_SCANNED:
-				/*
-				 * This is often seen to go negative in
-				 * recent kernels, but not to go permanently
-				 * negative.  Whilst it would be nicer not to
-				 * have exceptions, rooting them out would be
-				 * another task, of rather low priority.
-				 */
-				break;
-			default:
-				pr_warn("%s: %s %ld\n",
-					__func__, vmstat_text[i], val);
-				err = -EINVAL;
-				break;
-			}
+			pr_warn("%s: %s %ld\n",
+				__func__, vmstat_text[i], val);
+			err = -EINVAL;
 		}
 	}
 	if (err)
-- 
2.11.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 8/9] Revert "mm, vmscan: account for skipped pages as a partial scan"
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

This reverts commit d7f05528eedb047efe2288cff777676b028747b6.

Now that reclaimability of a node is no longer based on the ratio
between pages scanned and theoretically reclaimable pages, we can
remove accounting tricks for pages skipped due to zone constraints.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 35b791a8922b..ddcff8a11c1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1471,12 +1471,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
-	unsigned long skipped = 0, total_skipped = 0;
+	unsigned long skipped = 0;
 	unsigned long scan, nr_pages;
 	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
-					!list_empty(src);) {
+					!list_empty(src); scan++) {
 		struct page *page;
 
 		page = lru_to_page(src);
@@ -1490,12 +1490,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			continue;
 		}
 
-		/*
-		 * Account for scanned and skipped separetly to avoid the pgdat
-		 * being prematurely marked unreclaimable by pgdat_reclaimable.
-		 */
-		scan++;
-
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1524,6 +1518,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	if (!list_empty(&pages_skipped)) {
 		int zid;
 
+		list_splice(&pages_skipped, src);
 		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 			if (!nr_skipped[zid])
 				continue;
@@ -1531,17 +1526,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
 			skipped += nr_skipped[zid];
 		}
-
-		/*
-		 * Account skipped pages as a partial scan as the pgdat may be
-		 * close to unreclaimable. If the LRU list is empty, account
-		 * skipped pages as a full scan.
-		 */
-		total_skipped = list_empty(src) ? skipped : skipped >> 2;
-
-		list_splice(&pages_skipped, src);
 	}
-	*nr_scanned = scan + total_skipped;
+	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				    scan, skipped, nr_taken, mode, lru);
 	update_lru_sizes(lruvec, lru, nr_zone_taken);
-- 
2.11.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* [PATCH 9/9] mm: remove unnecessary back-off function when retrying page reclaim
  2017-02-28 21:39 ` Johannes Weiner
@ 2017-02-28 21:40   ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-02-28 21:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jia He, Michal Hocko, Mel Gorman, linux-mm, linux-kernel, kernel-team

The backoff mechanism is not needed. If we have MAX_RECLAIM_RETRIES
loops without progress, we'll OOM anyway; backing off might cut one or
two iterations off that in the rare OOM case. If we have intermittent
success reclaiming a few pages, the backoff function also gets reset,
and is of little help in these scenarios.

We might want a backoff function for when there IS progress, but not
enough to be satisfactory. But this isn't that. Remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ac639864bed..223644afed28 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3511,11 +3511,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
- * The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round) and no_progress_loops (number of reclaim rounds without
- * any progress in a row) is considered as well as the reclaimable pages on the
- * applicable zone list (with a backoff mechanism which is a function of
- * no_progress_loops).
+ *
+ * We give up when we either have tried MAX_RECLAIM_RETRIES in a row
+ * without success, or when we couldn't even meet the watermark if we
+ * reclaimed all remaining pages on the LRU lists.
  *
  * Returns true if a retry is viable or false to enter the oom path.
  */
@@ -3560,13 +3559,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		bool wmark;
 
 		available = reclaimable = zone_reclaimable_pages(zone);
-		available -= DIV_ROUND_UP((*no_progress_loops) * available,
-					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
 		/*
-		 * Would the allocation succeed if we reclaimed the whole
-		 * available?
+		 * Would the allocation succeed if we reclaimed all
+		 * reclaimable pages?
 		 */
 		wmark = __zone_watermark_ok(zone, order, min_wmark,
 				ac_classzone_idx(ac), alloc_flags, available);
-- 
2.11.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 9/9] mm: remove unnecessary back-off function when retrying page reclaim
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 14:56     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 14:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:07, Johannes Weiner wrote:
> The backoff mechanism is not needed. If we have MAX_RECLAIM_RETRIES
> loops without progress, we'll OOM anyway; backing off might cut one or
> two iterations off that in the rare OOM case. If we have intermittent
> success reclaiming a few pages, the backoff function gets reset also,
> and so is of little help in these scenarios.

Yes, as already mentioned elsewhere, the original intention was to allow a
more graceful OOM convergence when we are thrashing over the last few
reclaimable pages, but as the code evolved the result is not all that great.
 
> We might want a backoff function for when there IS progress, but not
> enough to be satisfactory. But this isn't that. Remove it.

Completely agreed.
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ac639864bed..223644afed28 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3511,11 +3511,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  /*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
> - * The reclaim feedback represented by did_some_progress (any progress during
> - * the last reclaim round) and no_progress_loops (number of reclaim rounds without
> - * any progress in a row) is considered as well as the reclaimable pages on the
> - * applicable zone list (with a backoff mechanism which is a function of
> - * no_progress_loops).
> + *
> + * We give up when we either have tried MAX_RECLAIM_RETRIES in a row
> + * without success, or when we couldn't even meet the watermark if we
> + * reclaimed all remaining pages on the LRU lists.
>   *
>   * Returns true if a retry is viable or false to enter the oom path.
>   */
> @@ -3560,13 +3559,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		bool wmark;
>  
>  		available = reclaimable = zone_reclaimable_pages(zone);
> -		available -= DIV_ROUND_UP((*no_progress_loops) * available,
> -					  MAX_RECLAIM_RETRIES);
>  		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  
>  		/*
> -		 * Would the allocation succeed if we reclaimed the whole
> -		 * available?
> +		 * Would the allocation succeed if we reclaimed all
> +		 * reclaimable pages?
>  		 */
>  		wmark = __zone_watermark_ok(zone, order, min_wmark,
>  				ac_classzone_idx(ac), alloc_flags, available);
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/9] mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:02     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:02 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:00, Johannes Weiner wrote:
> PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
> all free pages in each zone fall below half the min watermark. During
> the summation, we want to exclude zones that don't have reclaimables.
> Checking the same pgdat over and over again doesn't make sense.
> 
> Fixes: 599d0c954f91 ("mm, vmscan: move LRU lists to node")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 407b27831ff7..f006140f58c6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2838,8 +2838,10 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>  
>  	for (i = 0; i <= ZONE_NORMAL; i++) {
>  		zone = &pgdat->node_zones[i];
> -		if (!managed_zone(zone) ||
> -		    pgdat_reclaimable_pages(pgdat) == 0)
> +		if (!managed_zone(zone))
> +			continue;
> +
> +		if (!zone_reclaimable_pages(zone))
>  			continue;
>  
>  		pfmemalloc_reserve += min_wmark_pages(zone);
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 3/9] mm: remove seemingly spurious reclaimability check from laptop_mode gating
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:06     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:01, Johannes Weiner wrote:
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> allowed laptop_mode=1 to start writing not just when the priority
> drops to DEF_PRIORITY - 2 but also when the node is unreclaimable.
> That appears to be a spurious change in this patch as I doubt the
> series was tested with laptop_mode, and neither is that particular
> change mentioned in the changelog. Remove it, it's still recent.

The less pgdat_reclaimable we have the better, IMHO. If this is really
needed then I would appreciate a proper explanation, because each such
heuristic is just a head scratcher - especially after a few years, when
all the details are long forgotten.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f006140f58c6..911957b66622 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3288,7 +3288,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * If we're getting trouble reclaiming, start doing writepage
>  		 * even in laptop mode.
>  		 */
> -		if (sc.priority < DEF_PRIORITY - 2 || !pgdat_reclaimable(pgdat))
> +		if (sc.priority < DEF_PRIORITY - 2)
>  			sc.may_writepage = 1;
>  
>  		/* Call soft limit reclaim before calling shrink_node. */
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 4/9] mm: remove unnecessary reclaimability check from NUMA balancing target
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:14     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:02, Johannes Weiner wrote:
> NUMA balancing already checks the watermarks of the target node to
> decide whether it's a suitable balancing target. Whether the node is
> reclaimable or not is irrelevant when we don't intend to reclaim.

I guess the original intention was to skip nodes which are under strong
memory pressure, but I agree that this is a questionable heuristic.
 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/migrate.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2c63ac06791b..45a18be27b1a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1718,9 +1718,6 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
>  {
>  	int z;
>  
> -	if (!pgdat_reclaimable(pgdat))
> -		return false;
> -
>  	for (z = pgdat->nr_zones - 1; z >= 0; z--) {
>  		struct zone *zone = pgdat->node_zones + z;
>  
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 3/9] mm: remove seemingly spurious reclaimability check from laptop_mode gating
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:17     ` Mel Gorman
  -1 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2017-03-01 15:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Michal Hocko, linux-mm, linux-kernel, kernel-team

On Tue, Feb 28, 2017 at 04:40:01PM -0500, Johannes Weiner wrote:
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> allowed laptop_mode=1 to start writing not just when the priority
> drops to DEF_PRIORITY - 2 but also when the node is unreclaimable.
> That appears to be a spurious change in this patch as I doubt the
> series was tested with laptop_mode,

laptop_mode was not tested.

> and neither is that particular
> change mentioned in the changelog. Remove it, it's still recent.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/9] mm: don't avoid high-priority reclaim on unreclaimable nodes
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:21     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:21 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:03, Johannes Weiner wrote:
> 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> to avoid high reclaim priorities for kswapd by forcing it to scan a
> minimum amount of pages when lru_pages >> priority yielded nothing.
> 
> b95a2f2d486d ("mm: vmscan: convert global reclaim to per-memcg LRU
> lists"), due to switching global reclaim to a round-robin scheme over
> all cgroups, had to restrict this forceful behavior to unreclaimable
> zones in order to prevent massive overreclaim with many cgroups.
> 
> The latter patch effectively neutered the behavior completely for all
> but extreme memory pressure. But in those situations we might as well
> drop the reclaimers to lower priority levels. Remove the check.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 19 +++++--------------
>  1 file changed, 5 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 911957b66622..46b6223fe7f3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2129,22 +2129,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  	int pass;
>  
>  	/*
> -	 * If the zone or memcg is small, nr[l] can be 0.  This
> -	 * results in no scanning on this priority and a potential
> -	 * priority drop.  Global direct reclaim can go to the next
> -	 * zone and tends to have no problems. Global kswapd is for
> -	 * zone balancing and it needs to scan a minimum amount. When
> +	 * If the zone or memcg is small, nr[l] can be 0. When
>  	 * reclaiming for a memcg, a priority drop can cause high
> -	 * latencies, so it's better to scan a minimum amount there as
> -	 * well.
> +	 * latencies, so it's better to scan a minimum amount. When a
> +	 * cgroup has already been deleted, scrape out the remaining
> +	 * cache forcefully to get rid of the lingering state.
>  	 */
> -	if (current_is_kswapd()) {
> -		if (!pgdat_reclaimable(pgdat))
> -			force_scan = true;
> -		if (!mem_cgroup_online(memcg))
> -			force_scan = true;
> -	}
> -	if (!global_reclaim(sc))
> +	if (!global_reclaim(sc) || !mem_cgroup_online(memcg))
>  		force_scan = true;
>  
>  	/* If we have no swap space, do not bother scanning anon pages. */
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:40     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:04, Johannes Weiner wrote:
> 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> to avoid high reclaim priorities for memcg by forcing it to scan a
> minimum amount of pages when lru_pages >> priority yielded nothing.
> This was done at a time when reclaim decisions like dirty throttling
> were tied to the priority level.
> 
> Nowadays, the only meaningful thing still tied to priority dropping
> below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
> allowed to write. But that is from an era where direct reclaim was
> still allowed to call ->writepage, and kswapd nowadays avoids writes
> until it's scanned every clean page in the system. Potential changes
> to how quick sc->may_writepage could trigger are of little concern.
> 
> Remove the force_scan stuff, as well as the ugly multi-pass target
> calculation that it necessitated.

I _really_ like this, I hated the multi-pass part. One thing that I am
worried about, and that the changelog doesn't mention, is what we are
going to do about small (<16MB) memcgs. On one hand they were already
ignored in global reclaim, so this is nothing really new, but maybe we
want to preserve the behavior at least for memcg reclaim, which would
reduce the side effects of this patch - a great cleanup otherwise. Or at
least be explicit about this in the changelog.

Btw. why can't we simply force scanning at least SWAP_CLUSTER_MAX
unconditionally?

> +		/*
> +		 * If the cgroup's already been deleted, make sure to
> +		 * scrape out the remaining cache.
		   Also make sure that small memcgs will not go
		   unnoticed during memcg reclaim

> +		 */
> +		if (!scan && !mem_cgroup_online(memcg))

		if (!scan && (!mem_cgroup_online(memcg) || !global_reclaim(sc)))

> +			scan = min(size, SWAP_CLUSTER_MAX);
>  

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 7/9] mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:41     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:05, Johannes Weiner wrote:
> NR_PAGES_SCANNED counts number of pages scanned since the last page
> free event in the allocator. This was used primarily to measure the
> reclaimability of zones and nodes, and determine when reclaim should
> give up on them. In that role, it has been replaced in the preceding
> patches by a different mechanism.
> 
> Being implemented as an efficient vmstat counter, it was automatically
> exported to userspace as well. It's however unlikely that anyone
> outside the kernel is using this counter in any meaningful way.
> 
> Remove the counter and the unused pgdat_reclaimable().

\o/

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mmzone.h |  1 -
>  mm/internal.h          |  1 -
>  mm/page_alloc.c        | 15 +++------------
>  mm/vmscan.c            |  9 ---------
>  mm/vmstat.c            | 22 +++-------------------
>  5 files changed, 6 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d2c50ab6ae40..04e0969966f6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -149,7 +149,6 @@ enum node_stat_item {
>  	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
>  	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
>  	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
> -	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
>  	WORKINGSET_REFAULT,
>  	WORKINGSET_ACTIVATE,
>  	WORKINGSET_NODERECLAIM,
> diff --git a/mm/internal.h b/mm/internal.h
> index aae93e3fd984..c583ce1b32b9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -91,7 +91,6 @@ extern unsigned long highest_memmap_pfn;
>   */
>  extern int isolate_lru_page(struct page *page);
>  extern void putback_lru_page(struct page *page);
> -extern bool pgdat_reclaimable(struct pglist_data *pgdat);
>  
>  /*
>   * in mm/rmap.c:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f50e36e7b024..9ac639864bed 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1088,15 +1088,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  {
>  	int migratetype = 0;
>  	int batch_free = 0;
> -	unsigned long nr_scanned, flags;
> +	unsigned long flags;
>  	bool isolated_pageblocks;
>  
>  	spin_lock_irqsave(&zone->lock, flags);
>  	isolated_pageblocks = has_isolate_pageblock(zone);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	while (count) {
>  		struct page *page;
>  		struct list_head *list;
> @@ -1148,13 +1144,10 @@ static void free_one_page(struct zone *zone,
>  				unsigned int order,
>  				int migratetype)
>  {
> -	unsigned long nr_scanned, flags;
> +	unsigned long flags;
> +
>  	spin_lock_irqsave(&zone->lock, flags);
>  	__count_vm_events(PGFREE, 1 << order);
> -	nr_scanned = node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED);
> -	if (nr_scanned)
> -		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> -
>  	if (unlikely(has_isolate_pageblock(zone) ||
>  		is_migrate_isolate(migratetype))) {
>  		migratetype = get_pfnblock_migratetype(page, pfn);
> @@ -4497,7 +4490,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  #endif
>  			" writeback_tmp:%lukB"
>  			" unstable:%lukB"
> -			" pages_scanned:%lu"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -4520,7 +4512,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_SHMEM)),
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> -			node_page_state(pgdat, NR_PAGES_SCANNED),
>  			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
>  				"yes" : "no");
>  	}
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8cff6e2cd02c..35b791a8922b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -229,12 +229,6 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
>  	return nr;
>  }
>  
> -bool pgdat_reclaimable(struct pglist_data *pgdat)
> -{
> -	return node_page_state_snapshot(pgdat, NR_PAGES_SCANNED) <
> -		pgdat_reclaimable_pages(pgdat) * 6;
> -}
> -
>  /**
>   * lruvec_lru_size -  Returns the number of pages on the given LRU list.
>   * @lruvec: lru vector
> @@ -1749,7 +1743,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
>  	if (global_reclaim(sc)) {
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  		if (current_is_kswapd())
>  			__count_vm_events(PGSCAN_KSWAPD, nr_scanned);
>  		else
> @@ -1952,8 +1945,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
>  	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
>  	reclaim_stat->recent_scanned[file] += nr_taken;
>  
> -	if (global_reclaim(sc))
> -		__mod_node_page_state(pgdat, NR_PAGES_SCANNED, nr_scanned);
>  	__count_vm_events(PGREFILL, nr_scanned);
>  
>  	spin_unlock_irq(&pgdat->lru_lock);
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ff16cdc15df2..eface7467ea5 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -954,7 +954,6 @@ const char * const vmstat_text[] = {
>  	"nr_unevictable",
>  	"nr_isolated_anon",
>  	"nr_isolated_file",
> -	"nr_pages_scanned",
>  	"workingset_refault",
>  	"workingset_activate",
>  	"workingset_nodereclaim",
> @@ -1375,7 +1374,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n        min      %lu"
>  		   "\n        low      %lu"
>  		   "\n        high     %lu"
> -		   "\n   node_scanned  %lu"
>  		   "\n        spanned  %lu"
>  		   "\n        present  %lu"
>  		   "\n        managed  %lu",
> @@ -1383,7 +1381,6 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   min_wmark_pages(zone),
>  		   low_wmark_pages(zone),
>  		   high_wmark_pages(zone),
> -		   node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED),
>  		   zone->spanned_pages,
>  		   zone->present_pages,
>  		   zone->managed_pages);
> @@ -1584,22 +1581,9 @@ int vmstat_refresh(struct ctl_table *table, int write,
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
>  		val = atomic_long_read(&vm_zone_stat[i]);
>  		if (val < 0) {
> -			switch (i) {
> -			case NR_PAGES_SCANNED:
> -				/*
> -				 * This is often seen to go negative in
> -				 * recent kernels, but not to go permanently
> -				 * negative.  Whilst it would be nicer not to
> -				 * have exceptions, rooting them out would be
> -				 * another task, of rather low priority.
> -				 */
> -				break;
> -			default:
> -				pr_warn("%s: %s %ld\n",
> -					__func__, vmstat_text[i], val);
> -				err = -EINVAL;
> -				break;
> -			}
> +			pr_warn("%s: %s %ld\n",
> +				__func__, vmstat_text[i], val);
> +			err = -EINVAL;
>  		}
>  	}
>  	if (err)
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 8/9] Revert "mm, vmscan: account for skipped pages as a partial scan"
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-01 15:51     ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 15:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Tue 28-02-17 16:40:06, Johannes Weiner wrote:
> This reverts commit d7f05528eedb047efe2288cff777676b028747b6.
> 
> Now that reclaimability of a node is no longer based on the ratio
> between pages scanned and theoretically reclaimable pages, we can
> remove accounting tricks for pages skipped due to zone constraints.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/vmscan.c | 22 ++++------------------
>  1 file changed, 4 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35b791a8922b..ddcff8a11c1e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1471,12 +1471,12 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	unsigned long nr_taken = 0;
>  	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
>  	unsigned long nr_skipped[MAX_NR_ZONES] = { 0, };
> -	unsigned long skipped = 0, total_skipped = 0;
> +	unsigned long skipped = 0;
>  	unsigned long scan, nr_pages;
>  	LIST_HEAD(pages_skipped);
>  
>  	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
> -					!list_empty(src);) {
> +					!list_empty(src); scan++) {
>  		struct page *page;
>  
>  		page = lru_to_page(src);
> @@ -1490,12 +1490,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			continue;
>  		}
>  
> -		/*
> -		 * Account for scanned and skipped separetly to avoid the pgdat
> -		 * being prematurely marked unreclaimable by pgdat_reclaimable.
> -		 */
> -		scan++;
> -
>  		switch (__isolate_lru_page(page, mode)) {
>  		case 0:
>  			nr_pages = hpage_nr_pages(page);
> @@ -1524,6 +1518,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  	if (!list_empty(&pages_skipped)) {
>  		int zid;
>  
> +		list_splice(&pages_skipped, src);
>  		for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>  			if (!nr_skipped[zid])
>  				continue;
> @@ -1531,17 +1526,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			__count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
>  			skipped += nr_skipped[zid];
>  		}
> -
> -		/*
> -		 * Account skipped pages as a partial scan as the pgdat may be
> -		 * close to unreclaimable. If the LRU list is empty, account
> -		 * skipped pages as a full scan.
> -		 */
> -		total_skipped = list_empty(src) ? skipped : skipped >> 2;
> -
> -		list_splice(&pages_skipped, src);
>  	}
> -	*nr_scanned = scan + total_skipped;
> +	*nr_scanned = scan;
>  	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>  				    scan, skipped, nr_taken, mode, lru);
>  	update_lru_sizes(lruvec, lru, nr_zone_taken);
> -- 
> 2.11.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim
  2017-03-01 15:40     ` Michal Hocko
@ 2017-03-01 17:36       ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-03-01 17:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Wed, Mar 01, 2017 at 04:40:27PM +0100, Michal Hocko wrote:
> On Tue 28-02-17 16:40:04, Johannes Weiner wrote:
> > 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> > to avoid high reclaim priorities for memcg by forcing it to scan a
> > minimum amount of pages when lru_pages >> priority yielded nothing.
> > This was done at a time when reclaim decisions like dirty throttling
> > were tied to the priority level.
> > 
> > Nowadays, the only meaningful thing still tied to priority dropping
> > below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
> > allowed to write. But that is from an era where direct reclaim was
> > still allowed to call ->writepage, and kswapd nowadays avoids writes
> > until it's scanned every clean page in the system. Potential changes
> > to how quickly sc->may_writepage could trigger are of little concern.
> > 
> > Remove the force_scan stuff, as well as the ugly multi-pass target
> > calculation that it necessitated.
> 
> I _really_ like this, I hated the multi-pass part. One thing that I am
> worried about, and the changelog doesn't mention, is what we are going
> to do about small (<16MB) memcgs. On one hand they were already ignored
> in the global reclaim so this is nothing really new, but maybe we want
> to preserve the behavior for the memcg reclaim at least, which would
> reduce the side effect of this patch, which is a great cleanup
> otherwise. Or at least be explicit about this in the changelog.

<16MB groups are a legitimate concern during global reclaim, but we
have done it this way for a long time and it never seemed to matter
in practice.

And for limit reclaim, this should be much less of a concern. It just
means we no longer scan these groups at DEF_PRIORITY and will have to
increase the scan window. I don't see a problem with that. And that
consequence of higher priorities is right in the patch subject.

> Btw. why cannot we simply force scan at least SWAP_CLUSTER_MAX
> unconditionally?
> 
> > +		/*
> > +		 * If the cgroup's already been deleted, make sure to
> > +		 * scrape out the remaining cache.
> 		   Also make sure that small memcgs will not get
> 		   unnoticed during the memcg reclaim
> 
> > +		 */
> > +		if (!scan && !mem_cgroup_online(memcg))
> 
> 		if (!scan && (!mem_cgroup_online(memcg) || !global_reclaim(sc)))

With this I'd be worried about regressing the setups pointed out in
6f04f48dc9c0 ("mm: only force scan in reclaim when none of the LRUs
are big enough.").

Granted, that patch is a little dubious. IMO, we should be steering
the LRU balance through references and, in that case in particular,
with swappiness. Using the default 60 for zswap is too low.

Plus, I would expect the refault detection code that was introduced
around the same time as this patch to counter-act the hot file
thrashing that is mentioned in that patch's changelog.

Nevertheless, it seems a bit gratuitous to go against that change so
directly when global reclaim hasn't historically been a problem with
groups <16MB. Limit reclaim should be fine too.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim
  2017-03-01 17:36       ` Johannes Weiner
@ 2017-03-01 19:13         ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-01 19:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Mel Gorman, linux-mm, linux-kernel, kernel-team

On Wed 01-03-17 12:36:28, Johannes Weiner wrote:
> On Wed, Mar 01, 2017 at 04:40:27PM +0100, Michal Hocko wrote:
> > On Tue 28-02-17 16:40:04, Johannes Weiner wrote:
> > > 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> > > to avoid high reclaim priorities for memcg by forcing it to scan a
> > > minimum amount of pages when lru_pages >> priority yielded nothing.
> > > This was done at a time when reclaim decisions like dirty throttling
> > > were tied to the priority level.
> > > 
> > > Nowadays, the only meaningful thing still tied to priority dropping
> > > below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
> > > allowed to write. But that is from an era where direct reclaim was
> > > still allowed to call ->writepage, and kswapd nowadays avoids writes
> > > until it's scanned every clean page in the system. Potential changes
> > > to how quickly sc->may_writepage could trigger are of little concern.
> > > 
> > > Remove the force_scan stuff, as well as the ugly multi-pass target
> > > calculation that it necessitated.
> > 
> > I _really_ like this, I hated the multi-pass part. One thing that I
> > am worried about, and the changelog doesn't mention, is what we are
> > going to do about small (<16MB) memcgs. On one hand they were already
> > ignored in the global reclaim so this is nothing really new, but
> > maybe we want to preserve the behavior for the memcg reclaim at
> > least, which would reduce the side effect of this patch, which is a
> > great cleanup otherwise. Or at least be explicit about this in the
> > changelog.
> 
> <16MB groups are a legitimate concern during global reclaim, but we
> have done it this way for a long time and it never seemed to have
> mattered in practice.

Yeah, this is not really easy to spot because there are usually other
memcgs which can be reclaimed.

> And for limit reclaim, this should be much less of a concern. It just
> means we no longer scan these groups at DEF_PRIORITY and will have to
> increase the scan window. I don't see a problem with that. And that
> consequence of higher priorities is right in the patch subject.

Well, the memory pressure spills over to others in the same hierarchy.
But I agree this shouldn't be a disaster.

> > Btw. why cannot we simply force scan at least SWAP_CLUSTER_MAX
> > unconditionally?
> > 
> > > +		/*
> > > +		 * If the cgroup's already been deleted, make sure to
> > > +		 * scrape out the remaining cache.
> > 		   Also make sure that small memcgs will not get
> > 		   unnoticed during the memcg reclaim
> > 
> > > +		 */
> > > +		if (!scan && !mem_cgroup_online(memcg))
> > 
> > 		if (!scan && (!mem_cgroup_online(memcg) || !global_reclaim(sc)))
> 
> With this I'd be worried about regressing the setups pointed out in
> 6f04f48dc9c0 ("mm: only force scan in reclaim when none of the LRUs
> are big enough.").
> 
> Granted, that patch is a little dubious. IMO, we should be steering
> the LRU balance through references and, in that case in particular,
> with swappiness. Using the default 60 for zswap is too low.
> 
> Plus, I would expect the refault detection code that was introduced
> around the same time as this patch to counter-act the hot file
> thrashing that is mentioned in that patch's changelog.
> 
> Nevertheless, it seems a bit gratuitous to go against that change so
> directly when global reclaim hasn't historically been a problem with
> groups <16MB. Limit reclaim should be fine too.

As I've already mentioned, I really love this patch; I just think this is
a subtle side effect. The above reasoning should be good enough, I
believe.

Anyway, I forgot to add: I will leave it to you whether to put this in
a separate patch or just add it to the changelog.
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-02-28 21:39   ` Johannes Weiner
@ 2017-03-02  3:23     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:23 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team


On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 2/9] mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:25     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:25 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team


On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> PF_MEMALLOC direct reclaimers get throttled on a node when the sum of
> all free pages in each zone falls below half the min watermark. During
> the summation, we want to exclude zones that don't have reclaimables.
> Checking the same pgdat over and over again doesn't make sense.
> 
> Fixes: 599d0c954f91 ("mm, vmscan: move LRU lists to node")
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 3/9] mm: remove seemingly spurious reclaimability check from laptop_mode gating
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:27     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:27 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team

On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> allowed laptop_mode=1 to start writing not just when the priority
> drops to DEF_PRIORITY - 2 but also when the node is unreclaimable.
> That appears to be a spurious change in this patch as I doubt the
> series was tested with laptop_mode, and neither is that particular
> change mentioned in the changelog. Remove it, it's still recent.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 4/9] mm: remove unnecessary reclaimability check from NUMA balancing target
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:28     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:28 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team




On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> NUMA balancing already checks the watermarks of the target node to
> decide whether it's a suitable balancing target. Whether the node is
> reclaimable or not is irrelevant when we don't intend to reclaim.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 5/9] mm: don't avoid high-priority reclaim on unreclaimable nodes
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:31     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:31 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team


On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> to avoid high reclaim priorities for kswapd by forcing it to scan a
> minimum amount of pages when lru_pages >> priority yielded nothing.
> 
> b95a2f2d486d ("mm: vmscan: convert global reclaim to per-memcg LRU
> lists"), due to switching global reclaim to a round-robin scheme over
> all cgroups, had to restrict this forceful behavior to unreclaimable
> zones in order to prevent massive overreclaim with many cgroups.
> 
> The latter patch effectively neutered the behavior completely for all
> but extreme memory pressure. But in those situations we might as well
> drop the reclaimers to lower priority levels. Remove the check.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 19 +++++--------------
>  1 file changed, 5 insertions(+), 14 deletions(-)
> 
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:32     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:32 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team

On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> 246e87a93934 ("memcg: fix get_scan_count() for small targets") sought
> to avoid high reclaim priorities for memcg by forcing it to scan a
> minimum amount of pages when lru_pages >> priority yielded nothing.
> This was done at a time when reclaim decisions like dirty throttling
> were tied to the priority level.
> 
> Nowadays, the only meaningful thing still tied to priority dropping
> below DEF_PRIORITY - 2 is gating whether laptop_mode=1 is generally
> allowed to write. But that is from an era where direct reclaim was
> still allowed to call ->writepage, and kswapd nowadays avoids writes
> until it's scanned every clean page in the system. Potential changes
> to how quickly sc->may_writepage could trigger are of little concern.
> 
> Remove the force_scan stuff, as well as the ugly multi-pass target
> calculation that it necessitated.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 7/9] mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:34     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:34 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team

On March 01, 2017 5:40 AM Johannes Weiner wrote: 
> 
> NR_PAGES_SCANNED counts number of pages scanned since the last page
> free event in the allocator. This was used primarily to measure the
> reclaimability of zones and nodes, and determine when reclaim should
> give up on them. In that role, it has been replaced in the preceding
> patches by a different mechanism.
> 
> Being implemented as an efficient vmstat counter, it was automatically
> exported to userspace as well. It's however unlikely that anyone
> outside the kernel is using this counter in any meaningful way.
> 
> Remove the counter and the unused pgdat_reclaimable().
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 8/9] Revert "mm, vmscan: account for skipped pages as a partial scan"
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:36     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:36 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team

On March 01, 2017 5:40 AM Johannes Weiner wrote: 
>  
> This reverts commit d7f05528eedb047efe2288cff777676b028747b6.
> 
> Now that reclaimability of a node is no longer based on the ratio
> between pages scanned and theoretically reclaimable pages, we can
> remove accounting tricks for pages skipped due to zone constraints.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 9/9] mm: remove unnecessary back-off function when retrying page reclaim
  2017-02-28 21:40   ` Johannes Weiner
@ 2017-03-02  3:37     ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-02  3:37 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Andrew Morton'
  Cc: 'Jia He', 'Michal Hocko', 'Mel Gorman',
	linux-mm, linux-kernel, kernel-team


On March 01, 2017 5:40 AM Johannes Weiner wrote:
> 
> The backoff mechanism is not needed. If we have MAX_RECLAIM_RETRIES
> loops without progress, we'll OOM anyway; backing off might cut one or
> two iterations off that in the rare OOM case. If we have intermittent
> success reclaiming a few pages, the backoff function gets reset also,
> and so is of little help in these scenarios.
> 
> We might want a backoff function for when there IS progress, but not
> enough to be satisfactory. But this isn't that. Remove it.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-02-28 21:39   ` Johannes Weiner
@ 2017-03-02 23:30     ` Shakeel Butt
  -1 siblings, 0 replies; 80+ messages in thread
From: Shakeel Butt @ 2017-03-02 23:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Michal Hocko, Mel Gorman, Linux MM, LKML,
	kernel-team

On Tue, Feb 28, 2017 at 1:39 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
>
> $ echo 4000 >/proc/sys/vm/nr_hugepages
>
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
>
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
>
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
>
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
>
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
>

Should the condition of wait_event_killable in throttle_direct_reclaim
be changed to (pfmemalloc_watermark_ok(pgdat) ||
pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)?

> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
>
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h          |  6 ++++++
>  mm/page_alloc.c        |  9 ++-------
>  mm/vmscan.c            | 27 ++++++++++++++++++++-------
>  mm/vmstat.c            |  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>         int kswapd_order;
>         enum zone_type kswapd_classzone_idx;
>
> +       int kswapd_failures;            /* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>         int kcompactd_max_order;
>         enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>                         K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>                         K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>                         node_page_state(pgdat, NR_PAGES_SCANNED),
> -                       !pgdat_reclaimable(pgdat) ? "yes" : "no");
> +                       pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +                               "yes" : "no");
>         }
>
>         for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>         } while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>                                          sc->nr_scanned - nr_scanned, sc));
>
> +       /*
> +        * Kswapd gives up on balancing particular nodes after too
> +        * many failures to reclaim anything from them and goes to
> +        * sleep. On reclaim progress, reset the failure counter. A
> +        * successful direct reclaim run will revive a dormant kswapd.
> +        */
> +       if (reclaimable)
> +               pgdat->kswapd_failures = 0;
> +
>         return reclaimable;
>  }
>
> @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>                                                  GFP_KERNEL | __GFP_HARDWALL))
>                                 continue;
>
> -                       if (sc->priority != DEF_PRIORITY &&
> -                           !pgdat_reclaimable(zone->zone_pgdat))
> -                               continue;       /* Let kswapd poll it */
> -
>                         /*
>                          * If we already have plenty of memory free for
>                          * compaction in this zone, don't free any more.
> @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>         if (waitqueue_active(&pgdat->pfmemalloc_wait))
>                 wake_up_all(&pgdat->pfmemalloc_wait);
>
> +       /* Hopeless node, leave it to direct reclaim */
> +       if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +               return true;
> +
>         for (i = 0; i <= classzone_idx; i++) {
>                 struct zone *zone = pgdat->node_zones + i;
>
> @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>                         sc.priority--;
>         } while (sc.priority >= 1);
>
> +       if (!sc.nr_reclaimed)
> +               pgdat->kswapd_failures++;
> +
>  out:
>         /*
>          * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>         if (!waitqueue_active(&pgdat->kswapd_wait))
>                 return;
>
> +       /* Hopeless node, leave it to direct reclaim */
> +       if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +               return;
> +
>         /* Only wake kswapd if all zones are unbalanced */
>         for (z = 0; z <= classzone_idx; z++) {
>                 zone = pgdat->node_zones + z;
> @@ -3785,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>             sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>                 return NODE_RECLAIM_FULL;
>
> -       if (!pgdat_reclaimable(pgdat))
> -               return NODE_RECLAIM_FULL;
> -
>         /*
>          * Do not scan if the allocation should not be delayed.
>          */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff39a2e..ff16cdc15df2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1422,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n  node_unreclaimable:  %u"
>                    "\n  start_pfn:           %lu"
>                    "\n  node_inactive_ratio: %u",
> -                  !pgdat_reclaimable(zone->zone_pgdat),
> +                  pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
>                    zone->zone_start_pfn,
>                    zone->zone_pgdat->inactive_ratio);
>         seq_putc(m, '\n');
> --
> 2.11.1
>

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-02-28 21:39   ` Johannes Weiner
@ 2017-03-03  1:26     ` Minchan Kim
  -1 siblings, 0 replies; 80+ messages in thread
From: Minchan Kim @ 2017-03-03  1:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Michal Hocko, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

Hi Johannes,

On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h          |  6 ++++++
>  mm/page_alloc.c        |  9 ++-------
>  mm/vmscan.c            | 27 ++++++++++++++++++++-------
>  mm/vmstat.c            |  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>  
> +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>  
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>  
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>  			node_page_state(pgdat, NR_PAGES_SCANNED),
> -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +				"yes" : "no");
>  	}
>  
>  	for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  
> +	/*
> +	 * Kswapd gives up on balancing particular nodes after too
> +	 * many failures to reclaim anything from them and goes to
> +	 * sleep. On reclaim progress, reset the failure counter. A
> +	 * successful direct reclaim run will revive a dormant kswapd.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failures = 0;
> +
>  	return reclaimable;
>  }
>  
> @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
>  
> -			if (sc->priority != DEF_PRIORITY &&
> -			    !pgdat_reclaimable(zone->zone_pgdat))
> -				continue;	/* Let kswapd poll it */
> -
>  			/*
>  			 * If we already have plenty of memory free for
>  			 * compaction in this zone, don't free any more.
> @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>  
> +	/* Hopeless node, leave it to direct reclaim */

I'd like to clarify what we expect from deferring the job to direct reclaim.
Direct reclaim is a much more limited reclaim worker, constrained by
several things (e.g., it avoids writeback to limit stack usage, and may
run in NOIO|NOFS context), so what do we want the direct reclaimer to
do where even kswapd could not make forward progress? OOM?

> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  
> @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>  
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failures++;

sc.nr_reclaimed is reset to zero at the beginning of the big loop above,
so most of the time pgdat->kswapd_failures is increased.

> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>  
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> @@ -3785,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>  		return NODE_RECLAIM_FULL;
>  
> -	if (!pgdat_reclaimable(pgdat))
> -		return NODE_RECLAIM_FULL;
> -
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff39a2e..ff16cdc15df2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1422,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  node_unreclaimable:  %u"
>  		   "\n  start_pfn:           %lu"
>  		   "\n  node_inactive_ratio: %u",
> -		   !pgdat_reclaimable(zone->zone_pgdat),
> +		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
>  		   zone->zone_start_pfn,
>  		   zone->zone_pgdat->inactive_ratio);
>  	seq_putc(m, '\n');
> -- 
> 2.11.1
> 

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
@ 2017-03-03  1:26     ` Minchan Kim
  0 siblings, 0 replies; 80+ messages in thread
From: Minchan Kim @ 2017-03-03  1:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Jia He, Michal Hocko, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

Hi Johannes,

On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> Jia He reports a problem with kswapd spinning at 100% CPU when
> requesting more hugepages than memory available in the system:
> 
> $ echo 4000 >/proc/sys/vm/nr_hugepages
> 
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> 
> At that time, there are no reclaimable pages left in the node, but as
> kswapd fails to restore the high watermarks it refuses to go to sleep.
> 
> Kswapd needs to back away from nodes that fail to balance. Up until
> 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> kswapd had such a mechanism. It considered zones whose theoretically
> reclaimable pages it had reclaimed six times over as unreclaimable and
> backed away from them. This guard was erroneously removed as the patch
> changed the definition of a balanced node.
> 
> However, simply restoring this code wouldn't help in the case reported
> here: there *are* no reclaimable pages that could be scanned until the
> threshold is met. Kswapd would stay awake anyway.
> 
> Introduce a new and much simpler way of backing off. If kswapd runs
> through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> page, make it back off from the node. This is the same number of shots
> direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> that node until a direct reclaimer manages to reclaim some pages, thus
> proving the node reclaimable again.
> 
> v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> 
> Reported-by: Jia He <hejianet@gmail.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Tested-by: Jia He <hejianet@gmail.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  2 ++
>  mm/internal.h          |  6 ++++++
>  mm/page_alloc.c        |  9 ++-------
>  mm/vmscan.c            | 27 ++++++++++++++++++++-------
>  mm/vmstat.c            |  2 +-
>  5 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8e02b3750fe0..d2c50ab6ae40 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -630,6 +630,8 @@ typedef struct pglist_data {
>  	int kswapd_order;
>  	enum zone_type kswapd_classzone_idx;
>  
> +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> +
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
>  	enum zone_type kcompactd_classzone_idx;
> diff --git a/mm/internal.h b/mm/internal.h
> index ccfc2a2969f4..aae93e3fd984 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
>  extern unsigned long highest_memmap_pfn;
>  
>  /*
> + * Maximum number of reclaim retries without progress before the OOM
> + * killer is consider the only way forward.
> + */
> +#define MAX_RECLAIM_RETRIES 16
> +
> +/*
>   * in mm/vmscan.c:
>   */
>  extern int isolate_lru_page(struct page *page);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 614cd0397ce3..f50e36e7b024 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  }
>  
>  /*
> - * Maximum number of reclaim retries without any progress before OOM killer
> - * is consider as the only way to move forward.
> - */
> -#define MAX_RECLAIM_RETRIES 16
> -
> -/*
>   * Checks whether it makes sense to retry the reclaim to make a forward progress
>   * for the given allocation request.
>   * The reclaim feedback represented by did_some_progress (any progress during
> @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>  			node_page_state(pgdat, NR_PAGES_SCANNED),
> -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> +				"yes" : "no");
>  	}
>  
>  	for_each_populated_zone(zone) {
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..407b27831ff7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>  					 sc->nr_scanned - nr_scanned, sc));
>  
> +	/*
> +	 * Kswapd gives up on balancing particular nodes after too
> +	 * many failures to reclaim anything from them and goes to
> +	 * sleep. On reclaim progress, reset the failure counter. A
> +	 * successful direct reclaim run will revive a dormant kswapd.
> +	 */
> +	if (reclaimable)
> +		pgdat->kswapd_failures = 0;
> +
>  	return reclaimable;
>  }
>  
> @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  						 GFP_KERNEL | __GFP_HARDWALL))
>  				continue;
>  
> -			if (sc->priority != DEF_PRIORITY &&
> -			    !pgdat_reclaimable(zone->zone_pgdat))
> -				continue;	/* Let kswapd poll it */
> -
>  			/*
>  			 * If we already have plenty of memory free for
>  			 * compaction in this zone, don't free any more.
> @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>  
> +	/* Hopeless node, leave it to direct reclaim */

I'd like to clarify what we expect from deferring the job to direct
reclaim. Direct reclaim is a much more limited reclaim worker,
constrained by several things (e.g., it avoids writeback to prevent
stack overflow, and it may run in NOIO|NOFS context), so what do we
want the direct reclaimer to do when even kswapd cannot make forward
progress? OOM?

> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return true;
> +
>  	for (i = 0; i <= classzone_idx; i++) {
>  		struct zone *zone = pgdat->node_zones + i;
>  
> @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>  
> +	if (!sc.nr_reclaimed)
> +		pgdat->kswapd_failures++;

sc.nr_reclaimed is reset to zero at the beginning of the big loop
above, so most of the time pgdat->kswapd_failures gets increased.

> +
>  out:
>  	/*
>  	 * Return the order kswapd stopped reclaiming at as
> @@ -3515,6 +3527,10 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
>  	if (!waitqueue_active(&pgdat->kswapd_wait))
>  		return;
>  
> +	/* Hopeless node, leave it to direct reclaim */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		return;
> +
>  	/* Only wake kswapd if all zones are unbalanced */
>  	for (z = 0; z <= classzone_idx; z++) {
>  		zone = pgdat->node_zones + z;
> @@ -3785,9 +3801,6 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>  	    sum_zone_node_page_state(pgdat->node_id, NR_SLAB_RECLAIMABLE) <= pgdat->min_slab_pages)
>  		return NODE_RECLAIM_FULL;
>  
> -	if (!pgdat_reclaimable(pgdat))
> -		return NODE_RECLAIM_FULL;
> -
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 69f9aff39a2e..ff16cdc15df2 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1422,7 +1422,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>  		   "\n  node_unreclaimable:  %u"
>  		   "\n  start_pfn:           %lu"
>  		   "\n  node_inactive_ratio: %u",
> -		   !pgdat_reclaimable(zone->zone_pgdat),
> +		   pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES,
>  		   zone->zone_start_pfn,
>  		   zone->zone_pgdat->inactive_ratio);
>  	seq_putc(m, '\n');
> -- 
> 2.11.1
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-03  1:26     ` Minchan Kim
@ 2017-03-03  7:59       ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-03  7:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> Hi Johannes,
> 
> On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > Jia He reports a problem with kswapd spinning at 100% CPU when
> > requesting more hugepages than memory available in the system:
> > 
> > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > 
> > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > 
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > 
> > At that time, there are no reclaimable pages left in the node, but as
> > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > 
> > Kswapd needs to back away from nodes that fail to balance. Up until
> > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > kswapd had such a mechanism. It considered zones whose theoretically
> > reclaimable pages it had reclaimed six times over as unreclaimable and
> > backed away from them. This guard was erroneously removed as the patch
> > changed the definition of a balanced node.
> > 
> > However, simply restoring this code wouldn't help in the case reported
> > here: there *are* no reclaimable pages that could be scanned until the
> > threshold is met. Kswapd would stay awake anyway.
> > 
> > Introduce a new and much simpler way of backing off. If kswapd runs
> > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > page, make it back off from the node. This is the same number of shots
> > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > that node until a direct reclaimer manages to reclaim some pages, thus
> > proving the node reclaimable again.
> > 
> > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > 
> > Reported-by: Jia He <hejianet@gmail.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Tested-by: Jia He <hejianet@gmail.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  include/linux/mmzone.h |  2 ++
> >  mm/internal.h          |  6 ++++++
> >  mm/page_alloc.c        |  9 ++-------
> >  mm/vmscan.c            | 27 ++++++++++++++++++++-------
> >  mm/vmstat.c            |  2 +-
> >  5 files changed, 31 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 8e02b3750fe0..d2c50ab6ae40 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> >  	int kswapd_order;
> >  	enum zone_type kswapd_classzone_idx;
> >  
> > +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > +
> >  #ifdef CONFIG_COMPACTION
> >  	int kcompactd_max_order;
> >  	enum zone_type kcompactd_classzone_idx;
> > diff --git a/mm/internal.h b/mm/internal.h
> > index ccfc2a2969f4..aae93e3fd984 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
> >  extern unsigned long highest_memmap_pfn;
> >  
> >  /*
> > + * Maximum number of reclaim retries without progress before the OOM
> > + * killer is consider the only way forward.
> > + */
> > +#define MAX_RECLAIM_RETRIES 16
> > +
> > +/*
> >   * in mm/vmscan.c:
> >   */
> >  extern int isolate_lru_page(struct page *page);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 614cd0397ce3..f50e36e7b024 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  }
> >  
> >  /*
> > - * Maximum number of reclaim retries without any progress before OOM killer
> > - * is consider as the only way to move forward.
> > - */
> > -#define MAX_RECLAIM_RETRIES 16
> > -
> > -/*
> >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> >   * for the given allocation request.
> >   * The reclaim feedback represented by did_some_progress (any progress during
> > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> >  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> >  			node_page_state(pgdat, NR_PAGES_SCANNED),
> > -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> > +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > +				"yes" : "no");
> >  	}
> >  
> >  	for_each_populated_zone(zone) {
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 26c3b405ef34..407b27831ff7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> >  					 sc->nr_scanned - nr_scanned, sc));
> >  
> > +	/*
> > +	 * Kswapd gives up on balancing particular nodes after too
> > +	 * many failures to reclaim anything from them and goes to
> > +	 * sleep. On reclaim progress, reset the failure counter. A
> > +	 * successful direct reclaim run will revive a dormant kswapd.
> > +	 */
> > +	if (reclaimable)
> > +		pgdat->kswapd_failures = 0;
> > +
> >  	return reclaimable;
> >  }
> >  
> > @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> >  						 GFP_KERNEL | __GFP_HARDWALL))
> >  				continue;
> >  
> > -			if (sc->priority != DEF_PRIORITY &&
> > -			    !pgdat_reclaimable(zone->zone_pgdat))
> > -				continue;	/* Let kswapd poll it */
> > -
> >  			/*
> >  			 * If we already have plenty of memory free for
> >  			 * compaction in this zone, don't free any more.
> > @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> >  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
> >  		wake_up_all(&pgdat->pfmemalloc_wait);
> >  
> > +	/* Hopeless node, leave it to direct reclaim */
> 
> I'd like to clarify what we expect from deferring the job to direct
> reclaim. Direct reclaim is a much more limited reclaim worker,
> constrained by several things (e.g., it avoids writeback to prevent
> stack overflow, and it may run in NOIO|NOFS context),

This is true, but if kswapd cannot reclaim anything at all then we do
not have much choice left.

> so what do we
> want the direct reclaimer to do when even kswapd cannot make forward
> progress? OOM?

Yes, resp. back off for costly high-order requests and leave the node
unbalanced.

> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +		return true;
> > +
> >  	for (i = 0; i <= classzone_idx; i++) {
> >  		struct zone *zone = pgdat->node_zones + i;
> >  
> > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  			sc.priority--;
> >  	} while (sc.priority >= 1);
> >  
> > +	if (!sc.nr_reclaimed)
> > +		pgdat->kswapd_failures++;
> 
> sc.nr_reclaimed is reset to zero at the beginning of the big loop
> above, so most of the time pgdat->kswapd_failures gets increased.

But then we increase the counter in kswapd_shrink_node, or do I miss
your point? Are you suggesting using the aggregate nr_reclaimed over
all priorities because the last round might have made no progress?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-03  7:59       ` Michal Hocko
@ 2017-03-06  1:37         ` Minchan Kim
  -1 siblings, 0 replies; 80+ messages in thread
From: Minchan Kim @ 2017-03-06  1:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

Hi Michal,

On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > Jia He reports a problem with kswapd spinning at 100% CPU when
> > > requesting more hugepages than memory available in the system:
> > > 
> > > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > > 
> > > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > > 
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > > 
> > > At that time, there are no reclaimable pages left in the node, but as
> > > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > > 
> > > Kswapd needs to back away from nodes that fail to balance. Up until
> > > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > > kswapd had such a mechanism. It considered zones whose theoretically
> > > reclaimable pages it had reclaimed six times over as unreclaimable and
> > > backed away from them. This guard was erroneously removed as the patch
> > > changed the definition of a balanced node.
> > > 
> > > However, simply restoring this code wouldn't help in the case reported
> > > here: there *are* no reclaimable pages that could be scanned until the
> > > threshold is met. Kswapd would stay awake anyway.
> > > 
> > > Introduce a new and much simpler way of backing off. If kswapd runs
> > > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > > page, make it back off from the node. This is the same number of shots
> > > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > > that node until a direct reclaimer manages to reclaim some pages, thus
> > > proving the node reclaimable again.
> > > 
> > > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > > 
> > > Reported-by: Jia He <hejianet@gmail.com>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Tested-by: Jia He <hejianet@gmail.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  include/linux/mmzone.h |  2 ++
> > >  mm/internal.h          |  6 ++++++
> > >  mm/page_alloc.c        |  9 ++-------
> > >  mm/vmscan.c            | 27 ++++++++++++++++++++-------
> > >  mm/vmstat.c            |  2 +-
> > >  5 files changed, 31 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 8e02b3750fe0..d2c50ab6ae40 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> > >  	int kswapd_order;
> > >  	enum zone_type kswapd_classzone_idx;
> > >  
> > > +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > > +
> > >  #ifdef CONFIG_COMPACTION
> > >  	int kcompactd_max_order;
> > >  	enum zone_type kcompactd_classzone_idx;
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index ccfc2a2969f4..aae93e3fd984 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
> > >  extern unsigned long highest_memmap_pfn;
> > >  
> > >  /*
> > > + * Maximum number of reclaim retries without progress before the OOM
> > > + * killer is consider the only way forward.
> > > + */
> > > +#define MAX_RECLAIM_RETRIES 16
> > > +
> > > +/*
> > >   * in mm/vmscan.c:
> > >   */
> > >  extern int isolate_lru_page(struct page *page);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 614cd0397ce3..f50e36e7b024 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  }
> > >  
> > >  /*
> > > - * Maximum number of reclaim retries without any progress before OOM killer
> > > - * is consider as the only way to move forward.
> > > - */
> > > -#define MAX_RECLAIM_RETRIES 16
> > > -
> > > -/*
> > >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> > >   * for the given allocation request.
> > >   * The reclaim feedback represented by did_some_progress (any progress during
> > > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> > >  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> > >  			node_page_state(pgdat, NR_PAGES_SCANNED),
> > > -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> > > +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > > +				"yes" : "no");
> > >  	}
> > >  
> > >  	for_each_populated_zone(zone) {
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 26c3b405ef34..407b27831ff7 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> > >  					 sc->nr_scanned - nr_scanned, sc));
> > >  
> > > +	/*
> > > +	 * Kswapd gives up on balancing particular nodes after too
> > > +	 * many failures to reclaim anything from them and goes to
> > > +	 * sleep. On reclaim progress, reset the failure counter. A
> > > +	 * successful direct reclaim run will revive a dormant kswapd.
> > > +	 */
> > > +	if (reclaimable)
> > > +		pgdat->kswapd_failures = 0;
> > > +
> > >  	return reclaimable;
> > >  }
> > >  
> > > @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > >  						 GFP_KERNEL | __GFP_HARDWALL))
> > >  				continue;
> > >  
> > > -			if (sc->priority != DEF_PRIORITY &&
> > > -			    !pgdat_reclaimable(zone->zone_pgdat))
> > > -				continue;	/* Let kswapd poll it */
> > > -
> > >  			/*
> > >  			 * If we already have plenty of memory free for
> > >  			 * compaction in this zone, don't free any more.
> > > @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> > >  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
> > >  		wake_up_all(&pgdat->pfmemalloc_wait);
> > >  
> > > +	/* Hopeless node, leave it to direct reclaim */
> > 
> > I'd like to clarify what we expect from deferring the job to direct
> > reclaim. Direct reclaim is a much more limited reclaim worker,
> > constrained by several things (e.g., it avoids writeback to prevent
> > stack overflow, and it may run in NOIO|NOFS context),
> 
> This is true, but if kswapd cannot reclaim anything at all then we do
> not have much choice left.
> 
> > so what do we
> > want the direct reclaimer to do when even kswapd cannot make forward
> > progress? OOM?
> 
> Yes, resp. back off for costly high-order requests and leave the node
> unbalanced.

Okay, I just wanted to clear that up, because we have kept logic to
prevent direct reclaim from burning CPU on zones that are full of
unreclaimable pages, and this patch removes it in shrink_zones. It
could be optimized further by cutting direct reclaim short once the
kswapd failure count exceeds the threshold, so that we reach OOM fast
without pointless reclaim retries in the direct reclaim path, but I
guess that would be a rare case, so it's not worth optimizing.

commit 36fb7f8
Author: Andrew Morton <akpm@digeo.com>
Date:   Thu Nov 21 19:32:34 2002 -0800

    [PATCH] handle zones which are full of unreclaimable pages

> 
> > > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > > +		return true;
> > > +
> > >  	for (i = 0; i <= classzone_idx; i++) {
> > >  		struct zone *zone = pgdat->node_zones + i;
> > >  
> > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > >  			sc.priority--;
> > >  	} while (sc.priority >= 1);
> > >  
> > > +	if (!sc.nr_reclaimed)
> > > +		pgdat->kswapd_failures++;
> > 
> > sc.nr_reclaimed is reset to zero at the beginning of the big loop
> > above, so most of the time pgdat->kswapd_failures gets increased.
> 
> But then we increase the counter in kswapd_shrink_node, or do I miss
> your point? Are you suggesting using the aggregate nr_reclaimed over
> all priorities because the last round might have made no progress?

Yes.

Let's assume there is severe memory pressure, so there are fewer LRU
pages than the sum of the high watermarks of the eligible zones (also,
the user can configure a big watermark for a specific zone). In that
case, kswapd will raise the priority via kswapd_shrink_node's return
check even though it did reclaim a few pages.

Also, processes can consume the pages kswapd reclaimed in parallel
without entering the slow path, because they allocate against the
*low* watermark. So there would be no chance to reset kswapd_failures
to zero until we go through the slow path.

Also, even when we enter direct reclaim's slow path, it cannot wake
kswapd up until it makes forward progress, which is the condition for
resetting kswapd_failures, and the direct reclaimer's context is
easily limited with NO_[FS|IO], so sometimes it would be hard to make
forward progress.

We can rule out that situation easily by simply aggregating
nr_reclaimed in balance_pgdat. Why not?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
@ 2017-03-06  1:37         ` Minchan Kim
  0 siblings, 0 replies; 80+ messages in thread
From: Minchan Kim @ 2017-03-06  1:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

Hi Michal,

On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > Hi Johannes,
> > 
> > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > Jia He reports a problem with kswapd spinning at 100% CPU when
> > > requesting more hugepages than memory available in the system:
> > > 
> > > $ echo 4000 >/proc/sys/vm/nr_hugepages
> > > 
> > > top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> > > Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> > > %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> > > KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> > > 
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
> > >    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
> > > 
> > > At that time, there are no reclaimable pages left in the node, but as
> > > kswapd fails to restore the high watermarks it refuses to go to sleep.
> > > 
> > > Kswapd needs to back away from nodes that fail to balance. Up until
> > > 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of nodes")
> > > kswapd had such a mechanism. It considered zones whose theoretically
> > > reclaimable pages it had reclaimed six times over as unreclaimable and
> > > backed away from them. This guard was erroneously removed as the patch
> > > changed the definition of a balanced node.
> > > 
> > > However, simply restoring this code wouldn't help in the case reported
> > > here: there *are* no reclaimable pages that could be scanned until the
> > > threshold is met. Kswapd would stay awake anyway.
> > > 
> > > Introduce a new and much simpler way of backing off. If kswapd runs
> > > through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
> > > page, make it back off from the node. This is the same number of shots
> > > direct reclaim takes before declaring OOM. Kswapd will go to sleep on
> > > that node until a direct reclaimer manages to reclaim some pages, thus
> > > proving the node reclaimable again.
> > > 
> > > v2: move MAX_RECLAIM_RETRIES to mm/internal.h (Michal)
> > > 
> > > Reported-by: Jia He <hejianet@gmail.com>
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Tested-by: Jia He <hejianet@gmail.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  include/linux/mmzone.h |  2 ++
> > >  mm/internal.h          |  6 ++++++
> > >  mm/page_alloc.c        |  9 ++-------
> > >  mm/vmscan.c            | 27 ++++++++++++++++++++-------
> > >  mm/vmstat.c            |  2 +-
> > >  5 files changed, 31 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 8e02b3750fe0..d2c50ab6ae40 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -630,6 +630,8 @@ typedef struct pglist_data {
> > >  	int kswapd_order;
> > >  	enum zone_type kswapd_classzone_idx;
> > >  
> > > +	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > > +
> > >  #ifdef CONFIG_COMPACTION
> > >  	int kcompactd_max_order;
> > >  	enum zone_type kcompactd_classzone_idx;
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index ccfc2a2969f4..aae93e3fd984 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -81,6 +81,12 @@ static inline void set_page_refcounted(struct page *page)
> > >  extern unsigned long highest_memmap_pfn;
> > >  
> > >  /*
> > > + * Maximum number of reclaim retries without progress before the OOM
> > > + * killer is considered the only way forward.
> > > + */
> > > +#define MAX_RECLAIM_RETRIES 16
> > > +
> > > +/*
> > >   * in mm/vmscan.c:
> > >   */
> > >  extern int isolate_lru_page(struct page *page);
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 614cd0397ce3..f50e36e7b024 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3516,12 +3516,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  }
> > >  
> > >  /*
> > > - * Maximum number of reclaim retries without any progress before OOM killer
> > > - * is consider as the only way to move forward.
> > > - */
> > > -#define MAX_RECLAIM_RETRIES 16
> > > -
> > > -/*
> > >   * Checks whether it makes sense to retry the reclaim to make a forward progress
> > >   * for the given allocation request.
> > >   * The reclaim feedback represented by did_some_progress (any progress during
> > > @@ -4527,7 +4521,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > >  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> > >  			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> > >  			node_page_state(pgdat, NR_PAGES_SCANNED),
> > > -			!pgdat_reclaimable(pgdat) ? "yes" : "no");
> > > +			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> > > +				"yes" : "no");
> > >  	}
> > >  
> > >  	for_each_populated_zone(zone) {
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 26c3b405ef34..407b27831ff7 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2626,6 +2626,15 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> > >  	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
> > >  					 sc->nr_scanned - nr_scanned, sc));
> > >  
> > > +	/*
> > > +	 * Kswapd gives up on balancing particular nodes after too
> > > +	 * many failures to reclaim anything from them and goes to
> > > +	 * sleep. On reclaim progress, reset the failure counter. A
> > > +	 * successful direct reclaim run will revive a dormant kswapd.
> > > +	 */
> > > +	if (reclaimable)
> > > +		pgdat->kswapd_failures = 0;
> > > +
> > >  	return reclaimable;
> > >  }
> > >  
> > > @@ -2700,10 +2709,6 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > >  						 GFP_KERNEL | __GFP_HARDWALL))
> > >  				continue;
> > >  
> > > -			if (sc->priority != DEF_PRIORITY &&
> > > -			    !pgdat_reclaimable(zone->zone_pgdat))
> > > -				continue;	/* Let kswapd poll it */
> > > -
> > >  			/*
> > >  			 * If we already have plenty of memory free for
> > >  			 * compaction in this zone, don't free any more.
> > > @@ -3134,6 +3139,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> > >  	if (waitqueue_active(&pgdat->pfmemalloc_wait))
> > >  		wake_up_all(&pgdat->pfmemalloc_wait);
> > >  
> > > +	/* Hopeless node, leave it to direct reclaim */
> > 
> > I hope to clarify what we expect from deferring the job to direct reclaim.
> > Direct reclaim is a much more limited reclaim worker, constrained by several
> > things (e.g., it avoids writeback because of stack overflow risk, and may
> > run in NOIO|NOFS context)
> 
> This is true but if kswapd cannot reclaim anything at all then we do not
> have much choice left
> 
> > so what do we want the direct reclaimer to do when even kswapd cannot
> > make forward progress? OOM?
> 
> yes resp. back off for costly high order requests and leave the node
> unbalanced.

Okay, I just wanted to clarify it because we have long kept logic to
prevent direct reclaim from burning CPU on nodes full of unreclaimable
pages, and this patch removes that logic in shrink_zones. It could be
optimized further by cutting direct reclaim off once the kswapd failure
count is above the threshold, so we reach OOM fast without pointless
reclaim retries in the direct reclaim path, but I guess that is a rare
case and not worth optimizing.

commit 36fb7f8
Author: Andrew Morton <akpm@digeo.com>
Date:   Thu Nov 21 19:32:34 2002 -0800

    [PATCH] handle zones which are full of unreclaimable pages
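
For illustration, the backoff bookkeeping this patch introduces can be
modeled in userspace. This is a minimal standalone sketch, not kernel code;
`struct pgdat_model` and `account_reclaim_run()` are illustrative names,
while `MAX_RECLAIM_RETRIES` and the reset/increment behavior mirror the
hunks quoted above:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

/* Standalone model of the field the patch adds to pglist_data. */
struct pgdat_model {
	int kswapd_failures;	/* Number of 'reclaimed == 0' runs */
};

/* Mirrors the prepare_kswapd_sleep() hunk: after too many failed
 * runs the node is treated as hopeless and kswapd backs off. */
static bool node_is_hopeless(const struct pgdat_model *pgdat)
{
	return pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES;
}

/* One reclaim run: any progress resets the counter (as in the
 * shrink_node() hunk), zero progress bumps it (as in balance_pgdat()). */
static void account_reclaim_run(struct pgdat_model *pgdat,
				unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		pgdat->kswapd_failures = 0;
	else
		pgdat->kswapd_failures++;
}
```

A dormant node stays hopeless until some run, e.g. a successful direct
reclaim, makes progress and resets the counter.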

> 
> > > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > > +		return true;
> > > +
> > >  	for (i = 0; i <= classzone_idx; i++) {
> > >  		struct zone *zone = pgdat->node_zones + i;
> > >  
> > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > >  			sc.priority--;
> > >  	} while (sc.priority >= 1);
> > >  
> > > +	if (!sc.nr_reclaimed)
> > > +		pgdat->kswapd_failures++;
> > 
> > sc.nr_reclaimed is reset to zero at the beginning of the big loop above, so
> > most of the time pgdat->kswapd_failures is increased.
> 
> But then we increase the counter in kswapd_shrink_node or do I miss your
> point? Are you suggesting to use the aggregate nr_reclaimed over all
> priorities because the last round might have made no progress?

Yes.

Let's assume severe memory pressure, so there are fewer LRU pages than
the sum of the high watermarks of the eligible zones (and a user can
also configure a large watermark for a specific zone). In that case,
kswapd will raise the priority via kswapd_shrink_node's return check
even though it reclaims a few pages at each priority.

Also, processes can consume the pages kswapd reclaims in parallel
without entering the slow path, because allocation only checks the
*low* watermark. So there would be no chance to reset kswapd_failures
to zero until someone hits the slow path.

And even when a task enters direct reclaim's slow path, it cannot wake
kswapd up until it makes forward progress, which is the condition for
resetting kswapd_failures; a direct reclaimer's context is easily
limited by NOFS|NOIO, so at times it would be hard to make that
progress.

We can rule out that situation simply by aggregating nr_reclaimed in
balance_pgdat. Why not?
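
The difference described above reduces to last-iteration versus
cumulative accounting. A standalone sketch (illustrative names; the
array stands in for the per-priority reclaim results of the
balance_pgdat() loop):

```c
#include <assert.h>
#include <stdbool.h>

/* Buggy variant: mimics resetting sc.nr_reclaimed each iteration,
 * so only the final (lowest) priority's count survives the loop. */
static bool failed_last_only(const unsigned long *reclaimed, int n)
{
	unsigned long nr = 0;
	int i;

	for (i = 0; i < n; i++)
		nr = reclaimed[i];	/* overwritten every iteration */
	return nr == 0;
}

/* Fixed variant: aggregate progress across all priorities, so a run
 * that reclaimed anything at any priority does not count as a failure. */
static bool failed_cumulative(const unsigned long *reclaimed, int n)
{
	unsigned long total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += reclaimed[i];
	return total == 0;
}
```

With progress at an early priority and none at the last one, the first
variant falsely flags a failed run while the second does not.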

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-06  1:37         ` Minchan Kim
@ 2017-03-06 16:24           ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-03-06 16:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > > >  			sc.priority--;
> > > >  	} while (sc.priority >= 1);
> > > >  
> > > > +	if (!sc.nr_reclaimed)
> > > > +		pgdat->kswapd_failures++;
> > > 
> > > sc.nr_reclaimed is reset to zero at the beginning of the big loop above, so
> > > most of the time pgdat->kswapd_failures is increased.

That wasn't intentional; I didn't see the sc.nr_reclaimed reset.

---

From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 6 Mar 2017 10:53:59 -0500
Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix

Check kswapd failure against the cumulative nr_reclaimed count, not
against the count from the lowest priority iteration.

Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ddcff8a11c1e..b834b2dd4e19 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	count_vm_event(PAGEOUTRUN);
 
 	do {
+		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
 
-		sc.nr_reclaimed = 0;
 		sc.reclaim_idx = classzone_idx;
 
 		/*
@@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-		if (raise_priority || !sc.nr_reclaimed)
+		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
-- 
2.11.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-06 16:24           ` Johannes Weiner
@ 2017-03-07  0:59             ` Hillf Danton
  -1 siblings, 0 replies; 80+ messages in thread
From: Hillf Danton @ 2017-03-07  0:59 UTC (permalink / raw)
  To: 'Johannes Weiner', 'Minchan Kim'
  Cc: 'Michal Hocko', 'Andrew Morton', 'Jia He',
	'Mel Gorman',
	linux-mm, linux-kernel, kernel-team


On March 07, 2017 12:24 AM Johannes Weiner wrote: 
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > > > >  			sc.priority--;
> > > > >  	} while (sc.priority >= 1);
> > > > >
> > > > > +	if (!sc.nr_reclaimed)
> > > > > +		pgdat->kswapd_failures++;
> > > >
> > > > sc.nr_reclaimed is reset to zero at the beginning of the big loop above, so
> > > > most of the time pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	count_vm_event(PAGEOUTRUN);
> 
>  	do {
> +		unsigned long nr_reclaimed = sc.nr_reclaimed;
>  		bool raise_priority = true;
> 
> -		sc.nr_reclaimed = 0;

This has another effect: we'll reclaim fewer pages than we currently
do when balancing for a high-order request. It looks worth including
that information in the changelog as well.

>  		sc.reclaim_idx = classzone_idx;
> 
>  		/*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * Raise priority if scanning rate is too low or there was no
>  		 * progress in reclaiming pages
>  		 */
> -		if (raise_priority || !sc.nr_reclaimed)
> +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> +		if (raise_priority || !nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
> 
> --
> 2.11.1

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-06 16:24           ` Johannes Weiner
@ 2017-03-07  7:28             ` Minchan Kim
  -1 siblings, 0 replies; 80+ messages in thread
From: Minchan Kim @ 2017-03-07  7:28 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Mon, Mar 06, 2017 at 11:24:10AM -0500, Johannes Weiner wrote:
> On Mon, Mar 06, 2017 at 10:37:40AM +0900, Minchan Kim wrote:
> > On Fri, Mar 03, 2017 at 08:59:54AM +0100, Michal Hocko wrote:
> > > On Fri 03-03-17 10:26:09, Minchan Kim wrote:
> > > > On Tue, Feb 28, 2017 at 04:39:59PM -0500, Johannes Weiner wrote:
> > > > > @@ -3316,6 +3325,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > > > >  			sc.priority--;
> > > > >  	} while (sc.priority >= 1);
> > > > >  
> > > > > +	if (!sc.nr_reclaimed)
> > > > > +		pgdat->kswapd_failures++;
> > > > 
> > > > sc.nr_reclaimed is reset to zero at the beginning of the big loop above, so
> > > > most of the time pgdat->kswapd_failures is increased.
> 
> That wasn't intentional; I didn't see the sc.nr_reclaimed reset.
> 
> ---
> 
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-06 16:24           ` Johannes Weiner
@ 2017-03-07 10:17             ` Michal Hocko
  -1 siblings, 0 replies; 80+ messages in thread
From: Michal Hocko @ 2017-03-07 10:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team

On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
[...]
> From e126db716926ff353b35f3a6205bd5853e01877b Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 6 Mar 2017 10:53:59 -0500
> Subject: [PATCH] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes fix
> 
> Check kswapd failure against the cumulative nr_reclaimed count, not
> against the count from the lowest priority iteration.
> 
> Suggested-by: Minchan Kim <minchan@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ddcff8a11c1e..b834b2dd4e19 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3179,9 +3179,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	count_vm_event(PAGEOUTRUN);
>  
>  	do {
> +		unsigned long nr_reclaimed = sc.nr_reclaimed;
>  		bool raise_priority = true;
>  
> -		sc.nr_reclaimed = 0;
>  		sc.reclaim_idx = classzone_idx;
>  
>  		/*
> @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  		 * Raise priority if scanning rate is too low or there was no
>  		 * progress in reclaiming pages
>  		 */
> -		if (raise_priority || !sc.nr_reclaimed)
> +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> +		if (raise_priority || !nr_reclaimed)
>  			sc.priority--;
>  	} while (sc.priority >= 1);
>  

I would rather not play with the sc state here. From a quick look at
least 
	/*
	 * Fragmentation may mean that the system cannot be rebalanced for
	 * high-order allocations. If twice the allocation size has been
	 * reclaimed then recheck watermarks only at order-0 to prevent
	 * excessive reclaim. Assume that a process requested a high-order
	 * can direct reclaim/compact.
	 */
	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;

does rely on the value. Wouldn't something like the following be safer?
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c15b2e4c47ca..b731f24fed12 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3183,6 +3183,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		.may_unmap = 1,
 		.may_swap = 1,
 	};
+	bool reclaimable = false;
 	count_vm_event(PAGEOUTRUN);
 
 	do {
@@ -3274,6 +3275,9 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		if (try_to_freeze() || kthread_should_stop())
 			break;
 
+		if (sc.nr_reclaimed)
+			reclaimable = true;
+
 		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
@@ -3282,7 +3286,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			sc.priority--;
 	} while (sc.priority >= 1);
 
-	if (!sc.nr_reclaimed)
+	if (!reclaimable)
 		pgdat->kswapd_failures++;
 
 out:
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-07 10:17             ` Michal Hocko
@ 2017-03-07 16:56               ` Johannes Weiner
  -1 siblings, 0 replies; 80+ messages in thread
From: Johannes Weiner @ 2017-03-07 16:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Andrew Morton, Jia He, Mel Gorman, linux-mm,
	linux-kernel, kernel-team, Vlastimil Babka

On Tue, Mar 07, 2017 at 11:17:02AM +0100, Michal Hocko wrote:
> On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
> > @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  		 * Raise priority if scanning rate is too low or there was no
> >  		 * progress in reclaiming pages
> >  		 */
> > -		if (raise_priority || !sc.nr_reclaimed)
> > +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > +		if (raise_priority || !nr_reclaimed)
> >  			sc.priority--;
> >  	} while (sc.priority >= 1);
> >  
> 
> I would rather not play with the sc state here. From a quick look at
> least 
> 	/*
> 	 * Fragmentation may mean that the system cannot be rebalanced for
> 	 * high-order allocations. If twice the allocation size has been
> 	 * reclaimed then recheck watermarks only at order-0 to prevent
> 	 * excessive reclaim. Assume that a process requested a high-order
> 	 * can direct reclaim/compact.
> 	 */
> 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> 		sc->order = 0;
> 
> does rely on the value. Wouldn't something like the following be safer?

Well, what behavior is correct, though? This check looks like an
argument *against* resetting sc.nr_reclaimed.

If kswapd is woken up for a higher order, this check sets a reclaim
cutoff beyond which it should give up on the order and balance for 0.

That threshold is scoped to the whole kswapd invocation. Applying it
to the outcome of just the preceding priority seems like a mistake.
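
To make the scope question concrete, here is a standalone sketch of the
high-order cutoff under both accounting schemes. compact_gap() is reduced
to "twice the allocation size" per the quoted comment, and the
per-priority numbers are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>

#define DEF_PRIORITY 12

/* Simplified per the quoted comment: twice the allocation size, in pages. */
static unsigned long compact_gap(int order)
{
	return 2UL << order;
}

/* Returns true if the order-0 fallback would fire at some point in
 * the priority loop, given steady per-priority progress. */
static bool order_dropped(int order, unsigned long per_prio, bool cumulative)
{
	unsigned long nr_reclaimed = 0;
	int prio;

	for (prio = DEF_PRIORITY; prio >= 1; prio--) {
		if (!cumulative)
			nr_reclaimed = 0;	/* old per-iteration reset */
		nr_reclaimed += per_prio;
		if (order && nr_reclaimed >= compact_gap(order))
			return true;
	}
	return false;
}
```

With an order-3 request (gap of 16 pages) and 3 pages reclaimed per
priority, the cumulative scheme eventually trips the cutoff and falls
back to order-0 balancing, while the per-iteration reset never does.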

Mel? Vlastimil?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
  2017-03-07 16:56               ` Johannes Weiner
@ 2017-03-09 14:20                 ` Mel Gorman
  -1 siblings, 0 replies; 80+ messages in thread
From: Mel Gorman @ 2017-03-09 14:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Minchan Kim, Andrew Morton, Jia He, linux-mm,
	linux-kernel, kernel-team, Vlastimil Babka

On Tue, Mar 07, 2017 at 11:56:31AM -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 11:17:02AM +0100, Michal Hocko wrote:
> > On Mon 06-03-17 11:24:10, Johannes Weiner wrote:
> > > @@ -3271,7 +3271,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> > >  		 * Raise priority if scanning rate is too low or there was no
> > >  		 * progress in reclaiming pages
> > >  		 */
> > > -		if (raise_priority || !sc.nr_reclaimed)
> > > +		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
> > > +		if (raise_priority || !nr_reclaimed)
> > >  			sc.priority--;
> > >  	} while (sc.priority >= 1);
> > >  
> > 
> > I would rather not play with the sc state here. From a quick look at
> > least 
> > 	/*
> > 	 * Fragmentation may mean that the system cannot be rebalanced for
> > 	 * high-order allocations. If twice the allocation size has been
> > 	 * reclaimed then recheck watermarks only at order-0 to prevent
> > 	 * excessive reclaim. Assume that a process requested a high-order
> > 	 * can direct reclaim/compact.
> > 	 */
> > 	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
> > 		sc->order = 0;
> > 
> > does rely on the value. Wouldn't something like the following be safer?
> 
> Well, what behavior is correct, though? This check looks like an
> argument *against* resetting sc.nr_reclaimed.
> 
> If kswapd is woken up for a higher order, this check sets a reclaim
> cutoff beyond which it should give up on the order and balance for 0.
> 
> That's on the scope of the kswapd invocation. Applying this threshold
> to the outcome of just the preceding priority seems like a mistake.
> 
> Mel? Vlastimil?

I cannot say which is definitely the correct behaviour. The current
behaviour is conservative due to the historical concerns about kswapd
reclaiming the world. The hazard as I see it is that resetting it *may*
lead to more aggressive reclaim for high-order allocations. That may be a
welcome outcome to some that really want high-order pages and be unwelcome
to others that prefer pages to remain resident.

However, in this case it's a tight window and problems would be tricky to
detect. THP allocations won't trigger the behaviour and with vmalloc'd
stack, I'd expect that only SLUB-intensive workloads using high-order
pages would trigger any adverse behaviour. While I'm mildly concerned, I
would be a little surprised if it actually caused runaway reclaim.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 80+ messages in thread


end of thread, other threads:[~2017-03-09 14:35 UTC | newest]

Thread overview: 80+ messages
2017-02-28 21:39 [PATCH 0/9] mm: kswapd spinning on unreclaimable nodes - fixes and cleanups Johannes Weiner
2017-02-28 21:39 ` [PATCH 1/9] mm: fix 100% CPU kswapd busyloop on unreclaimable nodes Johannes Weiner
2017-03-02  3:23   ` Hillf Danton
2017-03-02 23:30   ` Shakeel Butt
2017-03-03  1:26   ` Minchan Kim
2017-03-03  7:59     ` Michal Hocko
2017-03-06  1:37       ` Minchan Kim
2017-03-06 16:24         ` Johannes Weiner
2017-03-07  0:59           ` Hillf Danton
2017-03-07  7:28           ` Minchan Kim
2017-03-07 10:17           ` Michal Hocko
2017-03-07 16:56             ` Johannes Weiner
2017-03-09 14:20               ` Mel Gorman
2017-02-28 21:40 ` [PATCH 2/9] mm: fix check for reclaimable pages in PF_MEMALLOC reclaim throttling Johannes Weiner
2017-03-01 15:02   ` Michal Hocko
2017-03-02  3:25   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 3/9] mm: remove seemingly spurious reclaimability check from laptop_mode gating Johannes Weiner
2017-03-01 15:06   ` Michal Hocko
2017-03-01 15:17   ` Mel Gorman
2017-03-02  3:27   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 4/9] mm: remove unnecessary reclaimability check from NUMA balancing target Johannes Weiner
2017-03-01 15:14   ` Michal Hocko
2017-03-02  3:28   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 5/9] mm: don't avoid high-priority reclaim on unreclaimable nodes Johannes Weiner
2017-03-01 15:21   ` Michal Hocko
2017-03-02  3:31   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 6/9] mm: don't avoid high-priority reclaim on memcg limit reclaim Johannes Weiner
2017-03-01 15:40   ` Michal Hocko
2017-03-01 17:36     ` Johannes Weiner
2017-03-01 19:13       ` Michal Hocko
2017-03-02  3:32   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 7/9] mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() Johannes Weiner
2017-03-01 15:41   ` Michal Hocko
2017-03-02  3:34   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 8/9] Revert "mm, vmscan: account for skipped pages as a partial scan" Johannes Weiner
2017-03-01 15:51   ` Michal Hocko
2017-03-02  3:36   ` Hillf Danton
2017-02-28 21:40 ` [PATCH 9/9] mm: remove unnecessary back-off function when retrying page reclaim Johannes Weiner
2017-03-01 14:56   ` Michal Hocko
2017-03-02  3:37   ` Hillf Danton
