* [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
@ 2017-02-15  9:22 ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-15  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM, Mel Gorman

This patchset is based on mmots as of Feb 9th, 2017. The baseline is
important as there are a number of kswapd-related fixes in that tree and
a comparison against v4.10-rc7 would be almost meaningless as a result.

The series is unusual in that the first patch fixes one problem but
introduces a host of other issues and is incomplete on its own. It was not
developed by me, but it appears to have been lost, so I picked it up and
expanded the changelog. Patch 2 makes a minor modification that is worth
considering on its own but leaves the kernel in a state where it behaves
badly. It's not until patch 3 that there is an improvement against the
baseline.

This was mostly motivated by examining Chris Mason's "simoop" benchmark,
which puts the VM under pressure similar to Hadoop. It has been reported
that the benchmark has regressed severely over the last several releases.
While I cannot reproduce all the problems Chris experienced due to
hardware limitations, there were a number of problems on a 2-socket
machine with a single disk.

                                         4.10.0-rc7            4.10.0-rc7
                                     mmots-20170209       keepawake-v1r25
Amean    p50-Read             22325202.49 (  0.00%) 22092755.48 (  1.04%)
Amean    p95-Read             26102988.80 (  0.00%) 26101849.04 (  0.00%)
Amean    p99-Read             30935176.53 (  0.00%) 29746220.52 (  3.84%)
Amean    p50-Write                 976.44 (  0.00%)      952.73 (  2.43%)
Amean    p95-Write               15471.29 (  0.00%)     3140.27 ( 79.70%)
Amean    p99-Write               35108.62 (  0.00%)     8843.73 ( 74.81%)
Amean    p50-Allocation          76382.61 (  0.00%)    76349.22 (  0.04%)
Amean    p95-Allocation         127777.39 (  0.00%)   108630.26 ( 14.98%)
Amean    p99-Allocation         187937.39 (  0.00%)   139094.26 ( 25.99%)

These are latencies. Read/write are threads reading fixed-size random
blocks from a simulated database. The allocation latency is that of
mmaping and faulting in regions of memory. The p50, p95 and p99 figures
report the worst latency experienced by 50%, 95% and 99% of the samples
respectively.

For example, the p99-Write line indicates that, while the test was running,
99% of writes completed 74.81% faster ((35108.62 - 8843.73) / 35108.62).
It's worth noting that on a UMA machine no difference in simoop performance
was observed, so mileage will vary.
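
Purely as an illustration of what the allocation latency measures (this is
not simoop itself, and the helper name and 64MB size below are arbitrary),
timing an anonymous mmap plus the faults needed to touch it might look like
this in userspace:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

/* Time one anonymous mmap() plus faulting in every page, in microseconds. */
static long mmap_fault_usec(size_t size)
{
	struct timespec start, end;
	char *p;

	clock_gettime(CLOCK_MONOTONIC, &start);
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	memset(p, 1, size);	/* touch every page to force faults */
	clock_gettime(CLOCK_MONOTONIC, &end);
	munmap(p, size);

	return (end.tv_sec - start.tv_sec) * 1000000L +
	       (end.tv_nsec - start.tv_nsec) / 1000L;
}

int main(void)
{
	printf("mmap+fault: %ld usec\n", mmap_fault_usec(64UL << 20));
	return 0;
}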

On UMA, there was a notable difference in the "stutter" benchmark which
measures the latency of mmap while large files are being copied. This has
been used as a proxy measure for desktop jitter while large amounts of IO
were taking place.

                            4.10.0-rc7            4.10.0-rc7
                        mmots-20170209          keepawake-v1
Min         mmap      6.3847 (  0.00%)      5.9785 (  6.36%)
1st-qrtle   mmap      7.6310 (  0.00%)      7.4086 (  2.91%)
2nd-qrtle   mmap      9.9959 (  0.00%)      7.7052 ( 22.92%)
3rd-qrtle   mmap     14.8180 (  0.00%)      8.5895 ( 42.03%)
Max-90%     mmap     15.8397 (  0.00%)     13.6974 ( 13.52%)
Max-93%     mmap     16.4268 (  0.00%)     14.3175 ( 12.84%)
Max-95%     mmap     18.3295 (  0.00%)     16.9233 (  7.67%)
Max-99%     mmap     24.2042 (  0.00%)     20.6182 ( 14.82%)
Max         mmap    255.0688 (  0.00%)    265.5818 ( -4.12%)
Mean        mmap     11.2192 (  0.00%)      9.1811 ( 18.17%)

Latency is measured in milliseconds. The results indicate that 99% of mmap
operations complete 14.82% faster and that mmap operations are 18.17% faster
on average with these patches applied.

Mel Gorman (2):
  mm, vmscan: Only clear pgdat congested/dirty/writeback state when
    balanced
  mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched
    classzone_idx

Shantanu Goel (1):
  mm, vmscan: fix zone balance check in prepare_kswapd_sleep

 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 128 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 72 insertions(+), 58 deletions(-)

-- 
2.11.0

* [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-15  9:22 ` Mel Gorman
@ 2017-02-15  9:22   ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-15  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM, Mel Gorman

From: Shantanu Goel <sgoel01@yahoo.com>

The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
since the latter will return as soon as any one of the zones in the
classzone is above the watermark.  This is especially important for higher
order allocations since balance_pgdat will typically reset the order to
zero, relying on compaction to create the higher order pages.  Without this
patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone
balance check fails.

On 4.9.7 kswapd is failing to wake up kcompactd due to a mismatch in the
zone balance check between balance_pgdat() and prepare_kswapd_sleep().
balance_pgdat() returns as soon as a single zone satisfies the allocation
but prepare_kswapd_sleep() requires all zones to do the same.  This causes
prepare_kswapd_sleep() to never succeed except in the order == 0 case and,
consequently, wakeup_kcompactd() is never called.  On my machine, prior to
applying this patch, the state of compaction from /proc/vmstat looked like
this after a day and a half of uptime:

compact_migrate_scanned 240496
compact_free_scanned 76238632
compact_isolated 123472
compact_stall 1791
compact_fail 29
compact_success 1762
compact_daemon_wake 0

After applying the patch and about 10 hours of uptime the state looks
like this:

compact_migrate_scanned 59927299
compact_free_scanned 2021075136
compact_isolated 640926
compact_stall 4
compact_fail 2
compact_success 2
compact_daemon_wake 5160

Further notes from Mel on what motivated him to pick this patch up and
resend it:

It was observed for the simoop workload (which pressures the VM similarly
to Hadoop) that kswapd was failing to keep ahead of direct reclaim. The
investigation noted a need to reconcile kswapd's decisions to reclaim with
its decisions to sleep. With this patch on a 2-socket box, there was a 43%
reduction in direct reclaim scanning.

However, the impact otherwise is extremely negative. Kswapd reclaim
efficiency dropped from 98% to 76%. simoop has three latency-related
metrics for read, write and allocation (an anonymous mmap and fault).

                                         4.10.0-rc7            4.10.0-rc7
                                     mmots-20170209           fixcheck-v1
Amean    p50-Read             22325202.49 (  0.00%) 20026926.55 ( 10.29%)
Amean    p95-Read             26102988.80 (  0.00%) 27023360.00 ( -3.53%)
Amean    p99-Read             30935176.53 (  0.00%) 30994432.00 ( -0.19%)
Amean    p50-Write                 976.44 (  0.00%)     1905.28 (-95.12%)
Amean    p95-Write               15471.29 (  0.00%)    36210.09 (-134.05%)
Amean    p99-Write               35108.62 (  0.00%)   479494.96 (-1265.75%)
Amean    p50-Allocation          76382.61 (  0.00%)    87603.20 (-14.69%)
Amean    p95-Allocation         127777.39 (  0.00%)   244491.38 (-91.34%)
Amean    p99-Allocation         187937.39 (  0.00%)  1745237.33 (-828.63%)

There are also more allocation stalls. One of the largest impacts was due
to pages written back from kswapd context rising from 0 to 4516642 pages
during the hour the workload ran. By and large, the patch has very bad
behaviour, but it is easily missed as the impact on a UMA machine is
negligible.

This patch is included with the data in case a bisection leads to this area.
This patch is also a prerequisite for the rest of the series.

Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..92fc66bd52bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3140,11 +3140,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (!zone_balanced(zone, order, classzone_idx))
-			return false;
+		if (zone_balanced(zone, order, classzone_idx))
+			return true;
 	}
 
-	return true;
+	return false;
 }
 
 /*
-- 
2.11.0

* [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced
  2017-02-15  9:22 ` Mel Gorman
@ 2017-02-15  9:22   ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-15  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM, Mel Gorman

A pgdat tracks whether recent reclaim encountered too many dirty, writeback
or congested pages. These flags control whether kswapd writes pages back
from reclaim context, whether it tags pages for immediate reclaim when IO
completes, whether processes block on wait_iff_congested and whether kswapd
blocks when too many pages marked for immediate reclaim are encountered.

The state is cleared in a check function with side-effects. With the patch
"mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the timing
of when the bits get cleared changed. Due to the way the check works,
it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA allocation
because it does not account for lowmem reserves properly.

For the simoop workload, kswapd is not stalling when it should due to the
premature clearing; it is writing pages from reclaim context like crazy and
generally being unhelpful.

This patch resets the pgdat bits related to page reclaim only when kswapd
is going to sleep. The comparison with simoop is then

                                         4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
                                     mmots-20170209           fixcheck-v1              clear-v1
Amean    p50-Read             22325202.49 (  0.00%) 20026926.55 ( 10.29%) 19491134.58 ( 12.69%)
Amean    p95-Read             26102988.80 (  0.00%) 27023360.00 ( -3.53%) 24294195.20 (  6.93%)
Amean    p99-Read             30935176.53 (  0.00%) 30994432.00 ( -0.19%) 30397053.16 (  1.74%)
Amean    p50-Write                 976.44 (  0.00%)     1905.28 (-95.12%)     1077.22 (-10.32%)
Amean    p95-Write               15471.29 (  0.00%)    36210.09 (-134.05%)    36419.56 (-135.40%)
Amean    p99-Write               35108.62 (  0.00%)   479494.96 (-1265.75%)   102000.36 (-190.53%)
Amean    p50-Allocation          76382.61 (  0.00%)    87603.20 (-14.69%)    87485.22 (-14.54%)
Amean    p95-Allocation         127777.39 (  0.00%)   244491.38 (-91.34%)   204588.52 (-60.11%)
Amean    p99-Allocation         187937.39 (  0.00%)  1745237.33 (-828.63%)   631657.74 (-236.10%)

Read latency is improved although write and allocation latency are
impacted.  Even with the patch, kswapd is still reclaiming inefficiently,
pages are being written back from reclaim context and there are a host of
other issues. However, given the change, it needed to be spelled out why
the side-effect was moved.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 92fc66bd52bc..b47b430ca7ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3097,17 +3097,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
 		return false;
 
-	/*
-	 * If any eligible zone is balanced then the node is not considered
-	 * to be congested or dirty
-	 */
-	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
-	clear_bit(PGDAT_WRITEBACK, &zone->zone_pgdat->flags);
-
 	return true;
 }
 
+/* Clear pgdat state for congested, dirty or under writeback. */
+static void clear_pgdat_congested(pg_data_t *pgdat)
+{
+	clear_bit(PGDAT_CONGESTED, &pgdat->flags);
+	clear_bit(PGDAT_DIRTY, &pgdat->flags);
+	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
+}
+
 /*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
@@ -3140,8 +3140,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, classzone_idx))
+		if (zone_balanced(zone, order, classzone_idx)) {
+			clear_pgdat_congested(pgdat);
 			return true;
+		}
 	}
 
 	return false;
-- 
2.11.0

* [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-15  9:22 ` Mel Gorman
@ 2017-02-15  9:22   ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-15  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM, Mel Gorman

kswapd is woken to reclaim a node based on a failed allocation request
from any eligible zone. Once reclaiming in balance_pgdat(), it will
continue reclaiming until there is an eligible zone available for the
classzone it was woken for. kswapd tracks the classzone it was recently
woken for in pgdat->kswapd_classzone_idx. If it has not been woken
recently, this will be 0.

However, the decision on whether to sleep is made on kswapd_classzone_idx,
which is 0 without a recent wakeup request, and that classzone does not
account for lowmem reserves.  This allows kswapd to sleep when a small
lower zone such as ZONE_DMA is balanced for a GFP_DMA request even if
a stream of allocations cannot use that zone. While kswapd may be woken
again shortly, there are two consequences -- the pgdat bits that control
congestion are cleared prematurely and direct reclaim is more likely
because kswapd slept prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
index) when there have been no recent wakeups. If there are no wakeups,
it'll decide whether to sleep based on the highest possible zone available
(MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
decisions during reclaim and when deciding to sleep are the same. If there is
a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches
regressed the workload so they are included as the second set of results;
otherwise this patch would look artificially excellent.

                                         4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
                                     mmots-20170209           clear-v1r25       keepawake-v1r25
Amean    p50-Read             22325202.49 (  0.00%) 19491134.58 ( 12.69%) 22092755.48 (  1.04%)
Amean    p95-Read             26102988.80 (  0.00%) 24294195.20 (  6.93%) 26101849.04 (  0.00%)
Amean    p99-Read             30935176.53 (  0.00%) 30397053.16 (  1.74%) 29746220.52 (  3.84%)
Amean    p50-Write                 976.44 (  0.00%)     1077.22 (-10.32%)      952.73 (  2.43%)
Amean    p95-Write               15471.29 (  0.00%)    36419.56 (-135.40%)     3140.27 ( 79.70%)
Amean    p99-Write               35108.62 (  0.00%)   102000.36 (-190.53%)     8843.73 ( 74.81%)
Amean    p50-Allocation          76382.61 (  0.00%)    87485.22 (-14.54%)    76349.22 (  0.04%)
Amean    p95-Allocation         127777.39 (  0.00%)   204588.52 (-60.11%)   108630.26 ( 14.98%)
Amean    p99-Allocation         187937.39 (  0.00%)   631657.74 (-236.10%)   139094.26 ( 25.99%)

With this patch on top, all the latencies relative to the baseline are
improved, particularly write latencies. The read latencies are still high
for the number of threads but it's worth noting that this is mostly due
to the IO scheduler and not directly related to reclaim. The vmstats are
a bit of a mix but the relevant ones are as follows:

                            4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
                          mmots-20170209 clear-v1r25 keepawake-v1r25
Swap Ins                             0           0           0
Swap Outs                            0         608           0
Direct pages scanned           6910672     3132699     6357298
Kswapd pages scanned          57036946    82488665    56986286
Kswapd pages reclaimed        55993488    63474329    55939113
Direct pages reclaimed         6905990     2964843     6352115
Kswapd efficiency                  98%         76%         98%
Kswapd velocity              12494.375   17597.507   12488.065
Direct efficiency                  99%         94%         99%
Direct velocity               1513.835     668.306    1393.148
Page writes by reclaim           0.000 4410243.000       0.000
Page writes file                     0     4409635           0
Page writes anon                     0         608           0
Page reclaim immediate         1036792    14175203     1042571
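
(Kswapd and direct efficiency above are pages reclaimed as a fraction of
pages scanned by the respective reclaimer, e.g. 55993488/57036946 is
roughly 98% for the baseline; the velocity figures are the corresponding
scan rates over the duration of the test.)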

Swap-outs are equivalent to the baseline.
Direct reclaim is reduced but not eliminated. It's worth noting
	that there are two periods of direct reclaim for this workload. The
	first is when it switches from preparing the files to running the
	actual test itself. It's a lot of file IO followed by a lot of
	allocations that reclaim heavily for a brief window. After that,
	direct reclaim is intermittent when the workload spawns a number of
	threads periodically to do work. kswapd simply cannot wake and
	reclaim fast enough between the low and min watermarks. It could
	be mitigated using vm.watermark_scale_factor (see the note below)
	but not through special tricks in kswapd.
Page writes from reclaim context are at 0, which is ideal.
Pages immediately reclaimed after IO completes are back at the baseline.
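
For reference, vm.watermark_scale_factor (documented in
Documentation/sysctl/vm.txt) controls the gap between a zone's min, low
and high watermarks in units of fractions of 10,000 of node memory
(default 10, i.e. 0.1%). Raising it wakes kswapd earlier and keeps it
reclaiming for longer before the high watermark is reached, at the cost
of keeping more memory free.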

On UMA, there is almost no change so this is not expected to be a universal
win.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 118 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 66 insertions(+), 54 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 11581f4cfbb4..481aebc91782 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,7 +1209,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 		/* Reset the nr_zones, order and classzone_idx before reuse */
 		pgdat->nr_zones = 0;
 		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = 0;
+		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	}
 
 	/* we can use NODE_DATA(nid) from here */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b47b430ca7ea..3bd1ddee09cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3090,14 +3090,36 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
+/*
+ * Returns true if there is an eligible zone balanced for the request order
+ * and classzone_idx
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
-	unsigned long mark = high_wmark_pages(zone);
+	int i;
+	unsigned long mark = -1;
+	struct zone *zone;
 
-	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
-		return false;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
 
-	return true;
+		if (!managed_zone(zone))
+			continue;
+
+		mark = high_wmark_pages(zone);
+		if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+			return true;
+	}
+
+	/*
+	 * If a node has no populated zone within classzone_idx, it does not
+	 * need balancing by definition. This can happen if a zone-restricted
+	 * allocation tries to wake a remote kswapd.
+	 */
+	if (mark == -1)
+		return true;
+
+	return false;
 }
 
 /* Clear pgdat state for congested, dirty or under writeback. */
@@ -3116,8 +3138,6 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
  */
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
-	int i;
-
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
 	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
@@ -3134,16 +3154,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!managed_zone(zone))
-			continue;
-
-		if (zone_balanced(zone, order, classzone_idx)) {
-			clear_pgdat_congested(pgdat);
-			return true;
-		}
+	if (pgdat_balanced(pgdat, order, classzone_idx)) {
+		clear_pgdat_congested(pgdat);
+		return true;
 	}
 
 	return false;
@@ -3249,23 +3262,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Check from
-		 * high to low zone as allocations prefer higher zones.
-		 * Scanning from low to high zone would allow congestion to be
-		 * cleared during a very small window when a small low
-		 * zone was balanced even under extreme pressure when the
-		 * overall node may be congested. Note that sc.reclaim_idx
-		 * is not used as buffer_heads_over_limit may have adjusted
-		 * it.
+		 * Only reclaim if there are no eligible zones. Note that
+		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
+		 * have adjusted it.
 		 */
-		for (i = classzone_idx; i >= 0; i--) {
-			zone = pgdat->node_zones + i;
-			if (!managed_zone(zone))
-				continue;
-
-			if (zone_balanced(zone, sc.order, classzone_idx))
-				goto out;
-		}
+		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+			goto out;
 
 		/*
 		 * Do some background aging of the anon list, to give
@@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	return sc.order;
 }
 
+/*
+ * pgdat->kswapd_classzone_idx is the highest zone index that a recent
+ * allocation request woke kswapd for. When kswapd has not woken recently,
+ * the value is MAX_NR_ZONES which is not a valid index. This compares a
+ * given classzone and returns it or the highest classzone index kswapd
+ * was recently woke for.
+ */
+static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
+					   enum zone_type classzone_idx)
+{
+	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
+		return classzone_idx;
+
+	return max(pgdat->kswapd_classzone_idx, classzone_idx);
+}
+
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
 				unsigned int classzone_idx)
 {
@@ -3363,7 +3381,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 		 * the previous request that slept prematurely.
 		 */
 		if (remaining) {
-			pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+			pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 			pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
 		}
 
@@ -3417,7 +3435,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
  */
 static int kswapd(void *p)
 {
-	unsigned int alloc_order, reclaim_order, classzone_idx;
+	unsigned int alloc_order, reclaim_order;
+	unsigned int classzone_idx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -3447,20 +3466,23 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
-	pgdat->kswapd_classzone_idx = classzone_idx = 0;
+	pgdat->kswapd_order = 0;
+	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
 		bool ret;
 
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
+
 kswapd_try_sleep:
 		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
 					classzone_idx);
 
 		/* Read the new order and classzone_idx */
 		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = pgdat->kswapd_classzone_idx;
+		classzone_idx = kswapd_classzone_idx(pgdat, 0);
 		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = 0;
+		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3486,9 +3508,6 @@ static int kswapd(void *p)
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
-
-		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = pgdat->kswapd_classzone_idx;
 	}
 
 	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
@@ -3504,7 +3523,6 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
-	int z;
 
 	if (!managed_zone(zone))
 		return;
@@ -3512,22 +3530,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
-	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+	pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
 	/* Only wake kswapd if all zones are unbalanced */
-	for (z = 0; z <= classzone_idx; z++) {
-		zone = pgdat->node_zones + z;
-		if (!managed_zone(zone))
-			continue;
-
-		if (zone_balanced(zone, order, classzone_idx))
-			return;
-	}
+	if (pgdat_balanced(pgdat, order, classzone_idx))
+		return;
 
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
-- 
2.11.0

* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
  2017-02-15  9:22 ` Mel Gorman
@ 2017-02-15 20:30   ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2017-02-15 20:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM

On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> This patchset is based on mmots as of Feb 9th, 2017. The baseline is
> important as there are a number of kswapd-related fixes in that tree and
> a comparison against v4.10-rc7 would be almost meaningless as a result.

It's very late to squeeze this into 4.10.  We can make it 4.11 material
and perhaps tag it for backporting into 4.10.1?

* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
  2017-02-15 20:30   ` Andrew Morton
@ 2017-02-15 21:29     ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-15 21:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM

On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote:
> On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This patchset is based on mmots as of Feb 9th, 2017. The baseline is
> > important as there are a number of kswapd-related fixes in that tree and
> > a comparison against v4.10-rc7 would be almost meaningless as a result.
> 
> It's very late to squeeze this into 4.10.  We can make it 4.11 material
> and perhaps tag it for backporting into 4.10.1?

It would be important that Johannes's patches go along with them because
I relied on Johannes' fixes to deal with pages being inappropriately
written back from reclaim context when I was analysing the workload.
I'm thinking specifically about these patches:

mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch
mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch
mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch
mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch
mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch
mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch
mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch

This is 4.11 material for sure but I would not automatically try merging
them to 4.10 unless those patches were also included, ideally with a rerun
of just those patches against 4.10 to make sure there are no surprises
lurking in there.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
  2017-02-15 21:29     ` Mel Gorman
@ 2017-02-15 21:56       ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2017-02-15 21:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM

On Wed, 15 Feb 2017 21:29:06 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote:
> > On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > > This patchset is based on mmots as of Feb 9th, 2016. The baseline is
> > > important as there are a number of kswapd-related fixes in that tree and
> > > a comparison against v4.10-rc7 would be almost meaningless as a result.
> > 
> > It's very late to squeeze this into 4.10.  We can make it 4.11 material
> > and perhaps tag it for backporting into 4.10.1?
> 
> It would be important that Johannes's patches go along with then because
> I'm relied on Johannes' fixes to deal with pages being inappropriately
> written back from reclaim context when I was analysing the workload.
> I'm thinking specifically about these patches
> 
> mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch
> mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch
> mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch
> mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch
> mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch
> mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch
> mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch
> 
> This is 4.11 material for sure but I would not automatically try merging
> them to 4.10 unless those patches were also included, ideally with a rerun
> of just those patches against 4.10 to make sure there are no surprises
> lurking in there.

Head spinning a bit.  You're saying that if the three patches in the
series "Reduce amount of time kswapd sleeps prematurely" are held off
until 4.11 then the above 6 patches from Johannes should also be held
off for 4.11?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
  2017-02-15 21:29     ` Mel Gorman
@ 2017-02-15 22:00       ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-02-15 22:00 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, LKML, Linux-MM

On 15.2.2017 22:29, Mel Gorman wrote:
> On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote:
>> On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>>
>>> This patchset is based on mmots as of Feb 9th, 2016. The baseline is
>>> important as there are a number of kswapd-related fixes in that tree and
>>> a comparison against v4.10-rc7 would be almost meaningless as a result.
>>
>> It's very late to squeeze this into 4.10.  We can make it 4.11 material
>> and perhaps tag it for backporting into 4.10.1?
> 
> It would be important that Johannes's patches go along with then because
> I'm relied on Johannes' fixes to deal with pages being inappropriately
> written back from reclaim context when I was analysing the workload.
> I'm thinking specifically about these patches
> 
> mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch
> mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch
> mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch
> mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch
> mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch
> mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch
> mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch
> 
> This is 4.11 material for sure but I would not automatically try merging
> them to 4.10 unless those patches were also included, ideally with a rerun
> of just those patches against 4.10 to make sure there are no surprises
> lurking in there.

I wonder if we should also care about 4.9 which will be LTS, if we decide to
look at stable at all. IIUC at least the problem that patch 1/3 fixes (wrt
kcompactd not being woken up) is there since 4.8?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
  2017-02-15 21:56       ` Andrew Morton
@ 2017-02-15 22:15         ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-15 22:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM

On Wed, Feb 15, 2017 at 01:56:54PM -0800, Andrew Morton wrote:
> On Wed, 15 Feb 2017 21:29:06 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote:
> > > On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> > > 
> > > > This patchset is based on mmots as of Feb 9th, 2016. The baseline is
> > > > important as there are a number of kswapd-related fixes in that tree and
> > > > a comparison against v4.10-rc7 would be almost meaningless as a result.
> > > 
> > > It's very late to squeeze this into 4.10.  We can make it 4.11 material
> > > and perhaps tag it for backporting into 4.10.1?
> > 
> > It would be important that Johannes's patches go along with then because
> > I'm relied on Johannes' fixes to deal with pages being inappropriately
> > written back from reclaim context when I was analysing the workload.
> > I'm thinking specifically about these patches
> > 
> > mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch
> > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch
> > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch
> > mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch
> > mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch
> > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch
> > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch
> > 
> > This is 4.11 material for sure but I would not automatically try merging
> > them to 4.10 unless those patches were also included, ideally with a rerun
> > of just those patches against 4.10 to make sure there are no surprises
> > lurking in there.
> 
> Head spinning a bit.  You're saying that if the three patches in the
> series "Reduce amount of time kswapd sleeps prematurely" are held off
> until 4.11 then the above 6 patches from Johannes should also be held
> off for 4.11?
> 

Not quite, sorry for the confusion. Johannes's patches stand on their
own and they are fine. I evaluated them before against 4.10-rcX and they
worked as expected.  If they go in first then these patches can go into
4.10-stable on top. If you plan to merge Johannes's patches for 4.10 or
4.10.1 then I have no problem with that whatsoever.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-15  9:22   ` Mel Gorman
@ 2017-02-16  2:50     ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-02-16  2:50 UTC (permalink / raw)
  To: 'Mel Gorman', 'Andrew Morton'
  Cc: 'Shantanu Goel', 'Chris Mason',
	'Johannes Weiner', 'Vlastimil Babka',
	'LKML', 'Linux-MM'

On February 15, 2017 5:23 PM Mel Gorman wrote: 
> 
> From: Shantanu Goel <sgoel01@yahoo.com>
> 
> The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
> since the latter will return as soon as any one of the zones in the
> classzone is above the watermark.  This is especially important for higher
> order allocations since balance_pgdat will typically reset the order to
> zero relying on compaction to create the higher order pages.  Without this
> patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone
> balance check fails.
> 
> On 4.9.7 kswapd is failing to wake up kcompactd due to a mismatch in the
> zone balance check between balance_pgdat() and prepare_kswapd_sleep().
> balance_pgdat() returns as soon as a single zone satisfies the allocation
> but prepare_kswapd_sleep() requires all zones to do the same.  This causes
> prepare_kswapd_sleep() to never succeed except in the order == 0 case and
> consequently, wakeup_kcompactd() is never called.  On my machine prior to
> applying this patch, the state of compaction from /proc/vmstat looked this
> way after a day and a half of uptime:
> 
> compact_migrate_scanned 240496
> compact_free_scanned 76238632
> compact_isolated 123472
> compact_stall 1791
> compact_fail 29
> compact_success 1762
> compact_daemon_wake 0
> 
> After applying the patch and about 10 hours of uptime the state looks
> like this:
> 
> compact_migrate_scanned 59927299
> compact_free_scanned 2021075136
> compact_isolated 640926
> compact_stall 4
> compact_fail 2
> compact_success 2
> compact_daemon_wake 5160
> 
> Further notes from Mel that motivated him to pick this patch up and
> resend it;
> 
> It was observed for the simoop workload (pressures the VM similar to HADOOP)
> that kswapd was failing to keep ahead of direct reclaim. The investigation
> noted that there was a need to rationalise kswapd decisions to reclaim
> with kswapd decisions to sleep. With this patch on a 2-socket box, there
> was a 43% reduction in direct reclaim scanning.
> 
> However, the impact otherwise is extremely negative. Kswapd reclaim
> efficiency dropped from 98% to 76%. simoop has three latency-related
> metrics for read, write and allocation (an anonymous mmap and fault).
> 
>                                          4.10.0-rc7            4.10.0-rc7
>                                      mmots-20170209           fixcheck-v1
> Amean    p50-Read             22325202.49 (  0.00%) 20026926.55 ( 10.29%)
> Amean    p95-Read             26102988.80 (  0.00%) 27023360.00 ( -3.53%)
> Amean    p99-Read             30935176.53 (  0.00%) 30994432.00 ( -0.19%)
> Amean    p50-Write                 976.44 (  0.00%)     1905.28 (-95.12%)
> Amean    p95-Write               15471.29 (  0.00%)    36210.09 (-134.05%)
> Amean    p99-Write               35108.62 (  0.00%)   479494.96 (-1265.75%)
> Amean    p50-Allocation          76382.61 (  0.00%)    87603.20 (-14.69%)
> Amean    p95-Allocation         127777.39 (  0.00%)   244491.38 (-91.34%)
> Amean    p99-Allocation         187937.39 (  0.00%)  1745237.33 (-828.63%)
> 
> There are also more allocation stalls. One of the largest impacts was due
> to pages written back from kswapd context rising from 0 pages to 4516642
> pages during the hour the workload ran for. By and large, the patch has very
> bad behaviour but easily missed as the impact on a UMA machine is negligible.
> 
> This patch is included with the data in case a bisection leads to this area.
> This patch is also a pre-requisite for the rest of the series.
> 
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..92fc66bd52bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3140,11 +3140,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>  		if (!managed_zone(zone))
>  			continue;
> 
> -		if (!zone_balanced(zone, order, classzone_idx))
> -			return false;
> +		if (zone_balanced(zone, order, classzone_idx))
> +			return true;
>  	}
> 
> -	return true;
> +	return false;
>  }
> 
>  /*
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 48+ messages in thread
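
For context, the zone loop that results from the hunk quoted above ends up
looking roughly like the sketch below. Only the core loop is shown; the real
prepare_kswapd_sleep() of this era has additional checks around it that are
not part of the quoted diff, so treat this as an illustration of the
any-zone-balanced semantics rather than the complete function.

static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
{
	int i;

	for (i = 0; i <= classzone_idx; i++) {
		struct zone *zone = pgdat->node_zones + i;

		if (!managed_zone(zone))
			continue;

		/*
		 * After the fix, one balanced eligible zone is enough to let
		 * kswapd sleep (and wake kcompactd), mirroring balance_pgdat()
		 * which also stops at the first balanced zone. The old code
		 * returned false here for any unbalanced zone, so a single
		 * small zone kept kswapd awake and wakeup_kcompactd() was
		 * never reached for order > 0.
		 */
		if (zone_balanced(zone, order, classzone_idx))
			return true;
	}

	return false;
}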

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-15  9:22   ` Mel Gorman
@ 2017-02-16  6:23     ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-02-16  6:23 UTC (permalink / raw)
  To: 'Mel Gorman', 'Andrew Morton'
  Cc: 'Shantanu Goel', 'Chris Mason',
	'Johannes Weiner', 'Vlastimil Babka',
	'LKML', 'Linux-MM'

On February 15, 2017 5:23 PM Mel Gorman wrote: 
>   */
>  static int kswapd(void *p)
>  {
> -	unsigned int alloc_order, reclaim_order, classzone_idx;
> +	unsigned int alloc_order, reclaim_order;
> +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
>  	pg_data_t *pgdat = (pg_data_t*)p;
>  	struct task_struct *tsk = current;
> 
> @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
> 
> -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
> 
> +		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +
>  kswapd_try_sleep:
>  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>  					classzone_idx);
> 
>  		/* Read the new order and classzone_idx */
>  		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = pgdat->kswapd_classzone_idx;
> +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
>  		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = 0;
> +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> 
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
> @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
>  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
>  		if (reclaim_order < alloc_order)
>  			goto kswapd_try_sleep;

If we fail order-5 request,  can we then give up order-5, and
try order-3 if requested, after napping?

> -
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = pgdat->kswapd_classzone_idx;
>  	}
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread
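
The kswapd_classzone_idx() helper called in the hunks above is not included in
this excerpt. Judging by how it is used (MAX_NR_ZONES marks "no wakeup
recorded" and the caller supplies a fallback index), it presumably amounts to
something like the sketch below; the exact body in the posted patch may differ.

static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
					   enum zone_type default_idx)
{
	/*
	 * MAX_NR_ZONES is the "no pending wakeup" value that the main loop
	 * resets the field to once a request has been consumed. Fall back to
	 * the caller-supplied index in that case rather than treating the
	 * invalid value as a real zone index.
	 */
	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
		return default_idx;

	return pgdat->kswapd_classzone_idx;
}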

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-16  6:23     ` Hillf Danton
@ 2017-02-16  8:10       ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-16  8:10 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrew Morton', 'Shantanu Goel',
	'Chris Mason', 'Johannes Weiner',
	'Vlastimil Babka', 'LKML', 'Linux-MM'

On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> On February 15, 2017 5:23 PM Mel Gorman wrote: 
> >   */
> >  static int kswapd(void *p)
> >  {
> > -	unsigned int alloc_order, reclaim_order, classzone_idx;
> > +	unsigned int alloc_order, reclaim_order;
> > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
> >  	pg_data_t *pgdat = (pg_data_t*)p;
> >  	struct task_struct *tsk = current;
> > 
> > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> >  	set_freezable();
> > 
> > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > +	pgdat->kswapd_order = 0;
> > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> >  	for ( ; ; ) {
> >  		bool ret;
> > 
> > +		alloc_order = reclaim_order = pgdat->kswapd_order;
> > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > +
> >  kswapd_try_sleep:
> >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> >  					classzone_idx);
> > 
> >  		/* Read the new order and classzone_idx */
> >  		alloc_order = reclaim_order = pgdat->kswapd_order;
> > -		classzone_idx = pgdat->kswapd_classzone_idx;
> > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> >  		pgdat->kswapd_order = 0;
> > -		pgdat->kswapd_classzone_idx = 0;
> > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > 
> >  		ret = try_to_freeze();
> >  		if (kthread_should_stop())
> > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> >  		if (reclaim_order < alloc_order)
> >  			goto kswapd_try_sleep;
> 
> If we fail order-5 request,  can we then give up order-5, and
> try order-3 if requested, after napping?
> 

That has no bearing upon this patch. At this point, kswapd has stopped
reclaiming at the requested order and is preparing to sleep. If there is
a parallel request for order-3 while it's sleeping, it'll wake and start
reclaiming at order-3 as requested.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-16  8:10       ` Mel Gorman
@ 2017-02-16  8:21         ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-02-16  8:21 UTC (permalink / raw)
  To: 'Mel Gorman'
  Cc: 'Andrew Morton', 'Shantanu Goel',
	'Chris Mason', 'Johannes Weiner',
	'Vlastimil Babka', 'LKML', 'Linux-MM'


On February 16, 2017 4:11 PM Mel Gorman wrote:
> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> > On February 15, 2017 5:23 PM Mel Gorman wrote:
> > >   */
> > >  static int kswapd(void *p)
> > >  {
> > > -	unsigned int alloc_order, reclaim_order, classzone_idx;
> > > +	unsigned int alloc_order, reclaim_order;
> > > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
> > >  	pg_data_t *pgdat = (pg_data_t*)p;
> > >  	struct task_struct *tsk = current;
> > >
> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> > >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > >  	set_freezable();
> > >
> > > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > > +	pgdat->kswapd_order = 0;
> > > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > >  	for ( ; ; ) {
> > >  		bool ret;
> > >
> > > +		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > > +
> > >  kswapd_try_sleep:
> > >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> > >  					classzone_idx);
> > >
> > >  		/* Read the new order and classzone_idx */
> > >  		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > -		classzone_idx = pgdat->kswapd_classzone_idx;
> > > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> > >  		pgdat->kswapd_order = 0;
> > > -		pgdat->kswapd_classzone_idx = 0;
> > > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > >
> > >  		ret = try_to_freeze();
> > >  		if (kthread_should_stop())
> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> > >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > >  		if (reclaim_order < alloc_order)
> > >  			goto kswapd_try_sleep;
> >
> > If we fail order-5 request,  can we then give up order-5, and
> > try order-3 if requested, after napping?
> >
> 
> That has no bearing upon this patch. At this point, kswapd has stopped
> reclaiming at the requested order and is preparing to sleep. If there is
> a parallel request for order-3 while it's sleeping, it'll wake and start
> reclaiming at order-3 as requested.
> 
Right, but the order-3 request can also come up while kswapd is active and
gives up order-5.

thanks
Hillf

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-16  8:21         ` Hillf Danton
@ 2017-02-16  9:32           ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-16  9:32 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Andrew Morton', 'Shantanu Goel',
	'Chris Mason', 'Johannes Weiner',
	'Vlastimil Babka', 'LKML', 'Linux-MM'

On Thu, Feb 16, 2017 at 04:21:04PM +0800, Hillf Danton wrote:
> 
> On February 16, 2017 4:11 PM Mel Gorman wrote:
> > On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
> > > On February 15, 2017 5:23 PM Mel Gorman wrote:
> > > >   */
> > > >  static int kswapd(void *p)
> > > >  {
> > > > -	unsigned int alloc_order, reclaim_order, classzone_idx;
> > > > +	unsigned int alloc_order, reclaim_order;
> > > > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
> > > >  	pg_data_t *pgdat = (pg_data_t*)p;
> > > >  	struct task_struct *tsk = current;
> > > >
> > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
> > > >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > > >  	set_freezable();
> > > >
> > > > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> > > > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> > > > +	pgdat->kswapd_order = 0;
> > > > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > > >  	for ( ; ; ) {
> > > >  		bool ret;
> > > >
> > > > +		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> > > > +
> > > >  kswapd_try_sleep:
> > > >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> > > >  					classzone_idx);
> > > >
> > > >  		/* Read the new order and classzone_idx */
> > > >  		alloc_order = reclaim_order = pgdat->kswapd_order;
> > > > -		classzone_idx = pgdat->kswapd_classzone_idx;
> > > > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
> > > >  		pgdat->kswapd_order = 0;
> > > > -		pgdat->kswapd_classzone_idx = 0;
> > > > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
> > > >
> > > >  		ret = try_to_freeze();
> > > >  		if (kthread_should_stop())
> > > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
> > > >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
> > > >  		if (reclaim_order < alloc_order)
> > > >  			goto kswapd_try_sleep;
> > >
> > > If we fail order-5 request,  can we then give up order-5, and
> > > try order-3 if requested, after napping?
> > >
> > 
> > That has no bearing upon this patch. At this point, kswapd has stopped
> > reclaiming at the requested order and is preparing to sleep. If there is
> > a parallel request for order-3 while it's sleeping, it'll wake and start
> > reclaiming at order-3 as requested.
> > 
>
> Right, but the order-3 request can also come up while kswapd is active and
> gives up order-5.
> 

And then it'll be in pgdat->kswapd_order and be picked up on the next
wakeup. It won't be immediate but it's also unlikely to be worth picking
up immediately. The context here is that a high-order reclaim request
failed and, rather than keeping kswapd awake reclaiming the world, it goes
to sleep until another wakeup request comes in. Staying awake continually
for high orders caused problems with excessive reclaim in the past.

It could be revisited again but it's not related to what this patch is
aimed at -- avoiding reclaim going to sleep because ZONE_DMA is balanced
for a GFP_DMA request which is nowhere in the request stream.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-16  8:21         ` Hillf Danton
@ 2017-02-20 16:34           ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-02-20 16:34 UTC (permalink / raw)
  To: Hillf Danton, 'Mel Gorman'
  Cc: 'Andrew Morton', 'Shantanu Goel',
	'Chris Mason', 'Johannes Weiner', 'LKML',
	'Linux-MM'

On 02/16/2017 09:21 AM, Hillf Danton wrote:
> 
> On February 16, 2017 4:11 PM Mel Gorman wrote:
>> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote:
>> > On February 15, 2017 5:23 PM Mel Gorman wrote:
>> > >   */
>> > >  static int kswapd(void *p)
>> > >  {
>> > > -	unsigned int alloc_order, reclaim_order, classzone_idx;
>> > > +	unsigned int alloc_order, reclaim_order;
>> > > +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
>> > >  	pg_data_t *pgdat = (pg_data_t*)p;
>> > >  	struct task_struct *tsk = current;
>> > >
>> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
>> > >  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>> > >  	set_freezable();
>> > >
>> > > -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
>> > > -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
>> > > +	pgdat->kswapd_order = 0;
>> > > +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>> > >  	for ( ; ; ) {
>> > >  		bool ret;
>> > >
>> > > +		alloc_order = reclaim_order = pgdat->kswapd_order;
>> > > +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
>> > > +
>> > >  kswapd_try_sleep:
>> > >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>> > >  					classzone_idx);
>> > >
>> > >  		/* Read the new order and classzone_idx */
>> > >  		alloc_order = reclaim_order = pgdat->kswapd_order;
>> > > -		classzone_idx = pgdat->kswapd_classzone_idx;
>> > > +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
>> > >  		pgdat->kswapd_order = 0;
>> > > -		pgdat->kswapd_classzone_idx = 0;
>> > > +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>> > >
>> > >  		ret = try_to_freeze();
>> > >  		if (kthread_should_stop())
>> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
>> > >  		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
>> > >  		if (reclaim_order < alloc_order)
>> > >  			goto kswapd_try_sleep;
>> >
>> > If we fail order-5 request,  can we then give up order-5, and
>> > try order-3 if requested, after napping?
>> >
>> 
>> That has no bearing upon this patch. At this point, kswapd has stopped
>> reclaiming at the requested order and is preparing to sleep. If there is
>> a parallel request for order-3 while it's sleeping, it'll wake and start
>> reclaiming at order-3 as requested.
>> 
> Right, but the order-3 request can also come up while kswapd is active and
> gives up order-5.

"Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming
order-0 watermarks are OK) and wakeup kcompactd for order-5. There's no way how
kswapd could help an order-3 allocation at that point - it's up to kcompactd.

> thanks
> Hillf
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread
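
A condensed sketch of the handover Vlastimil describes: once order-0
watermarks are met, kswapd prepares to sleep and passes the original
allocation order to kcompactd instead of continuing to reclaim for it. The
real kswapd_try_to_sleep() of this era sleeps in two stages and adjusts
vmstat thresholds, so details below that are not in the quoted hunks should
be read as an approximation.

static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order,
				int reclaim_order, unsigned int classzone_idx)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

	if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
		/*
		 * Order-0 watermarks are satisfied, so kswapd stops reclaiming.
		 * The unmet high-order request (alloc_order, e.g. the failed
		 * order-5) is handed to kcompactd here; kswapd itself only
		 * runs again on the next wakeup_kswapd() call.
		 */
		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
		schedule();
	}

	finish_wait(&pgdat->kswapd_wait, &wait);
}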

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-15  9:22   ` Mel Gorman
@ 2017-02-20 16:42     ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-02-20 16:42 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, LKML, Linux-MM

On 02/15/2017 10:22 AM, Mel Gorman wrote:
> kswapd is woken to reclaim a node based on a failed allocation request
> from any eligible zone. Once reclaiming in balance_pgdat(), it will
> continue reclaiming until there is an eligible zone available for the
> zone it was woken for. kswapd tracks what zone it was recently woken for
> in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
> zone will be 0.
> 
> However, the decision on whether to sleep is made on kswapd_classzone_idx
> which is 0 without a recent wakeup request and that classzone does not
> account for lowmem reserves.  This allows kswapd to sleep when a low
> small zone such as ZONE_DMA is balanced for a GFP_DMA request even if
> a stream of allocations cannot use that zone. While kswapd may be woken
> again in the near future, there are two consequences -- the pgdat
> bits that control congestion are cleared prematurely and direct reclaim
> is more likely as kswapd slept prematurely.
> 
> This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
> index) when there have been no recent wakeups. If there are no wakeups,
> it'll decide whether to sleep based on the highest possible zone available
> (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
> decisions during reclaim and when deciding to sleep are the same. If there is
> a mismatch, kswapd can stay awake continually trying to balance tiny zones.
> 
> simoop was used to evaluate it again. Two of the preparation patches regressed
> the workload so they are included as the second set of results. Otherwise
> this patch looks artificially excellent.
> 
>                                          4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
>                                      mmots-20170209           clear-v1r25       keepawake-v1r25
> Amean    p50-Read             22325202.49 (  0.00%) 19491134.58 ( 12.69%) 22092755.48 (  1.04%)
> Amean    p95-Read             26102988.80 (  0.00%) 24294195.20 (  6.93%) 26101849.04 (  0.00%)
> Amean    p99-Read             30935176.53 (  0.00%) 30397053.16 (  1.74%) 29746220.52 (  3.84%)
> Amean    p50-Write                 976.44 (  0.00%)     1077.22 (-10.32%)      952.73 (  2.43%)
> Amean    p95-Write               15471.29 (  0.00%)    36419.56 (-135.40%)     3140.27 ( 79.70%)
> Amean    p99-Write               35108.62 (  0.00%)   102000.36 (-190.53%)     8843.73 ( 74.81%)
> Amean    p50-Allocation          76382.61 (  0.00%)    87485.22 (-14.54%)    76349.22 (  0.04%)
> Amean    p95-Allocation         127777.39 (  0.00%)   204588.52 (-60.11%)   108630.26 ( 14.98%)
> Amean    p99-Allocation         187937.39 (  0.00%)   631657.74 (-236.10%)   139094.26 ( 25.99%)
> 
> With this patch on top, all the latencies relative to the baseline are
> improved, particularly write latencies. The read latencies are still high
> for the number of threads but it's worth noting that this is mostly due
> to the IO scheduler and not directly related to reclaim. The vmstats are
> a bit of a mix but the relevant ones are as follows;
> 
>                             4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
>                           mmots-20170209 clear-v1r25 keepawake-v1r25
> Swap Ins                             0           0           0
> Swap Outs                            0         608           0
> Direct pages scanned           6910672     3132699     6357298
> Kswapd pages scanned          57036946    82488665    56986286
> Kswapd pages reclaimed        55993488    63474329    55939113
> Direct pages reclaimed         6905990     2964843     6352115

These stats are confusing me. The earlier description suggests that this patch
should cause less direct reclaim and more kswapd reclaim, but compared to
"clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
considering direct + kswapd combined)

> Kswapd efficiency                  98%         76%         98%
> Kswapd velocity              12494.375   17597.507   12488.065
> Direct efficiency                  99%         94%         99%
> Direct velocity               1513.835     668.306    1393.148
> Page writes by reclaim           0.000 4410243.000       0.000
> Page writes file                     0     4409635           0
> Page writes anon                     0         608           0
> Page reclaim immediate         1036792    14175203     1042571
> 
> Swap-outs are equivalent to baseline
> Direct reclaim is reduced but not eliminated. It's worth noting
> 	that there are two periods of direct reclaim for this workload. The
> 	first is when it switches from preparing the files to the actual
> 	test itself. It's a lot of file IO followed by a lot of allocs
> 	that reclaims heavily for a brief window. After that, direct
> 	reclaim is intermittent when the workload spawns a number of
> 	threads periodically to do work. kswapd simply cannot wake and
> 	reclaim fast enough between the low and min watermarks. It could
> 	be mitigated using vm.watermark_scale_factor but not through
> 	special tricks in kswapd.
> Page writes from reclaim context are at 0 which is the ideal
> Pages immediately reclaimed after IO completes is back at the baseline
> 
> On UMA, there is almost no change so this is not expected to be a universal
> win.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

[...]

> @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	return sc.order;
>  }
>  
> +/*
> + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
> + * allocation request woke kswapd for. When kswapd has not woken recently,
> + * the value is MAX_NR_ZONES which is not a valid index. This compares a
> + * given classzone and returns it or the highest classzone index kswapd
> + * was recently woke for.
> + */
> +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
> +					   enum zone_type classzone_idx)
> +{
> +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
> +		return classzone_idx;
> +
> +	return max(pgdat->kswapd_classzone_idx, classzone_idx);

A somewhat paranoid comment: this should probably read pgdat->kswapd_classzone_idx
into a local variable with READ_ONCE(), otherwise something can set it to
MAX_NR_ZONES between the check and max(), and the compiler can decide to reread it.
Probably not an issue with current callers, but I'd rather future-proof it.
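
Concretely, the change I have in mind would be something like this (an
untested sketch based on the helper quoted above, just to illustrate the
suggestion):

/* Untested sketch of the suggestion: snapshot the field, then use the copy. */
static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
					   enum zone_type classzone_idx)
{
	enum zone_type curr_idx = READ_ONCE(pgdat->kswapd_classzone_idx);

	if (curr_idx == MAX_NR_ZONES)
		return classzone_idx;

	return max(curr_idx, classzone_idx);
}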

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-20 16:34           ` Vlastimil Babka
@ 2017-02-21  4:10             ` Hillf Danton
  -1 siblings, 0 replies; 48+ messages in thread
From: Hillf Danton @ 2017-02-21  4:10 UTC (permalink / raw)
  To: 'Vlastimil Babka', 'Mel Gorman'
  Cc: 'Andrew Morton', 'Shantanu Goel',
	'Chris Mason', 'Johannes Weiner', 'LKML',
	'Linux-MM'

On February 21, 2017 12:34 AM Vlastimil Babka wrote:
> On 02/16/2017 09:21 AM, Hillf Danton wrote:
> > Right, but the order-3 request can also come up while kswapd is active and
> > gives up order-5.
> 
> "Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming
> order-0 watermarks are OK) and wakeup kcompactd for order-5. There's no way how
> kswapd could help an order-3 allocation at that point - it's up to kcompactd.
> 
	cpu0				cpu1
	give up order-5 
	fall back to order-0
					wake up kswapd for order-3 
					wake up kswapd for order-5
	fall in sleep
					wake up kswapd for order-3
	what order would
	we try?

It is order-5 in the patch. 

Given the fresh new world without a hike ban after napping,
whether for one tenth of a second or 3 minutes, we are free IMHO to select
any order and go another round of reclaiming pages.

thanks
Hillf

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-15  9:22   ` Mel Gorman
@ 2017-02-22  7:00     ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2017-02-22  7:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

Hi,

On Wed, Feb 15, 2017 at 09:22:45AM +0000, Mel Gorman wrote:
> From: Shantanu Goel <sgoel01@yahoo.com>
> 
> The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
> since the latter will return as soon as any one of the zones in the
> classzone is above the watermark.  This is especially important for higher
> order allocations since balance_pgdat will typically reset the order to
> zero relying on compaction to create the higher order pages.  Without this
> patch, prepare_kswapd_sleep fails to wake up kcompactd since the zone
> balance check fails.
> 
> On 4.9.7 kswapd is failing to wake up kcompactd due to a mismatch in the
> zone balance check between balance_pgdat() and prepare_kswapd_sleep().
> balance_pgdat() returns as soon as a single zone satisfies the allocation
> but prepare_kswapd_sleep() requires all zones to do the same.  This causes
> prepare_kswapd_sleep() to never succeed except in the order == 0 case and
> consequently, wakeup_kcompactd() is never called.  On my machine prior to
> applying this patch, the state of compaction from /proc/vmstat looked this
> way after a day and a half of uptime:
> 
> compact_migrate_scanned 240496
> compact_free_scanned 76238632
> compact_isolated 123472
> compact_stall 1791
> compact_fail 29
> compact_success 1762
> compact_daemon_wake 0
> 
> After applying the patch and about 10 hours of uptime the state looks
> like this:
> 
> compact_migrate_scanned 59927299
> compact_free_scanned 2021075136
> compact_isolated 640926
> compact_stall 4
> compact_fail 2
> compact_success 2
> compact_daemon_wake 5160
> 
> Further notes from Mel that motivated him to pick this patch up and
> resend it;
> 
> It was observed for the simoop workload (pressures the VM similar to HADOOP)
> that kswapd was failing to keep ahead of direct reclaim. The investigation
> noted that there was a need to rationalise kswapd decisions to reclaim
> with kswapd decisions to sleep. With this patch on a 2-socket box, there
> was a 43% reduction in direct reclaim scanning.
> 
> However, the impact otherwise is extremely negative. Kswapd reclaim
> efficiency dropped from 98% to 76%. simoop has three latency-related
> metrics for read, write and allocation (an anonymous mmap and fault).
> 
>                                          4.10.0-rc7            4.10.0-rc7
>                                      mmots-20170209           fixcheck-v1
> Amean    p50-Read             22325202.49 (  0.00%) 20026926.55 ( 10.29%)
> Amean    p95-Read             26102988.80 (  0.00%) 27023360.00 ( -3.53%)
> Amean    p99-Read             30935176.53 (  0.00%) 30994432.00 ( -0.19%)
> Amean    p50-Write                 976.44 (  0.00%)     1905.28 (-95.12%)
> Amean    p95-Write               15471.29 (  0.00%)    36210.09 (-134.05%)
> Amean    p99-Write               35108.62 (  0.00%)   479494.96 (-1265.75%)
> Amean    p50-Allocation          76382.61 (  0.00%)    87603.20 (-14.69%)
> Amean    p95-Allocation         127777.39 (  0.00%)   244491.38 (-91.34%)
> Amean    p99-Allocation         187937.39 (  0.00%)  1745237.33 (-828.63%)
> 
> There are also more allocation stalls. One of the largest impacts was due
> to pages written back from kswapd context rising from 0 pages to 4516642
> pages during the hour the workload ran for. By and large, the patch has very
> bad behaviour but easily missed as the impact on a UMA machine is negligible.
> 
> This patch is included with the data in case a bisection leads to this area.
> This patch is also a pre-requisite for the rest of the series.
> 
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
short sleep point, where all eligible zones are balanced.
What's the correlation between them?

Can't we wake up kcompactd once we find a zone that has enough free pages
above the high watermark, like this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..f4f0ad0e9ede 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 		 * that pages and compaction may succeed so reset the cache.
 		 */
 		reset_isolation_suitable(pgdat);
-
-		/*
-		 * We have freed the memory, now we should compact it to make
-		 * allocation of the requested order possible.
-		 */
-		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
-
 		remaining = schedule_timeout(HZ/10);
 
 		/*
@@ -3451,6 +3444,14 @@ static int kswapd(void *p)
 		bool ret;
 
 kswapd_try_sleep:
+		/*
+		 * We have freed the memory, now we should compact it to make
+		 * allocation of the requested order possible.
+		 */
+		if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
+							classzone_idx))
+			wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+
 		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
 					classzone_idx);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-20 16:42     ` Vlastimil Babka
@ 2017-02-23 15:01       ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-23 15:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, LKML,
	Linux-MM

On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
> > With this patch on top, all the latencies relative to the baseline are
> > improved, particularly write latencies. The read latencies are still high
> > for the number of threads but it's worth noting that this is mostly due
> > to the IO scheduler and not directly related to reclaim. The vmstats are
> > a bit of a mix but the relevant ones are as follows;
> > 
> >                             4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
> >                           mmots-20170209 clear-v1r25 keepawake-v1r25
> > Swap Ins                             0           0           0
> > Swap Outs                            0         608           0
> > Direct pages scanned           6910672     3132699     6357298
> > Kswapd pages scanned          57036946    82488665    56986286
> > Kswapd pages reclaimed        55993488    63474329    55939113
> > Direct pages reclaimed         6905990     2964843     6352115
> 
> These stats are confusing me. The earlier description suggests that this patch
> should cause less direct reclaim and more kswapd reclaim, but compared to
> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
> considering direct + kswapd combined)
> 

The description is referring to the impact relative to the baseline. It is
true that, relative to that patch, direct reclaim is higher but there are
a number of anomalies.

Note that kswapd is scanning very aggressively in "clear-v1" and overall
efficiency is down to 76%. It's also not clear in the stats but in
"clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
due to the patch "mm, vmscan: fix zone balance check in
prepare_kswapd_sleep". It's also doing a lot of writing of file-backed
pages from reclaim context and some swapping due to the aggressiveness
of the scan.

While direct reclaim activity might be lower, it's due to kswapd scanning
aggressively and trying to reclaim the world which is not the right thing
to do.  With the patches applied, there is still direct reclaim but the vast
bulk of it occurs when the workload changes phase from "creating work files"
to starting multiple threads that allocate a lot of anonymous memory, with
a sudden spike in memory pressure that kswapd cannot keep ahead of while
multiple threads are allocating.

> > @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  	return sc.order;
> >  }
> >  
> > +/*
> > + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
> > + * allocation request woke kswapd for. When kswapd has not woken recently,
> > + * the value is MAX_NR_ZONES which is not a valid index. This compares a
> > + * given classzone and returns it or the highest classzone index kswapd
> > + * was recently woke for.
> > + */
> > +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
> > +					   enum zone_type classzone_idx)
> > +{
> > +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
> > +		return classzone_idx;
> > +
> > +	return max(pgdat->kswapd_classzone_idx, classzone_idx);
> 
> A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx to
> a local variable with READ_ONCE(), otherwise something can set it to
> MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
> Probably not an issue with current callers, but I'd rather future-proof it.
> 

I'm a little wary of adding READ_ONCE unless there is a definite
problem. Even if it were an issue, I think it would be better to protect
the kswapd_classzone_idx and kswapd_order fields with a spinlock that is
taken for both updates and reads to fully guarantee the ordering.

The consequence as things stand is that kswapd may miss reclaiming at a
higher order or classzone than it should have, although that is very
unlikely, and the update and read are made around a workqueue wake and
scheduler wakeup which should be sufficient in terms of barriers.
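
For illustration, the sort of thing I mean would look roughly like this
(purely a sketch; pgdat->kswapd_lock is an invented field and this is not a
real patch):

/* Hypothetical sketch only: kswapd_lock does not exist today. */
static void kswapd_update_request(pg_data_t *pgdat, unsigned int order,
				  enum zone_type classzone_idx)
{
	spin_lock(&pgdat->kswapd_lock);
	if (order > pgdat->kswapd_order)
		pgdat->kswapd_order = order;
	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES ||
	    classzone_idx > pgdat->kswapd_classzone_idx)
		pgdat->kswapd_classzone_idx = classzone_idx;
	spin_unlock(&pgdat->kswapd_lock);
}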

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-22  7:00     ` Minchan Kim
@ 2017-02-23 15:05       ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-23 15:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > There are also more allocation stalls. One of the largest impacts was due
> > to pages written back from kswapd context rising from 0 pages to 4516642
> > pages during the hour the workload ran for. By and large, the patch has very
> > bad behaviour but easily missed as the impact on a UMA machine is negligible.
> > 
> > This patch is included with the data in case a bisection leads to this area.
> > This patch is also a pre-requisite for the rest of the series.
> > 
> > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> short sleep point where every eligible zones are balanced.
> What's the correlation between them?
> 

If kswapd is ready for a short sleep, eligible zones are balanced for
order-0 but not necessarily the originally requested order if kswapd
gave up reclaiming as compaction was ready to start. As kswapd is ready
to sleep for a short period, it's a suitable time for kcompactd to decide
if it should start working or not. There is no need for kswapd to be aware
of kcompactd's wakeup criteria.

> Can't we wake up kcompactd once we found a zone has enough free pages
> above high watermark like this?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..f4f0ad0e9ede 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
>  		 * that pages and compaction may succeed so reset the cache.
>  		 */
>  		reset_isolation_suitable(pgdat);
> -
> -		/*
> -		 * We have freed the memory, now we should compact it to make
> -		 * allocation of the requested order possible.
> -		 */
> -		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> -
>  		remaining = schedule_timeout(HZ/10);
>  
>  		/*
> @@ -3451,6 +3444,14 @@ static int kswapd(void *p)
>  		bool ret;
>  
>  kswapd_try_sleep:
> +		/*
> +		 * We have freed the memory, now we should compact it to make
> +		 * allocation of the requested order possible.
> +		 */
> +		if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
> +							classzone_idx))
> +			wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> +
>  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>  					classzone_idx);

That's functionally very similar to what happens already.  wakeup_kcompactd
checks the order and does not wake for order-0. It also makes its own
decisions, including zone_balanced checks, on whether it is safe to wake up.
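
Schematically, the gatekeeping already works something like the following
simplified sketch (an illustration of the behaviour described above, not the
real mm/compaction.c code; kcompactd_worth_waking is an invented stand-in):

/* Simplified sketch of the existing gatekeeping -- illustration only. */
static void wakeup_kcompactd_sketch(pg_data_t *pgdat, int order,
				    enum zone_type classzone_idx)
{
	if (!order)
		return;		/* never wake kcompactd for order-0 */

	/* Stand-in for kcompactd's own zone_balanced-style suitability checks. */
	if (!kcompactd_worth_waking(pgdat, order, classzone_idx))
		return;

	wake_up_interruptible(&pgdat->kcompactd_wait);
}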

I doubt there would be any measurable difference from a patch like this
and to my mind at least, it does not improve the readability or flow of
the code.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-23 15:05       ` Mel Gorman
@ 2017-02-24  1:17         ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2017-02-24  1:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

Hi Mel,

On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote:
> On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > > There are also more allocation stalls. One of the largest impacts was due
> > > to pages written back from kswapd context rising from 0 pages to 4516642
> > > pages during the hour the workload ran for. By and large, the patch has very
> > > bad behaviour but easily missed as the impact on a UMA machine is negligible.
> > > 
> > > This patch is included with the data in case a bisection leads to this area.
> > > This patch is also a pre-requisite for the rest of the series.
> > > 
> > > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > 
> > Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> > short sleep point where every eligible zones are balanced.
> > What's the correlation between them?
> > 
> 
> If kswapd is ready for a short sleep, eligible zones are balanced for
> order-0 but not necessarily the originally requested order if kswapd
> gave up reclaiming as compaction was ready to start. As kswapd is ready
> to sleep for a short period, it's a suitable time for kcompactd to decide
> if it should start working or not. There is no need for kswapd to be aware
> of kcompactd's wakeup criteria.

If all eligible zones are balanced for order-0, I agree it's good timing
because the high-order allocation success ratio would be higher since kcompactd
can compact all the eligible zones, not only that classzone.
However, this patch breaks that, as well as the long-standing kswapd behaviour
of continuing to balance eligible zones for order-0.
Is it really okay now?

> 
> > Can't we wake up kcompactd once we found a zone has enough free pages
> > above high watermark like this?
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 26c3b405ef34..f4f0ad0e9ede 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> >  		 * that pages and compaction may succeed so reset the cache.
> >  		 */
> >  		reset_isolation_suitable(pgdat);
> > -
> > -		/*
> > -		 * We have freed the memory, now we should compact it to make
> > -		 * allocation of the requested order possible.
> > -		 */
> > -		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > -
> >  		remaining = schedule_timeout(HZ/10);
> >  
> >  		/*
> > @@ -3451,6 +3444,14 @@ static int kswapd(void *p)
> >  		bool ret;
> >  
> >  kswapd_try_sleep:
> > +		/*
> > +		 * We have freed the memory, now we should compact it to make
> > +		 * allocation of the requested order possible.
> > +		 */
> > +		if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
> > +							classzone_idx))
> > +			wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > +
> >  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> >  					classzone_idx);
> 
> That's functionally very similar to what happens already.  wakeup_kcompactd
> checks the order and does not wake for order-0. It also makes its own
> decisions that include zone_balanced on whether it is safe to wakeup.

Agree.

> 
> I doubt there would be any measurable difference from a patch like this
> and to my mind at least, it does not improve the readability or flow of
> the code.

However, my concern is the premature kswapd sleep for order-0, which has been
long-standing behaviour, so I hope the reason it is okay now can be documented.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-24  1:17         ` Minchan Kim
@ 2017-02-24  9:11           ` Mel Gorman
  -1 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2017-02-24  9:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

On Fri, Feb 24, 2017 at 10:17:06AM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote:
> > On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > > > There are also more allocation stalls. One of the largest impacts was due
> > > > to pages written back from kswapd context rising from 0 pages to 4516642
> > > > pages during the hour the workload ran for. By and large, the patch has very
> > > > bad behaviour but easily missed as the impact on a UMA machine is negligible.
> > > > 
> > > > This patch is included with the data in case a bisection leads to this area.
> > > > This patch is also a pre-requisite for the rest of the series.
> > > > 
> > > > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > 
> > > Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> > > short sleep point where every eligible zones are balanced.
> > > What's the correlation between them?
> > > 
> > 
> > If kswapd is ready for a short sleep, eligible zones are balanced for
> > order-0 but not necessarily the originally requested order if kswapd
> > gave up reclaiming as compaction was ready to start. As kswapd is ready
> > to sleep for a short period, it's a suitable time for kcompactd to decide
> > if it should start working or not. There is no need for kswapd to be aware
> > of kcompactd's wakeup criteria.
> 
> If all eligible zones are balanced for order-0, I agree it's good timing
> because high-order alloc's ratio would be higher since kcompactd can compact
> eligible zones, not that only classzone.
> However, this patch breaks it as well as long time kswapd behavior which
> continues to balance eligible zones for order-0.
> Is it really okay now?
> 

Reclaim stops in balance_pgdat() if any eligible zone for the requested
classzone is free. The initial sleep for kswapd is very different because
it'll sleep if all zones are balanced for order-0 which is a bad disconnect.
The way node balancing works means there is no guarantee at all that all
zones will be balanced even if there is little or no memory pressure and
one large zone in a node with multiple zones can be balanced quickly.
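
To make the disconnect concrete, here is a standalone toy sketch of the two
conditions (not kernel code, the helpers are invented): reclaim's exit
condition only needs one eligible zone, while the old sleep check needed all
of them.

#include <stdbool.h>

/* Toy model of the disconnect -- not kernel code. */
#define NR_ZONES 3

/* Stand-in for a watermark check that accounts for lowmem reserves. */
static bool zone_balanced_toy(const bool *balanced, int zone)
{
	return balanced[zone];
}

/* balance_pgdat()-style exit: any one eligible zone is enough. */
static bool reclaim_done(const bool *balanced, int classzone_idx)
{
	for (int z = 0; z <= classzone_idx && z < NR_ZONES; z++)
		if (zone_balanced_toy(balanced, z))
			return true;
	return false;
}

/* Pre-patch sleep check: every eligible zone had to be balanced. */
static bool may_sleep_old(const bool *balanced, int classzone_idx)
{
	for (int z = 0; z <= classzone_idx && z < NR_ZONES; z++)
		if (!zone_balanced_toy(balanced, z))
			return false;
	return true;
}

int main(void)
{
	/* Only the highest of three zones is balanced. */
	bool balanced[NR_ZONES] = { false, false, true };

	/* Reclaim thinks it is done, yet the old sleep check still says no. */
	return reclaim_done(balanced, NR_ZONES - 1) &&
	       !may_sleep_old(balanced, NR_ZONES - 1) ? 0 : 1;
}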

The short-sleep logic that kswapd uses to decide whether to go to sleep
is short-circuited and it does not properly attempt the short sleep that
checks whether the high watermarks are quickly reached or not. Instead, it
quickly fails the first attempt at sleep, re-enters balance_pgdat(), finds
nothing to do and rechecks sleeping based on order-0, classzone-0, which it
can easily sleep for but which is *not* what kswapd was woken for in the
first place.

For many allocation requests that initially woke kswapd, the impact is
marginal. kswapd sleeps early and is woken in the near future if there
is a continual stream of allocations with a risk that direct reclaim is
required. While the motivation for the patch was that kcompactd is not woken
up, the existing behaviour is just wrong -- kswapd should be deciding to
sleep based on the classzone it was woken for and if possible, the order
it was woken for but the classzone is more important in the common case
for order-0 allocations.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-24  9:11           ` Mel Gorman
@ 2017-02-27  6:16             ` Minchan Kim
  -1 siblings, 0 replies; 48+ messages in thread
From: Minchan Kim @ 2017-02-27  6:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

Hi Mel,

On Fri, Feb 24, 2017 at 09:11:28AM +0000, Mel Gorman wrote:
> On Fri, Feb 24, 2017 at 10:17:06AM +0900, Minchan Kim wrote:
> > Hi Mel,
> > 
> > On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote:
> > > On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > > > > There are also more allocation stalls. One of the largest impacts was due
> > > > > to pages written back from kswapd context rising from 0 pages to 4516642
> > > > > pages during the hour the workload ran for. By and large, the patch has very
> > > > > bad behaviour but easily missed as the impact on a UMA machine is negligible.
> > > > > 
> > > > > This patch is included with the data in case a bisection leads to this area.
> > > > > This patch is also a pre-requisite for the rest of the series.
> > > > > 
> > > > > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > > > 
> > > > Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> > > > short sleep point, where all eligible zones are balanced.
> > > > What's the correlation between them?
> > > > 
> > > 
> > > If kswapd is ready for a short sleep, eligible zones are balanced for
> > > order-0 but not necessarily the originally requested order if kswapd
> > > gave up reclaiming as compaction was ready to start. As kswapd is ready
> > > to sleep for a short period, it's a suitable time for kcompactd to decide
> > > if it should start working or not. There is no need for kswapd to be aware
> > > of kcompactd's wakeup criteria.
> > 
> > If all eligible zones are balanced for order-0, I agree it's good timing
> > because the high-order allocation success ratio would be higher, since
> > kcompactd can compact all eligible zones, not only the classzone.
> > However, this patch breaks that, as well as the long-standing kswapd
> > behaviour of continuing to balance eligible zones for order-0.
> > Is it really okay now?
> > 
> 
> Reclaim stops in balance_pgdat() if any eligible zone for the requested
> classzone is balanced. The initial sleep check for kswapd is very different
> because it'll sleep only if all zones are balanced for order-0, which is a
> bad disconnect. The way node balancing works means there is no guarantee
> at all that all zones will be balanced even if there is little or no
> memory pressure, and one large zone in a node with multiple zones can be
> balanced quickly.

Indeed, but it would tip more toward direct reclaim, so it could cause more
failures for allocations that rely on kswapd, such as atomic allocations.
However, if the VM balances all zones for order-0, it would cause excessive
reclaim with the node-based LRU, unlike the zone-based one, which is bad too.

> 
> The short-sleep logic that kswapd uses to decide whether to go to sleep is
> short-circuited: it does not properly attempt the short sleep to check
> whether the high watermarks are quickly reached or not. Instead, it quickly
> fails the first attempt at sleep, re-enters balance_pgdat(), finds nothing
> to do and rechecks sleeping based on order-0 and classzone-0, which it can
> easily sleep for but which is *not* what kswapd was woken for in the first
> place.
> 
> For many allocation requests that initially woke kswapd, the impact is
> marginal: kswapd sleeps early and is woken again in the near future if
> there is a continual stream of allocations, with a risk that direct
> reclaim is required. While the motivation for the patch was that kcompactd
> is not woken up, the existing behaviour is simply wrong -- kswapd should
> be deciding to sleep based on the classzone it was woken for and, if
> possible, the order it was woken for, but the classzone is the more
> important of the two in the common case of order-0 allocations.

I agree, but I think it's rather risky to paper over the order-0
zone-balancing problem with the fix for the missed kcompactd wakeup, so at
the very least it should be documented.

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
  2017-02-23 15:01       ` Mel Gorman
@ 2017-03-01  9:04         ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-03-01  9:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, LKML,
	Linux-MM

On 02/23/2017 04:01 PM, Mel Gorman wrote:
> On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
>>> With this patch on top, all the latencies relative to the baseline are
>>> improved, particularly write latencies. The read latencies are still high
>>> for the number of threads but it's worth noting that this is mostly due
>>> to the IO scheduler and not directly related to reclaim. The vmstats are
>>> a bit of a mix but the relevant ones are as follows;
>>>
>>>                             4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
>>>                           mmots-20170209 clear-v1r25keepawake-v1r25
>>> Swap Ins                             0           0           0
>>> Swap Outs                            0         608           0
>>> Direct pages scanned           6910672     3132699     6357298
>>> Kswapd pages scanned          57036946    82488665    56986286
>>> Kswapd pages reclaimed        55993488    63474329    55939113
>>> Direct pages reclaimed         6905990     2964843     6352115
>>
>> These stats are confusing me. The earlier description suggests that this patch
>> should cause less direct reclaim and more kswapd reclaim, but compared to
>> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
>> considering direct + kswapd combined)
>>
> 
> The description is referring to the impact relative to the baseline. It
> is true that, relative to that patch, direct reclaim is higher, but there
> are a number of anomalies.
> 
> Note that kswapd is scanning very aggressively in "clear-v1" and overall
> efficiency is down to 76%. It's also not clear in the stats but in
> "clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
> due to the patch "mm, vmscan: fix zone balance check in
> prepare_kswapd_sleep". It's also doing a lot of writing of file-backed
> pages from reclaim context and some swapping due to the aggressiveness
> of the scan.
> 
> While direct reclaim activity might be lower, it's due to kswapd scanning
> aggressively and trying to reclaim the world, which is not the right thing
> to do.  With the patches applied, there is still direct reclaim, but the
> vast bulk of it occurs when the workload changes phase from "creating work
> files" to starting multiple threads that allocate a lot of anonymous
> memory -- a sudden spike in memory pressure that kswapd does not keep
> ahead of with multiple allocating threads.

Thanks for the explanation.

> 
>>> @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>>  	return sc.order;
>>>  }
>>>  
>>> +/*
>>> + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
>>> + * allocation request woke kswapd for. When kswapd has not woken recently,
>>> + * the value is MAX_NR_ZONES which is not a valid index. This compares a
>>> + * given classzone and returns it or the highest classzone index kswapd
>>> + * was recently woken for.
>>> + */
>>> +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
>>> +					   enum zone_type classzone_idx)
>>> +{
>>> +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
>>> +		return classzone_idx;
>>> +
>>> +	return max(pgdat->kswapd_classzone_idx, classzone_idx);
>>
>> A somewhat paranoid comment: this should probably read
>> pgdat->kswapd_classzone_idx into a local variable with READ_ONCE();
>> otherwise something can set it to MAX_NR_ZONES between the check and
>> max(), and the compiler can decide to re-read it. Probably not an issue
>> with current callers, but I'd rather future-proof it.
>>
> 
> I'm a little wary of adding READ_ONCE unless there is a definite
> problem. Even if it were an issue, I think it would be better to protect
> the kswapd_classzone_idx and kswapd_order fields with a spinlock that is
> taken when an update or a read is required, to fully guarantee the
> ordering.
> 
> The consequence as things stand is that kswapd may miss reclaiming at a
> higher order or classzone than it should have, although that is very
> unlikely, and the update and read are paired with a workqueue wake and a
> scheduler wakeup, which should be sufficient in terms of barriers.

OK then.
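
For the record, the READ_ONCE() variant I had in mind was roughly the
following (an untested sketch against this patch, simply snapshotting the
field once so the MAX_NR_ZONES check and max() see the same value):

static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
					   enum zone_type classzone_idx)
{
	enum zone_type curr_idx = READ_ONCE(pgdat->kswapd_classzone_idx);

	if (curr_idx == MAX_NR_ZONES)
		return classzone_idx;

	return max(curr_idx, classzone_idx);
}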

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2017-03-01  9:25 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-15  9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman
2017-02-15  9:22 ` Mel Gorman
2017-02-15  9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
2017-02-15  9:22   ` Mel Gorman
2017-02-16  2:50   ` Hillf Danton
2017-02-16  2:50     ` Hillf Danton
2017-02-22  7:00   ` Minchan Kim
2017-02-22  7:00     ` Minchan Kim
2017-02-23 15:05     ` Mel Gorman
2017-02-23 15:05       ` Mel Gorman
2017-02-24  1:17       ` Minchan Kim
2017-02-24  1:17         ` Minchan Kim
2017-02-24  9:11         ` Mel Gorman
2017-02-24  9:11           ` Mel Gorman
2017-02-27  6:16           ` Minchan Kim
2017-02-27  6:16             ` Minchan Kim
2017-02-15  9:22 ` [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced Mel Gorman
2017-02-15  9:22   ` Mel Gorman
2017-02-15  9:22 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman
2017-02-15  9:22   ` Mel Gorman
2017-02-16  6:23   ` Hillf Danton
2017-02-16  6:23     ` Hillf Danton
2017-02-16  8:10     ` Mel Gorman
2017-02-16  8:10       ` Mel Gorman
2017-02-16  8:21       ` Hillf Danton
2017-02-16  8:21         ` Hillf Danton
2017-02-16  9:32         ` Mel Gorman
2017-02-16  9:32           ` Mel Gorman
2017-02-20 16:34         ` Vlastimil Babka
2017-02-20 16:34           ` Vlastimil Babka
2017-02-21  4:10           ` Hillf Danton
2017-02-21  4:10             ` Hillf Danton
2017-02-20 16:42   ` Vlastimil Babka
2017-02-20 16:42     ` Vlastimil Babka
2017-02-23 15:01     ` Mel Gorman
2017-02-23 15:01       ` Mel Gorman
2017-03-01  9:04       ` Vlastimil Babka
2017-03-01  9:04         ` Vlastimil Babka
2017-02-15 20:30 ` [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Andrew Morton
2017-02-15 20:30   ` Andrew Morton
2017-02-15 21:29   ` Mel Gorman
2017-02-15 21:29     ` Mel Gorman
2017-02-15 21:56     ` Andrew Morton
2017-02-15 21:56       ` Andrew Morton
2017-02-15 22:15       ` Mel Gorman
2017-02-15 22:15         ` Mel Gorman
2017-02-15 22:00     ` Vlastimil Babka
2017-02-15 22:00       ` Vlastimil Babka
