* [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely
@ 2017-02-15  9:22 Mel Gorman
  2017-02-15  9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
  ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Mel Gorman @ 2017-02-15  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM, Mel Gorman

This patchset is based on mmots as of Feb 9th, 2017. The baseline is
important as there are a number of kswapd-related fixes in that tree, so a
comparison against v4.10-rc7 would be almost meaningless.

The series is unusual in that the first patch fixes one problem but
introduces a host of other issues and is incomplete. It was not developed
by me, but it appears to have been lost, so I picked it up and added to the
changelog. Patch 2 makes a minor modification that is worth considering on
its own but leaves the kernel in a state where it behaves badly. It is not
until patch 3 that there is an improvement against the baseline.

This was mostly motivated by examining Chris Mason's "simoop" benchmark,
which puts the VM under pressure similar to Hadoop. It has been reported
that the benchmark has regressed severely over the last several releases.
While I cannot reproduce all the problems Chris experienced due to
hardware limitations, there were a number of problems on a 2-socket
machine with a single disk.
                                 4.10.0-rc7             4.10.0-rc7
                             mmots-20170209        keepawake-v1r25
Amean    p50-Read       22325202.49 (  0.00%)  22092755.48 (  1.04%)
Amean    p95-Read       26102988.80 (  0.00%)  26101849.04 (  0.00%)
Amean    p99-Read       30935176.53 (  0.00%)  29746220.52 (  3.84%)
Amean    p50-Write           976.44 (  0.00%)       952.73 (  2.43%)
Amean    p95-Write         15471.29 (  0.00%)      3140.27 ( 79.70%)
Amean    p99-Write         35108.62 (  0.00%)      8843.73 ( 74.81%)
Amean    p50-Allocation    76382.61 (  0.00%)     76349.22 (  0.04%)
Amean    p95-Allocation   127777.39 (  0.00%)    108630.26 ( 14.98%)
Amean    p99-Allocation   187937.39 (  0.00%)    139094.26 ( 25.99%)

These are latencies. Read/write are threads reading fixed-size random
blocks from a simulated database. The allocation latency is mmapping and
faulting regions of memory. The p50, p95 and p99 rows report the worst
latency experienced by 50%, 95% and 99% of the samples respectively. For
example, the report indicates that while the test was running, 99% of
writes completed 74.81% faster.

It's worth noting that on a UMA machine no difference in simoop
performance was observed, so mileage will vary. On UMA there was a notable
difference in the "stutter" benchmark, which measures the latency of mmap
while large files are being copied. This has been used as a proxy measure
for desktop jitter while large amounts of IO take place.

                       4.10.0-rc7           4.10.0-rc7
                   mmots-20170209         keepawake-v1
Min       mmap      6.3847 (  0.00%)      5.9785 (  6.36%)
1st-qrtle mmap      7.6310 (  0.00%)      7.4086 (  2.91%)
2nd-qrtle mmap      9.9959 (  0.00%)      7.7052 ( 22.92%)
3rd-qrtle mmap     14.8180 (  0.00%)      8.5895 ( 42.03%)
Max-90%   mmap     15.8397 (  0.00%)     13.6974 ( 13.52%)
Max-93%   mmap     16.4268 (  0.00%)     14.3175 ( 12.84%)
Max-95%   mmap     18.3295 (  0.00%)     16.9233 (  7.67%)
Max-99%   mmap     24.2042 (  0.00%)     20.6182 ( 14.82%)
Max       mmap    255.0688 (  0.00%)    265.5818 ( -4.12%)
Mean      mmap     11.2192 (  0.00%)      9.1811 ( 18.17%)

Latency is measured in milliseconds and indicates that 99% of mmap
operations complete 14.82% faster and are 18.17% faster on average with
these patches applied.
Mel Gorman (2):
  mm, vmscan: Only clear pgdat congested/dirty/writeback state when
    balanced
  mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched
    classzone_idx

Shantanu Goel (1):
  mm, vmscan: fix zone balance check in prepare_kswapd_sleep

 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 128 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 72 insertions(+), 58 deletions(-)

--
2.11.0

^ permalink raw reply	[flat|nested] 25+ messages in thread
* [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-15  9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman
@ 2017-02-15  9:22 ` Mel Gorman
  2017-02-16  2:50   ` Hillf Danton
  2017-02-22  7:00   ` Minchan Kim
  2017-02-15  9:22 ` [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced Mel Gorman
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 25+ messages in thread
From: Mel Gorman @ 2017-02-15  9:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka,
	LKML, Linux-MM, Mel Gorman

From: Shantanu Goel <sgoel01@yahoo.com>

The check in prepare_kswapd_sleep needs to match the one in balance_pgdat
since the latter will return as soon as any one of the zones in the
classzone is above the watermark. This is especially important for
higher-order allocations since balance_pgdat will typically reset the
order to zero, relying on compaction to create the higher-order pages.
Without this patch, prepare_kswapd_sleep fails to wake up kcompactd since
the zone balance check fails.

On 4.9.7, kswapd fails to wake up kcompactd due to a mismatch in the zone
balance check between balance_pgdat() and prepare_kswapd_sleep().
balance_pgdat() returns as soon as a single zone satisfies the allocation
but prepare_kswapd_sleep() requires all zones to do the same. This causes
prepare_kswapd_sleep() to never succeed except in the order == 0 case and
consequently, wakeup_kcompactd() is never called.
On my machine prior to applying this patch, the state of compaction from
/proc/vmstat looked this way after a day and a half of uptime:

compact_migrate_scanned 240496
compact_free_scanned 76238632
compact_isolated 123472
compact_stall 1791
compact_fail 29
compact_success 1762
compact_daemon_wake 0

After applying the patch and about 10 hours of uptime, the state looks
like this:

compact_migrate_scanned 59927299
compact_free_scanned 2021075136
compact_isolated 640926
compact_stall 4
compact_fail 2
compact_success 2
compact_daemon_wake 5160

Further notes from Mel that motivated him to pick this patch up and
resend it:

It was observed for the simoop workload (which pressures the VM similarly
to Hadoop) that kswapd was failing to keep ahead of direct reclaim. The
investigation noted that there was a need to rationalise kswapd's
decisions to reclaim with its decisions to sleep. With this patch on a
2-socket box, there was a 43% reduction in direct reclaim scanning.

However, the impact otherwise is extremely negative. Kswapd reclaim
efficiency dropped from 98% to 76%. simoop has three latency-related
metrics for read, write and allocation (an anonymous mmap and fault).

                                 4.10.0-rc7             4.10.0-rc7
                             mmots-20170209            fixcheck-v1
Amean    p50-Read       22325202.49 (   0.00%)  20026926.55 (   10.29%)
Amean    p95-Read       26102988.80 (   0.00%)  27023360.00 (   -3.53%)
Amean    p99-Read       30935176.53 (   0.00%)  30994432.00 (   -0.19%)
Amean    p50-Write           976.44 (   0.00%)      1905.28 (  -95.12%)
Amean    p95-Write         15471.29 (   0.00%)     36210.09 ( -134.05%)
Amean    p99-Write         35108.62 (   0.00%)    479494.96 (-1265.75%)
Amean    p50-Allocation    76382.61 (   0.00%)     87603.20 (  -14.69%)
Amean    p95-Allocation   127777.39 (   0.00%)    244491.38 (  -91.34%)
Amean    p99-Allocation   187937.39 (   0.00%)   1745237.33 ( -828.63%)

There are also more allocation stalls. One of the largest impacts was due
to pages written back from kswapd context rising from 0 pages to 4516642
pages during the hour the workload ran for.
By and large, the patch has very bad behaviour, but it is easily missed
because the impact on a UMA machine is negligible. This patch is included
with the data in case a bisection leads to this area. This patch is also
a pre-requisite for the rest of the series.

Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..92fc66bd52bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3140,11 +3140,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		if (!managed_zone(zone))
 			continue;

-		if (!zone_balanced(zone, order, classzone_idx))
-			return false;
+		if (zone_balanced(zone, order, classzone_idx))
+			return true;
 	}

-	return true;
+	return false;
 }

 /*
--
2.11.0

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-15  9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
@ 2017-02-16  2:50   ` Hillf Danton
  2017-02-22  7:00   ` Minchan Kim
  1 sibling, 0 replies; 25+ messages in thread
From: Hillf Danton @ 2017-02-16  2:50 UTC (permalink / raw)
  To: 'Mel Gorman', 'Andrew Morton'
  Cc: 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner',
	'Vlastimil Babka', 'LKML', 'Linux-MM'

On February 15, 2017 5:23 PM Mel Gorman wrote:
>
> From: Shantanu Goel <sgoel01@yahoo.com>
>
> The check in prepare_kswapd_sleep needs to match the one in
> balance_pgdat since the latter will return as soon as any one of the
> zones in the classzone is above the watermark. This is especially
> important for higher-order allocations since balance_pgdat will
> typically reset the order to zero, relying on compaction to create the
> higher-order pages. Without this patch, prepare_kswapd_sleep fails to
> wake up kcompactd since the zone balance check fails.
>
> On 4.9.7, kswapd fails to wake up kcompactd due to a mismatch in the
> zone balance check between balance_pgdat() and prepare_kswapd_sleep().
> balance_pgdat() returns as soon as a single zone satisfies the
> allocation but prepare_kswapd_sleep() requires all zones to do the
> same. This causes prepare_kswapd_sleep() to never succeed except in the
> order == 0 case and consequently, wakeup_kcompactd() is never called.
>
> On my machine prior to applying this patch, the state of compaction
> from /proc/vmstat looked this way after a day and a half of uptime:
>
> compact_migrate_scanned 240496
> compact_free_scanned 76238632
> compact_isolated 123472
> compact_stall 1791
> compact_fail 29
> compact_success 1762
> compact_daemon_wake 0
>
> After applying the patch and about 10 hours of uptime, the state looks
> like this:
>
> compact_migrate_scanned 59927299
> compact_free_scanned 2021075136
> compact_isolated 640926
> compact_stall 4
> compact_fail 2
> compact_success 2
> compact_daemon_wake 5160
>
> Further notes from Mel that motivated him to pick this patch up and
> resend it:
>
> It was observed for the simoop workload (which pressures the VM
> similarly to Hadoop) that kswapd was failing to keep ahead of direct
> reclaim. The investigation noted that there was a need to rationalise
> kswapd's decisions to reclaim with its decisions to sleep. With this
> patch on a 2-socket box, there was a 43% reduction in direct reclaim
> scanning.
>
> However, the impact otherwise is extremely negative. Kswapd reclaim
> efficiency dropped from 98% to 76%. simoop has three latency-related
> metrics for read, write and allocation (an anonymous mmap and fault).
>
>                                  4.10.0-rc7             4.10.0-rc7
>                              mmots-20170209            fixcheck-v1
> Amean    p50-Read       22325202.49 (   0.00%)  20026926.55 (   10.29%)
> Amean    p95-Read       26102988.80 (   0.00%)  27023360.00 (   -3.53%)
> Amean    p99-Read       30935176.53 (   0.00%)  30994432.00 (   -0.19%)
> Amean    p50-Write           976.44 (   0.00%)      1905.28 (  -95.12%)
> Amean    p95-Write         15471.29 (   0.00%)     36210.09 ( -134.05%)
> Amean    p99-Write         35108.62 (   0.00%)    479494.96 (-1265.75%)
> Amean    p50-Allocation    76382.61 (   0.00%)     87603.20 (  -14.69%)
> Amean    p95-Allocation   127777.39 (   0.00%)    244491.38 (  -91.34%)
> Amean    p99-Allocation   187937.39 (   0.00%)   1745237.33 ( -828.63%)
>
> There are also more allocation stalls. One of the largest impacts was
> due to pages written back from kswapd context rising from 0 pages to
> 4516642 pages during the hour the workload ran for. By and large, the
> patch has very bad behaviour, but it is easily missed because the
> impact on a UMA machine is negligible.
>
> This patch is included with the data in case a bisection leads to this
> area. This patch is also a pre-requisite for the rest of the series.
>
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

> mm/vmscan.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..92fc66bd52bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3140,11 +3140,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> 		if (!managed_zone(zone))
> 			continue;
>
> -		if (!zone_balanced(zone, order, classzone_idx))
> -			return false;
> +		if (zone_balanced(zone, order, classzone_idx))
> +			return true;
> 	}
>
> -	return true;
> +	return false;
> }
>
> /*
> --
> 2.11.0

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-15  9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
  2017-02-16  2:50   ` Hillf Danton
@ 2017-02-22  7:00   ` Minchan Kim
  2017-02-23 15:05     ` Mel Gorman
  1 sibling, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2017-02-22  7:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

Hi,

On Wed, Feb 15, 2017 at 09:22:45AM +0000, Mel Gorman wrote:
> From: Shantanu Goel <sgoel01@yahoo.com>
>
> The check in prepare_kswapd_sleep needs to match the one in
> balance_pgdat since the latter will return as soon as any one of the
> zones in the classzone is above the watermark. This is especially
> important for higher-order allocations since balance_pgdat will
> typically reset the order to zero, relying on compaction to create the
> higher-order pages. Without this patch, prepare_kswapd_sleep fails to
> wake up kcompactd since the zone balance check fails.
>
> On 4.9.7, kswapd fails to wake up kcompactd due to a mismatch in the
> zone balance check between balance_pgdat() and prepare_kswapd_sleep().
> balance_pgdat() returns as soon as a single zone satisfies the
> allocation but prepare_kswapd_sleep() requires all zones to do the
> same. This causes prepare_kswapd_sleep() to never succeed except in the
> order == 0 case and consequently, wakeup_kcompactd() is never called.
>
> On my machine prior to applying this patch, the state of compaction
> from /proc/vmstat looked this way after a day and a half of uptime:
>
> compact_migrate_scanned 240496
> compact_free_scanned 76238632
> compact_isolated 123472
> compact_stall 1791
> compact_fail 29
> compact_success 1762
> compact_daemon_wake 0
>
> After applying the patch and about 10 hours of uptime, the state looks
> like this:
>
> compact_migrate_scanned 59927299
> compact_free_scanned 2021075136
> compact_isolated 640926
> compact_stall 4
> compact_fail 2
> compact_success 2
> compact_daemon_wake 5160
>
> Further notes from Mel that motivated him to pick this patch up and
> resend it:
>
> It was observed for the simoop workload (which pressures the VM
> similarly to Hadoop) that kswapd was failing to keep ahead of direct
> reclaim. The investigation noted that there was a need to rationalise
> kswapd's decisions to reclaim with its decisions to sleep. With this
> patch on a 2-socket box, there was a 43% reduction in direct reclaim
> scanning.
>
> However, the impact otherwise is extremely negative. Kswapd reclaim
> efficiency dropped from 98% to 76%. simoop has three latency-related
> metrics for read, write and allocation (an anonymous mmap and fault).
>
>                                  4.10.0-rc7             4.10.0-rc7
>                              mmots-20170209            fixcheck-v1
> Amean    p50-Read       22325202.49 (   0.00%)  20026926.55 (   10.29%)
> Amean    p95-Read       26102988.80 (   0.00%)  27023360.00 (   -3.53%)
> Amean    p99-Read       30935176.53 (   0.00%)  30994432.00 (   -0.19%)
> Amean    p50-Write           976.44 (   0.00%)      1905.28 (  -95.12%)
> Amean    p95-Write         15471.29 (   0.00%)     36210.09 ( -134.05%)
> Amean    p99-Write         35108.62 (   0.00%)    479494.96 (-1265.75%)
> Amean    p50-Allocation    76382.61 (   0.00%)     87603.20 (  -14.69%)
> Amean    p95-Allocation   127777.39 (   0.00%)    244491.38 (  -91.34%)
> Amean    p99-Allocation   187937.39 (   0.00%)   1745237.33 ( -828.63%)
>
> There are also more allocation stalls. One of the largest impacts was
> due to pages written back from kswapd context rising from 0 pages to
> 4516642 pages during the hour the workload ran for. By and large, the
> patch has very bad behaviour, but it is easily missed because the
> impact on a UMA machine is negligible.
>
> This patch is included with the data in case a bisection leads to this
> area. This patch is also a pre-requisite for the rest of the series.
>
> Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
short sleep point, where all eligible zones are balanced. What is the
correlation between them?

Can't we wake up kcompactd once we find a zone with enough free pages
above the high watermark, like this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 26c3b405ef34..f4f0ad0e9ede 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 	 * that pages and compaction may succeed so reset the cache.
 	 */
 	reset_isolation_suitable(pgdat);
-
-	/*
-	 * We have freed the memory, now we should compact it to make
-	 * allocation of the requested order possible.
-	 */
-	wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
-
 	remaining = schedule_timeout(HZ/10);

 	/*
@@ -3451,6 +3444,14 @@ static int kswapd(void *p)
 		bool ret;

 kswapd_try_sleep:
+	/*
+	 * We have freed the memory, now we should compact it to make
+	 * allocation of the requested order possible.
+	 */
+	if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
+				classzone_idx))
+		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+
 	kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
 				classzone_idx);
--
2.7.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-22  7:00   ` Minchan Kim
@ 2017-02-23 15:05     ` Mel Gorman
  2017-02-24  1:17       ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2017-02-23 15:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > There are also more allocation stalls. One of the largest impacts was
> > due to pages written back from kswapd context rising from 0 pages to
> > 4516642 pages during the hour the workload ran for. By and large, the
> > patch has very bad behaviour, but it is easily missed because the
> > impact on a UMA machine is negligible.
> >
> > This patch is included with the data in case a bisection leads to this
> > area. This patch is also a pre-requisite for the rest of the series.
> >
> > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's
> short sleep point, where all eligible zones are balanced. What is the
> correlation between them?
>

If kswapd is ready for a short sleep, eligible zones are balanced for
order-0 but not necessarily for the originally requested order if kswapd
gave up reclaiming because compaction was ready to start. As kswapd is
ready to sleep for a short period, it's a suitable time for kcompactd to
decide whether it should start working or not. There is no need for
kswapd to be aware of kcompactd's wakeup criteria.

> Can't we wake up kcompactd once we find a zone with enough free pages
> above the high watermark, like this?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 26c3b405ef34..f4f0ad0e9ede 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
>  	 * that pages and compaction may succeed so reset the cache.
>  	 */
>  	reset_isolation_suitable(pgdat);
> -
> -	/*
> -	 * We have freed the memory, now we should compact it to make
> -	 * allocation of the requested order possible.
> -	 */
> -	wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> -
>  	remaining = schedule_timeout(HZ/10);
>
>  	/*
> @@ -3451,6 +3444,14 @@ static int kswapd(void *p)
>  		bool ret;
>
>  kswapd_try_sleep:
> +	/*
> +	 * We have freed the memory, now we should compact it to make
> +	 * allocation of the requested order possible.
> +	 */
> +	if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
> +				classzone_idx))
> +		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> +
>  	kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>  				classzone_idx);

That's functionally very similar to what happens already. wakeup_kcompactd
checks the order and does not wake for order-0. It also makes its own
decisions, including zone_balanced, on whether it is safe to wake up.

I doubt there would be any measurable difference from a patch like this
and, to my mind at least, it does not improve the readability or flow of
the code.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-23 15:05     ` Mel Gorman
@ 2017-02-24  1:17       ` Minchan Kim
  2017-02-24  9:11         ` Mel Gorman
  0 siblings, 1 reply; 25+ messages in thread
From: Minchan Kim @ 2017-02-24  1:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

Hi Mel,

On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote:
> On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > > There are also more allocation stalls. One of the largest impacts
> > > was due to pages written back from kswapd context rising from 0
> > > pages to 4516642 pages during the hour the workload ran for. By and
> > > large, the patch has very bad behaviour, but it is easily missed
> > > because the impact on a UMA machine is negligible.
> > >
> > > This patch is included with the data in case a bisection leads to
> > > this area. This patch is also a pre-requisite for the rest of the
> > > series.
> > >
> > > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >
> > Hmm, I don't understand why we should bind wakeup_kcompactd to
> > kswapd's short sleep point, where all eligible zones are balanced.
> > What is the correlation between them?
> >
>
> If kswapd is ready for a short sleep, eligible zones are balanced for
> order-0 but not necessarily for the originally requested order if kswapd
> gave up reclaiming because compaction was ready to start. As kswapd is
> ready to sleep for a short period, it's a suitable time for kcompactd to
> decide whether it should start working or not. There is no need for
> kswapd to be aware of kcompactd's wakeup criteria.

If all eligible zones are balanced for order-0, I agree it's good timing,
because the high-order allocation success ratio would be higher since
kcompactd can compact all the eligible zones, not only the classzone.
However, this patch breaks that, as well as the long-standing kswapd
behaviour of continuing to balance the eligible zones for order-0.
Is it really okay now?

> > Can't we wake up kcompactd once we find a zone with enough free pages
> > above the high watermark, like this?
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 26c3b405ef34..f4f0ad0e9ede 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -3346,13 +3346,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
> >  	 * that pages and compaction may succeed so reset the cache.
> >  	 */
> >  	reset_isolation_suitable(pgdat);
> > -
> > -	/*
> > -	 * We have freed the memory, now we should compact it to make
> > -	 * allocation of the requested order possible.
> > -	 */
> > -	wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > -
> >  	remaining = schedule_timeout(HZ/10);
> >
> >  	/*
> > @@ -3451,6 +3444,14 @@ static int kswapd(void *p)
> >  		bool ret;
> >
> >  kswapd_try_sleep:
> > +	/*
> > +	 * We have freed the memory, now we should compact it to make
> > +	 * allocation of the requested order possible.
> > +	 */
> > +	if (alloc_order > 0 && zone_balanced(zone, reclaim_order,
> > +				classzone_idx))
> > +		wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
> > +
> >  	kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
> >  				classzone_idx);
>
> That's functionally very similar to what happens already. wakeup_kcompactd
> checks the order and does not wake for order-0. It also makes its own
> decisions, including zone_balanced, on whether it is safe to wake up.

Agree.

> I doubt there would be any measurable difference from a patch like this
> and, to my mind at least, it does not improve the readability or flow of
> the code.

However, my concern is premature kswapd sleep for order-0, which has been
the long-standing behaviour, so I hope the reason it is okay now can be
documented.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep
  2017-02-24  1:17       ` Minchan Kim
@ 2017-02-24  9:11         ` Mel Gorman
  2017-02-27  6:16           ` Minchan Kim
  0 siblings, 1 reply; 25+ messages in thread
From: Mel Gorman @ 2017-02-24  9:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner,
	Vlastimil Babka, LKML, Linux-MM

On Fri, Feb 24, 2017 at 10:17:06AM +0900, Minchan Kim wrote:
> Hi Mel,
>
> On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote:
> > On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote:
> > > > There are also more allocation stalls. One of the largest impacts
> > > > was due to pages written back from kswapd context rising from 0
> > > > pages to 4516642 pages during the hour the workload ran for. By
> > > > and large, the patch has very bad behaviour, but it is easily
> > > > missed because the impact on a UMA machine is negligible.
> > > >
> > > > This patch is included with the data in case a bisection leads to
> > > > this area. This patch is also a pre-requisite for the rest of the
> > > > series.
> > > >
> > > > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
> > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > >
> > > Hmm, I don't understand why we should bind wakeup_kcompactd to
> > > kswapd's short sleep point, where all eligible zones are balanced.
> > > What is the correlation between them?
> > >
> >
> > If kswapd is ready for a short sleep, eligible zones are balanced for
> > order-0 but not necessarily for the originally requested order if
> > kswapd gave up reclaiming because compaction was ready to start. As
> > kswapd is ready to sleep for a short period, it's a suitable time for
> > kcompactd to decide whether it should start working or not. There is
> > no need for kswapd to be aware of kcompactd's wakeup criteria.
>
> If all eligible zones are balanced for order-0, I agree it's good
> timing, because the high-order allocation success ratio would be higher
> since kcompactd can compact all the eligible zones, not only the
> classzone. However, this patch breaks that, as well as the
> long-standing kswapd behaviour of continuing to balance the eligible
> zones for order-0. Is it really okay now?
>

Reclaim stops in balance_pgdat() if any eligible zone for the requested
classzone is free. The initial sleep check for kswapd is very different
because it will only sleep if all zones are balanced for order-0, which
is a bad disconnect. The way node balancing works means there is no
guarantee at all that all zones will be balanced even if there is little
or no memory pressure, and one large zone in a node with multiple zones
can be balanced quickly.

The short-sleep logic that kswapd uses to decide whether to go to sleep
is shortcut: it does not properly attempt the short sleep to check
whether the high watermarks are quickly reached or not. Instead, it
quickly fails the first attempt at sleep, re-enters balance_pgdat(),
finds nothing to do and rechecks sleeping based on order-0, classzone 0,
which it can easily sleep for but is *not* what kswapd was woken for in
the first place.

For many allocation requests that initially woke kswapd, the impact is
marginal: kswapd sleeps early and is woken again in the near future if
there is a continual stream of allocations, with a risk that direct
reclaim is required. While the motivation for the patch was that
kcompactd is not woken up, the existing behaviour is just wrong --
kswapd should be deciding to sleep based on the classzone it was woken
for and, if possible, the order it was woken for, but the classzone is
more important in the common case of order-0 allocations.

--
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread
* Re: [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep 2017-02-24 9:11 ` Mel Gorman @ 2017-02-27 6:16 ` Minchan Kim 0 siblings, 0 replies; 25+ messages in thread From: Minchan Kim @ 2017-02-27 6:16 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM Hi Mel, On Fri, Feb 24, 2017 at 09:11:28AM +0000, Mel Gorman wrote: > On Fri, Feb 24, 2017 at 10:17:06AM +0900, Minchan Kim wrote: > > Hi Mel, > > > > On Thu, Feb 23, 2017 at 03:05:34PM +0000, Mel Gorman wrote: > > > On Wed, Feb 22, 2017 at 04:00:36PM +0900, Minchan Kim wrote: > > > > > There are also more allocation stalls. One of the largest impacts was due > > > > > to pages written back from kswapd context rising from 0 pages to 4516642 > > > > > pages during the hour the workload ran for. By and large, the patch has very > > > > > bad behaviour but easily missed as the impact on a UMA machine is negligible. > > > > > > > > > > This patch is included with the data in case a bisection leads to this area. > > > > > This patch is also a pre-requisite for the rest of the series. > > > > > > > > > > Signed-off-by: Shantanu Goel <sgoel01@yahoo.com> > > > > > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > > > > > > > > Hmm, I don't understand why we should bind wakeup_kcompactd to kswapd's > > > > short sleep point where every eligible zones are balanced. > > > > What's the correlation between them? > > > > > > > > > > If kswapd is ready for a short sleep, eligible zones are balanced for > > > order-0 but not necessarily the originally requested order if kswapd > > > gave up reclaiming as compaction was ready to start. As kswapd is ready > > > to sleep for a short period, it's a suitable time for kcompactd to decide > > > if it should start working or not. There is no need for kswapd to be aware > > > of kcompactd's wakeup criteria. 
> > > > If all eligible zones are balanced for order-0, I agree it's good timing > > because high-order alloc's ratio would be higher since kcompactd can compact > > eligible zones, not that only classzone. > > However, this patch breaks it as well as long time kswapd behavior which > > continues to balance eligible zones for order-0. > > Is it really okay now? > > > > Reclaim stops in balance_pgdat() if any eligible zone for the requested > classzone is free. The initial sleep for kswapd is very different because > it'll sleep if all zones are balanced for order-0 which is a bad disconnect. > The way node balancing works means there is no guarantee at all that all > zones will be balanced even if there is little or no memory pressure and > one large zone in a node with multiple zones can be balanced quickly. Indeed but it would tip toward direct relcaim more so it could make more failure for allocation relies on kswapd like atomic allocation However, if VM balance all of zones for order-0, it would make excessive reclaim with node-based LRU unlike zone-based, which is bad, too. > > The short-sleep logic that kswapd uses to decide whether to go to sleep > is shortcut and it does not properly try the short sleep checking if the > high watermarks are quickly reached or not. Instead, it quickly fails the > first attempt at sleep, reenters balance_pgdat(), finds nothing to do and > rechecks sleeping based on order-0, classzone-0 which it can easily sleep > for but is *not* what kswapd was woken for in the first place. > > For many allocation requests that initially woke kswapd, the impact is > marginal. kswapd sleeps early and is woken in the near future if there > is a continual stream of allocations with a risk that direct reclaim is > required. 
While the motivation for the patch was that kcompactd is not woken > up, the existing behaviour is just wrong -- kswapd should be deciding to > sleep based on the classzone it was woken for and, if possible, the order > it was woken for, but the classzone is more important in the common case > for order-0 allocations. I agree, but I think it's rather risky to paper over the order-0 zone-balancing problem while fixing the missed kcompactd wakeup, so at the very least it should be documented. Thanks. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced 2017-02-15 9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman 2017-02-15 9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman @ 2017-02-15 9:22 ` Mel Gorman 2017-02-15 9:22 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman 2017-02-15 20:30 ` [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Andrew Morton 3 siblings, 0 replies; 25+ messages in thread From: Mel Gorman @ 2017-02-15 9:22 UTC (permalink / raw) To: Andrew Morton Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman A pgdat tracks whether recent reclaim encountered too many dirty, writeback or congested pages. The flags control whether kswapd writes pages back from reclaim context, whether pages are tagged for immediate reclaim when IO completes, whether processes block on wait_iff_congested() and whether kswapd blocks when too many pages marked for immediate reclaim are encountered. The state is cleared in a check function with side-effects. With the patch "mm, vmscan: fix zone balance check in prepare_kswapd_sleep", the timing of when the bits get cleared changed. Due to the way the check works, it'll clear the bits if ZONE_DMA is balanced for a GFP_DMA allocation because it does not account for lowmem reserves properly. For the simoop workload, the premature clearing means kswapd does not stall when it should, writes pages from reclaim context like crazy and is generally unhelpful. This patch resets the pgdat bits related to page reclaim only when kswapd is going to sleep. 
The comparison with simoop is then

                                4.10.0-rc7             4.10.0-rc7             4.10.0-rc7
                            mmots-20170209            fixcheck-v1               clear-v1
Amean  p50-Read        22325202.49 (  0.00%)  20026926.55 ( 10.29%)  19491134.58 ( 12.69%)
Amean  p95-Read        26102988.80 (  0.00%)  27023360.00 ( -3.53%)  24294195.20 (  6.93%)
Amean  p99-Read        30935176.53 (  0.00%)  30994432.00 ( -0.19%)  30397053.16 (  1.74%)
Amean  p50-Write            976.44 (  0.00%)      1905.28 (-95.12%)      1077.22 (-10.32%)
Amean  p95-Write          15471.29 (  0.00%)    36210.09 (-134.05%)    36419.56 (-135.40%)
Amean  p99-Write          35108.62 (  0.00%)  479494.96 (-1265.75%)  102000.36 (-190.53%)
Amean  p50-Allocation     76382.61 (  0.00%)     87603.20 (-14.69%)     87485.22 (-14.54%)
Amean  p95-Allocation    127777.39 (  0.00%)    244491.38 (-91.34%)    204588.52 (-60.11%)
Amean  p99-Allocation    187937.39 (  0.00%)   1745237.33 (-828.63%)   631657.74 (-236.10%)

Read latency is improved although write and allocation latency are impacted. Even with the patch, kswapd is still reclaiming inefficiently, pages are being written back from reclaim context and there are a host of other issues. However, given the change, it needed to be spelled out why the side-effect was moved. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- mm/vmscan.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 92fc66bd52bc..b47b430ca7ea 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3097,17 +3097,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx) if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) return false; - /* - * If any eligible zone is balanced then the node is not considered - * to be congested or dirty - */ - clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags); - clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags); - clear_bit(PGDAT_WRITEBACK, &zone->zone_pgdat->flags); - return true; } +/* Clear pgdat state for congested, dirty or under writeback. 
*/ +static void clear_pgdat_congested(pg_data_t *pgdat) +{ + clear_bit(PGDAT_CONGESTED, &pgdat->flags); + clear_bit(PGDAT_DIRTY, &pgdat->flags); + clear_bit(PGDAT_WRITEBACK, &pgdat->flags); +} + /* * Prepare kswapd for sleeping. This verifies that there are no processes * waiting in throttle_direct_reclaim() and that watermarks have been met. @@ -3140,8 +3140,10 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) if (!managed_zone(zone)) continue; - if (zone_balanced(zone, order, classzone_idx)) + if (zone_balanced(zone, order, classzone_idx)) { + clear_pgdat_congested(pgdat); return true; + } } return false; -- 2.11.0 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-15 9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman 2017-02-15 9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman 2017-02-15 9:22 ` [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced Mel Gorman @ 2017-02-15 9:22 ` Mel Gorman 2017-02-16 6:23 ` Hillf Danton 2017-02-20 16:42 ` Vlastimil Babka 2017-02-15 20:30 ` [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Andrew Morton 3 siblings, 2 replies; 25+ messages in thread From: Mel Gorman @ 2017-02-15 9:22 UTC (permalink / raw) To: Andrew Morton Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks what zone it was recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0. However, the decision on whether to sleep is based on kswapd_classzone_idx, which is 0 without a recent wakeup request, and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again in the near future, there are two consequences -- the pgdat bits that control congestion are cleared prematurely and direct reclaim is more likely as kswapd slept prematurely. This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there have been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). 
It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results. Otherwise this patch looks artificially excellent.

                                4.10.0-rc7             4.10.0-rc7             4.10.0-rc7
                            mmots-20170209            clear-v1r25        keepawake-v1r25
Amean  p50-Read        22325202.49 (  0.00%)  19491134.58 ( 12.69%)  22092755.48 (  1.04%)
Amean  p95-Read        26102988.80 (  0.00%)  24294195.20 (  6.93%)  26101849.04 (  0.00%)
Amean  p99-Read        30935176.53 (  0.00%)  30397053.16 (  1.74%)  29746220.52 (  3.84%)
Amean  p50-Write            976.44 (  0.00%)      1077.22 (-10.32%)       952.73 (  2.43%)
Amean  p95-Write          15471.29 (  0.00%)    36419.56 (-135.40%)      3140.27 ( 79.70%)
Amean  p99-Write          35108.62 (  0.00%)   102000.36 (-190.53%)      8843.73 ( 74.81%)
Amean  p50-Allocation     76382.61 (  0.00%)     87485.22 (-14.54%)     76349.22 (  0.04%)
Amean  p95-Allocation    127777.39 (  0.00%)    204588.52 (-60.11%)    108630.26 ( 14.98%)
Amean  p99-Allocation    187937.39 (  0.00%)   631657.74 (-236.10%)    139094.26 ( 25.99%)

With this patch on top, all the latencies relative to the baseline are improved, particularly write latencies. The read latencies are still high for the number of threads but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim. 
The vmstats are a bit of a mix but the relevant ones are as follows:

                              4.10.0-rc7     4.10.0-rc7      4.10.0-rc7
                          mmots-20170209    clear-v1r25 keepawake-v1r25
Swap Ins                               0              0               0
Swap Outs                              0            608               0
Direct pages scanned             6910672        3132699         6357298
Kswapd pages scanned            57036946       82488665        56986286
Kswapd pages reclaimed          55993488       63474329        55939113
Direct pages reclaimed           6905990        2964843         6352115
Kswapd efficiency                    98%            76%             98%
Kswapd velocity                12494.375      17597.507       12488.065
Direct efficiency                    99%            94%             99%
Direct velocity                 1513.835        668.306        1393.148
Page writes by reclaim             0.000    4410243.000           0.000
Page writes file                       0        4409635               0
Page writes anon                       0            608               0
Page reclaim immediate           1036792       14175203         1042571

Swap-outs are equivalent to the baseline.

Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when the workload switches from preparing the files to running the actual test. It's a lot of file IO followed by a lot of allocs that reclaim heavily for a brief window. After that, direct reclaim is intermittent when the workload spawns a number of threads periodically to do work. kswapd simply cannot wake and reclaim fast enough between the low and min watermarks. It could be mitigated using vm.watermark_scale_factor but not through special tricks in kswapd.

Page writes from reclaim context are at 0, which is the ideal.

Pages immediately reclaimed after IO completes are back at the baseline.

On UMA, there is almost no change so this is not expected to be a universal win. 
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- mm/memory_hotplug.c | 2 +- mm/vmscan.c | 118 +++++++++++++++++++++++++++++----------------------- 2 files changed, 66 insertions(+), 54 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 11581f4cfbb4..481aebc91782 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1209,7 +1209,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start) /* Reset the nr_zones, order and classzone_idx before reuse */ pgdat->nr_zones = 0; pgdat->kswapd_order = 0; - pgdat->kswapd_classzone_idx = 0; + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; } /* we can use NODE_DATA(nid) from here */ diff --git a/mm/vmscan.c b/mm/vmscan.c index b47b430ca7ea..3bd1ddee09cd 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3090,14 +3090,36 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } -static bool zone_balanced(struct zone *zone, int order, int classzone_idx) +/* + * Returns true if there is an eligible zone balanced for the request order + * and classzone_idx + */ +static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) { - unsigned long mark = high_wmark_pages(zone); + int i; + unsigned long mark = -1; + struct zone *zone; - if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) - return false; + for (i = 0; i <= classzone_idx; i++) { + zone = pgdat->node_zones + i; - return true; + if (!managed_zone(zone)) + continue; + + mark = high_wmark_pages(zone); + if (zone_watermark_ok_safe(zone, order, mark, classzone_idx)) + return true; + } + + /* + * If a node has no populated zone within classzone_idx, it does not + * need balancing by definition. This can happen if a zone-restricted + * allocation tries to wake a remote kswapd. + */ + if (mark == -1) + return true; + + return false; } /* Clear pgdat state for congested, dirty or under writeback. 
*/ @@ -3116,8 +3138,6 @@ static void clear_pgdat_congested(pg_data_t *pgdat) */ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) { - int i; - /* * The throttled processes are normally woken up in balance_pgdat() as * soon as pfmemalloc_watermark_ok() is true. But there is a potential @@ -3134,16 +3154,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx) if (waitqueue_active(&pgdat->pfmemalloc_wait)) wake_up_all(&pgdat->pfmemalloc_wait); - for (i = 0; i <= classzone_idx; i++) { - struct zone *zone = pgdat->node_zones + i; - - if (!managed_zone(zone)) - continue; - - if (zone_balanced(zone, order, classzone_idx)) { - clear_pgdat_congested(pgdat); - return true; - } + if (pgdat_balanced(pgdat, order, classzone_idx)) { + clear_pgdat_congested(pgdat); + return true; } return false; @@ -3249,23 +3262,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) } /* - * Only reclaim if there are no eligible zones. Check from - * high to low zone as allocations prefer higher zones. - * Scanning from low to high zone would allow congestion to be - * cleared during a very small window when a small low - * zone was balanced even under extreme pressure when the - * overall node may be congested. Note that sc.reclaim_idx - * is not used as buffer_heads_over_limit may have adjusted - * it. + * Only reclaim if there are no eligible zones. Note that + * sc.reclaim_idx is not used as buffer_heads_over_limit may + * have adjusted it. 
*/ - for (i = classzone_idx; i >= 0; i--) { - zone = pgdat->node_zones + i; - if (!managed_zone(zone)) - continue; - - if (zone_balanced(zone, sc.order, classzone_idx)) - goto out; - } + if (pgdat_balanced(pgdat, sc.order, classzone_idx)) + goto out; /* * Do some background aging of the anon list, to give @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) return sc.order; } +/* + * pgdat->kswapd_classzone_idx is the highest zone index that a recent + * allocation request woke kswapd for. When kswapd has not woken recently, + * the value is MAX_NR_ZONES which is not a valid index. This compares a + * given classzone and returns it or the highest classzone index kswapd + * was recently woke for. + */ +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat, + enum zone_type classzone_idx) +{ + if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES) + return classzone_idx; + + return max(pgdat->kswapd_classzone_idx, classzone_idx); +} + static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order, unsigned int classzone_idx) { @@ -3363,7 +3381,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o * the previous request that slept prematurely. 
*/ if (remaining) { - pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx); + pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order); } @@ -3417,7 +3435,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o */ static int kswapd(void *p) { - unsigned int alloc_order, reclaim_order, classzone_idx; + unsigned int alloc_order, reclaim_order; + unsigned int classzone_idx = MAX_NR_ZONES - 1; pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; @@ -3447,20 +3466,23 @@ static int kswapd(void *p) tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; set_freezable(); - pgdat->kswapd_order = alloc_order = reclaim_order = 0; - pgdat->kswapd_classzone_idx = classzone_idx = 0; + pgdat->kswapd_order = 0; + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; for ( ; ; ) { bool ret; + alloc_order = reclaim_order = pgdat->kswapd_order; + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); + kswapd_try_sleep: kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, classzone_idx); /* Read the new order and classzone_idx */ alloc_order = reclaim_order = pgdat->kswapd_order; - classzone_idx = pgdat->kswapd_classzone_idx; + classzone_idx = kswapd_classzone_idx(pgdat, 0); pgdat->kswapd_order = 0; - pgdat->kswapd_classzone_idx = 0; + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; ret = try_to_freeze(); if (kthread_should_stop()) @@ -3486,9 +3508,6 @@ static int kswapd(void *p) reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); if (reclaim_order < alloc_order) goto kswapd_try_sleep; - - alloc_order = reclaim_order = pgdat->kswapd_order; - classzone_idx = pgdat->kswapd_classzone_idx; } tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD); @@ -3504,7 +3523,6 @@ static int kswapd(void *p) void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) { pg_data_t *pgdat; - int z; if (!managed_zone(zone)) return; 
@@ -3512,22 +3530,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL)) return; pgdat = zone->zone_pgdat; - pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx); + pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); pgdat->kswapd_order = max(pgdat->kswapd_order, order); if (!waitqueue_active(&pgdat->kswapd_wait)) return; /* Only wake kswapd if all zones are unbalanced */ - for (z = 0; z <= classzone_idx; z++) { - zone = pgdat->node_zones + z; - if (!managed_zone(zone)) - continue; - - if (zone_balanced(zone, order, classzone_idx)) - return; - } + if (pgdat_balanced(pgdat, order, classzone_idx)) + return; - trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order); wake_up_interruptible(&pgdat->kswapd_wait); } -- 2.11.0 ^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-15 9:22 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman @ 2017-02-16 6:23 ` Hillf Danton 2017-02-16 8:10 ` Mel Gorman 2017-02-20 16:42 ` Vlastimil Babka 1 sibling, 1 reply; 25+ messages in thread From: Hillf Danton @ 2017-02-16 6:23 UTC (permalink / raw) To: 'Mel Gorman', 'Andrew Morton' Cc: 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner', 'Vlastimil Babka', 'LKML', 'Linux-MM' On February 15, 2017 5:23 PM Mel Gorman wrote: > */ > static int kswapd(void *p) > { > - unsigned int alloc_order, reclaim_order, classzone_idx; > + unsigned int alloc_order, reclaim_order; > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > pg_data_t *pgdat = (pg_data_t*)p; > struct task_struct *tsk = current; > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > set_freezable(); > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > + pgdat->kswapd_order = 0; > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > for ( ; ; ) { > bool ret; > > + alloc_order = reclaim_order = pgdat->kswapd_order; > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); > + > kswapd_try_sleep: > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > classzone_idx); > > /* Read the new order and classzone_idx */ > alloc_order = reclaim_order = pgdat->kswapd_order; > - classzone_idx = pgdat->kswapd_classzone_idx; > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > pgdat->kswapd_order = 0; > - pgdat->kswapd_classzone_idx = 0; > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > ret = try_to_freeze(); > if (kthread_should_stop()) > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) > reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); > if (reclaim_order < alloc_order) > goto kswapd_try_sleep; If we fail order-5 
request, can we then give up order-5, and try order-3 if requested, after napping? > - > - alloc_order = reclaim_order = pgdat->kswapd_order; > - classzone_idx = pgdat->kswapd_classzone_idx; > } > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-16 6:23 ` Hillf Danton @ 2017-02-16 8:10 ` Mel Gorman 2017-02-16 8:21 ` Hillf Danton 0 siblings, 1 reply; 25+ messages in thread From: Mel Gorman @ 2017-02-16 8:10 UTC (permalink / raw) To: Hillf Danton Cc: 'Andrew Morton', 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner', 'Vlastimil Babka', 'LKML', 'Linux-MM' On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: > On February 15, 2017 5:23 PM Mel Gorman wrote: > > */ > > static int kswapd(void *p) > > { > > - unsigned int alloc_order, reclaim_order, classzone_idx; > > + unsigned int alloc_order, reclaim_order; > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > > pg_data_t *pgdat = (pg_data_t*)p; > > struct task_struct *tsk = current; > > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > set_freezable(); > > > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > > + pgdat->kswapd_order = 0; > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > for ( ; ; ) { > > bool ret; > > > > + alloc_order = reclaim_order = pgdat->kswapd_order; > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); > > + > > kswapd_try_sleep: > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > > classzone_idx); > > > > /* Read the new order and classzone_idx */ > > alloc_order = reclaim_order = pgdat->kswapd_order; > > - classzone_idx = pgdat->kswapd_classzone_idx; > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > > pgdat->kswapd_order = 0; > > - pgdat->kswapd_classzone_idx = 0; > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > ret = try_to_freeze(); > > if (kthread_should_stop()) > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) > > reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); > > if (reclaim_order < alloc_order) > > goto kswapd_try_sleep; > > 
If we fail order-5 request, can we then give up order-5, and > try order-3 if requested, after napping? > That has no bearing upon this patch. At this point, kswapd has stopped reclaiming at the requested order and is preparing to sleep. If there is a parallel request for order-3 while it's sleeping, it'll wake and start reclaiming at order-3 as requested. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-16 8:10 ` Mel Gorman @ 2017-02-16 8:21 ` Hillf Danton 2017-02-16 9:32 ` Mel Gorman 2017-02-20 16:34 ` Vlastimil Babka 0 siblings, 2 replies; 25+ messages in thread From: Hillf Danton @ 2017-02-16 8:21 UTC (permalink / raw) To: 'Mel Gorman' Cc: 'Andrew Morton', 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner', 'Vlastimil Babka', 'LKML', 'Linux-MM' On February 16, 2017 4:11 PM Mel Gorman wrote: > On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: > > On February 15, 2017 5:23 PM Mel Gorman wrote: > > > */ > > > static int kswapd(void *p) > > > { > > > - unsigned int alloc_order, reclaim_order, classzone_idx; > > > + unsigned int alloc_order, reclaim_order; > > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > > > pg_data_t *pgdat = (pg_data_t*)p; > > > struct task_struct *tsk = current; > > > > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > > set_freezable(); > > > > > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > > > + pgdat->kswapd_order = 0; > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > for ( ; ; ) { > > > bool ret; > > > > > > + alloc_order = reclaim_order = pgdat->kswapd_order; > > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); > > > + > > > kswapd_try_sleep: > > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > > > classzone_idx); > > > > > > /* Read the new order and classzone_idx */ > > > alloc_order = reclaim_order = pgdat->kswapd_order; > > > - classzone_idx = pgdat->kswapd_classzone_idx; > > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > > > pgdat->kswapd_order = 0; > > > - pgdat->kswapd_classzone_idx = 0; > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > > > ret = try_to_freeze(); > > > if (kthread_should_stop()) > > > @@ -3486,9 +3508,6 @@ 
static int kswapd(void *p) > > > reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); > > > if (reclaim_order < alloc_order) > > > goto kswapd_try_sleep; > > > > If we fail order-5 request, can we then give up order-5, and > > try order-3 if requested, after napping? > > > > That has no bearing upon this patch. At this point, kswapd has stopped > reclaiming at the requested order and is preparing to sleep. If there is > a parallel request for order-3 while it's sleeping, it'll wake and start > reclaiming at order-3 as requested. > Right, but the order-3 request can also come up while kswapd is active and gives up order-5. thanks Hillf ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-16 8:21 ` Hillf Danton @ 2017-02-16 9:32 ` Mel Gorman 2017-02-20 16:34 ` Vlastimil Babka 1 sibling, 0 replies; 25+ messages in thread From: Mel Gorman @ 2017-02-16 9:32 UTC (permalink / raw) To: Hillf Danton Cc: 'Andrew Morton', 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner', 'Vlastimil Babka', 'LKML', 'Linux-MM' On Thu, Feb 16, 2017 at 04:21:04PM +0800, Hillf Danton wrote: > > On February 16, 2017 4:11 PM Mel Gorman wrote: > > On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: > > > On February 15, 2017 5:23 PM Mel Gorman wrote: > > > > */ > > > > static int kswapd(void *p) > > > > { > > > > - unsigned int alloc_order, reclaim_order, classzone_idx; > > > > + unsigned int alloc_order, reclaim_order; > > > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > > > > pg_data_t *pgdat = (pg_data_t*)p; > > > > struct task_struct *tsk = current; > > > > > > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > > > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > > > set_freezable(); > > > > > > > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > > > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > > > > + pgdat->kswapd_order = 0; > > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > for ( ; ; ) { > > > > bool ret; > > > > > > > > + alloc_order = reclaim_order = pgdat->kswapd_order; > > > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); > > > > + > > > > kswapd_try_sleep: > > > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > > > > classzone_idx); > > > > > > > > /* Read the new order and classzone_idx */ > > > > alloc_order = reclaim_order = pgdat->kswapd_order; > > > > - classzone_idx = pgdat->kswapd_classzone_idx; > > > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > > > > pgdat->kswapd_order = 0; > > > > - pgdat->kswapd_classzone_idx = 0; > > > > + pgdat->kswapd_classzone_idx = 
MAX_NR_ZONES; > > > > > > > > ret = try_to_freeze(); > > > > if (kthread_should_stop()) > > > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) > > > > reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); > > > > if (reclaim_order < alloc_order) > > > > goto kswapd_try_sleep; > > > > > > If we fail order-5 request, can we then give up order-5, and > > > try order-3 if requested, after napping? > > > > > > > That has no bearing upon this patch. At this point, kswapd has stopped > > reclaiming at the requested order and is preparing to sleep. If there is > > a parallel request for order-3 while it's sleeping, it'll wake and start > > reclaiming at order-3 as requested. > > > > Right, but the order-3 request can also come up while kswapd is active and > gives up order-5. > And then it'll be in pgdat->kswapd_order and be picked up on the next wakeup. It won't be immediate but it's also unlikely to be worth picking up immediately. The context here is that a high-order reclaim request failed and, rather than keeping kswapd awake reclaiming the world, it goes to sleep until another wakeup request comes in. Staying awake continually for high orders caused problems with excessive reclaim in the past. It could be revisited again but it's not related to what this patch is aimed at -- avoiding kswapd going to sleep because ZONE_DMA is balanced for a GFP_DMA request which is nowhere in the request stream. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-16 8:21 ` Hillf Danton 2017-02-16 9:32 ` Mel Gorman @ 2017-02-20 16:34 ` Vlastimil Babka 2017-02-21 4:10 ` Hillf Danton 1 sibling, 1 reply; 25+ messages in thread From: Vlastimil Babka @ 2017-02-20 16:34 UTC (permalink / raw) To: Hillf Danton, 'Mel Gorman' Cc: 'Andrew Morton', 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner', 'LKML', 'Linux-MM' On 02/16/2017 09:21 AM, Hillf Danton wrote: > > On February 16, 2017 4:11 PM Mel Gorman wrote: >> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: >> > On February 15, 2017 5:23 PM Mel Gorman wrote: >> > > */ >> > > static int kswapd(void *p) >> > > { >> > > - unsigned int alloc_order, reclaim_order, classzone_idx; >> > > + unsigned int alloc_order, reclaim_order; >> > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; >> > > pg_data_t *pgdat = (pg_data_t*)p; >> > > struct task_struct *tsk = current; >> > > >> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) >> > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; >> > > set_freezable(); >> > > >> > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; >> > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; >> > > + pgdat->kswapd_order = 0; >> > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; >> > > for ( ; ; ) { >> > > bool ret; >> > > >> > > + alloc_order = reclaim_order = pgdat->kswapd_order; >> > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); >> > > + >> > > kswapd_try_sleep: >> > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, >> > > classzone_idx); >> > > >> > > /* Read the new order and classzone_idx */ >> > > alloc_order = reclaim_order = pgdat->kswapd_order; >> > > - classzone_idx = pgdat->kswapd_classzone_idx; >> > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); >> > > pgdat->kswapd_order = 0; >> > > - pgdat->kswapd_classzone_idx = 0; >> > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; >> > > >> > > 
ret = try_to_freeze(); >> > > if (kthread_should_stop()) >> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) >> > > reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx); >> > > if (reclaim_order < alloc_order) >> > > goto kswapd_try_sleep; >> > >> > If we fail order-5 request, can we then give up order-5, and >> > try order-3 if requested, after napping? >> > >> >> That has no bearing upon this patch. At this point, kswapd has stopped >> reclaiming at the requested order and is preparing to sleep. If there is >> a parallel request for order-3 while it's sleeping, it'll wake and start >> reclaiming at order-3 as requested. >> > Right, but the order-3 request can also come up while kswapd is active and > gives up order-5. "Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming order-0 watermarks are OK) and wake up kcompactd for order-5. There's no way kswapd could help an order-3 allocation at that point - it's up to kcompactd. > thanks > Hillf > ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-20 16:34 ` Vlastimil Babka @ 2017-02-21 4:10 ` Hillf Danton 0 siblings, 0 replies; 25+ messages in thread From: Hillf Danton @ 2017-02-21 4:10 UTC (permalink / raw) To: 'Vlastimil Babka', 'Mel Gorman' Cc: 'Andrew Morton', 'Shantanu Goel', 'Chris Mason', 'Johannes Weiner', 'LKML', 'Linux-MM' On February 21, 2017 12:34 AM Vlastimil Babka wrote: > On 02/16/2017 09:21 AM, Hillf Danton wrote: > > Right, but the order-3 request can also come up while kswapd is active and > > gives up order-5. > > "Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming > order-0 watermarks are OK) and wakeup kcompactd for order-5. There's no way how > kswapd could help an order-3 allocation at that point - it's up to kcompactd. > cpu0 cpu1 give up order-5 fall back to order-0 wake up kswapd for order-3 wake up kswapd for order-5 fall in sleep wake up kswapd for order-3 In that sequence, what order would we try? It is order-5 in the patch. Given that the world is fresh after napping, whether a tenth of a second or 3 minutes, IMHO we are free to select any order and go another round of reclaiming pages. thanks Hillf ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-15 9:22 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman 2017-02-16 6:23 ` Hillf Danton @ 2017-02-20 16:42 ` Vlastimil Babka 2017-02-23 15:01 ` Mel Gorman 1 sibling, 1 reply; 25+ messages in thread From: Vlastimil Babka @ 2017-02-20 16:42 UTC (permalink / raw) To: Mel Gorman, Andrew Morton Cc: Shantanu Goel, Chris Mason, Johannes Weiner, LKML, Linux-MM

On 02/15/2017 10:22 AM, Mel Gorman wrote:
> kswapd is woken to reclaim a node based on a failed allocation request
> from any eligible zone. Once reclaiming in balance_pgdat(), it will
> continue reclaiming until there is an eligible zone available for the
> zone it was woken for. kswapd tracks what zone it was recently woken for
> in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
> zone will be 0.
>
> However, the decision on whether to sleep is made on kswapd_classzone_idx
> which is 0 without a recent wakeup request and that classzone does not
> account for lowmem reserves. This allows kswapd to sleep when a small
> low zone such as ZONE_DMA is balanced for a GFP_DMA request even if
> a stream of allocations cannot use that zone. While kswapd may be woken
> again in the near future there are two consequences -- the pgdat
> bits that control congestion are cleared prematurely and direct reclaim
> is more likely as kswapd slept prematurely.
>
> This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
> index) when there have been no recent wakeups. If there are no wakeups,
> it'll decide whether to sleep based on the highest possible zone available
> (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
> decisions during reclaim and when deciding to sleep are the same. If there is
> a mismatch, kswapd can stay awake continually trying to balance tiny zones.
>
> simoop was used to evaluate it again. Two of the preparation patches regressed
> the workload so they are included as the second set of results. Otherwise
> this patch looks artificially excellent
>
>                                  4.10.0-rc7            4.10.0-rc7            4.10.0-rc7
>                              mmots-20170209           clear-v1r25       keepawake-v1r25
> Amean  p50-Read         22325202.49 (  0.00%) 19491134.58 ( 12.69%) 22092755.48 (  1.04%)
> Amean  p95-Read         26102988.80 (  0.00%) 24294195.20 (  6.93%) 26101849.04 (  0.00%)
> Amean  p99-Read         30935176.53 (  0.00%) 30397053.16 (  1.74%) 29746220.52 (  3.84%)
> Amean  p50-Write             976.44 (  0.00%)    1077.22 ( -10.32%)     952.73 (  2.43%)
> Amean  p95-Write           15471.29 (  0.00%)   36419.56 (-135.40%)    3140.27 ( 79.70%)
> Amean  p99-Write           35108.62 (  0.00%)  102000.36 (-190.53%)    8843.73 ( 74.81%)
> Amean  p50-Allocation      76382.61 (  0.00%)   87485.22 ( -14.54%)   76349.22 (  0.04%)
> Amean  p95-Allocation     127777.39 (  0.00%)  204588.52 ( -60.11%)  108630.26 ( 14.98%)
> Amean  p99-Allocation     187937.39 (  0.00%)  631657.74 (-236.10%)  139094.26 ( 25.99%)
>
> With this patch on top, all the latencies relative to the baseline are
> improved, particularly write latencies. The read latencies are still high
> for the number of threads but it's worth noting that this is mostly due
> to the IO scheduler and not directly related to reclaim. The vmstats are
> a bit of a mix but the relevant ones are as follows;
>
>                                 4.10.0-rc7     4.10.0-rc7      4.10.0-rc7
>                             mmots-20170209    clear-v1r25 keepawake-v1r25
> Swap Ins                                 0              0               0
> Swap Outs                                0            608               0
> Direct pages scanned               6910672        3132699         6357298
> Kswapd pages scanned              57036946       82488665        56986286
> Kswapd pages reclaimed            55993488       63474329        55939113
> Direct pages reclaimed             6905990        2964843         6352115

These stats are confusing me. The earlier description suggests that this patch
should cause less direct reclaim and more kswapd reclaim, but compared to
"clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then?
(when considering direct + kswapd combined)

> Kswapd   efficiency                    98%            76%             98%
> Kswapd   velocity                12494.375      17597.507       12488.065
> Direct   efficiency                    99%            94%             99%
> Direct   velocity                 1513.835        668.306        1393.148
> Page writes by reclaim               0.000    4410243.000           0.000
> Page writes file                         0        4409635               0
> Page writes anon                         0            608               0
> Page reclaim immediate             1036792       14175203         1042571
>
> Swap-outs are equivalent to baseline
> Direct reclaim is reduced but not eliminated. It's worth noting
> that there are two periods of direct reclaim for this workload. The
> first is when it switches from preparing the files for the actual
> test itself. It's a lot of file IO followed by a lot of allocs
> that reclaims heavily for a brief window. After that, direct
> reclaim is intermittent when the workload spawns a number of
> threads periodically to do work. kswapd simply cannot wake and
> reclaim fast enough between the low and min watermarks. It could
> be mitigated using vm.watermark_scale_factor but not through
> special tricks in kswapd.
> Page writes from reclaim context are at 0 which is the ideal
> Pages immediately reclaimed after IO completes is back at the baseline
>
> On UMA, there is almost no change so this is not expected to be a universal
> win.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

[...]

> @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>  	return sc.order;
>  }
>  
> +/*
> + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
> + * allocation request woke kswapd for. When kswapd has not woken recently,
> + * the value is MAX_NR_ZONES which is not a valid index. This compares a
> + * given classzone and returns it or the highest classzone index kswapd
> + * was recently woke for.
> + */
> +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
> +					   enum zone_type classzone_idx)
> +{
> +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
> +		return classzone_idx;
> +
> +	return max(pgdat->kswapd_classzone_idx, classzone_idx);

A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx
to a local variable with READ_ONCE(), otherwise something can set it to
MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
Probably not an issue with current callers, but I'd rather future-proof it.

Thanks, Vlastimil

^ permalink raw reply	[flat|nested] 25+ messages in thread
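Vlastimil's suggestion amounts to snapshotting the field once before the comparison so a concurrent writer cannot change it between the check and the max(). A user-space sketch of the pattern follows; a volatile access stands in for the kernel's READ_ONCE(), and the struct and MAX_NR_ZONES value are illustrative, not the kernel definitions.

```c
#include <assert.h>

#define MAX_NR_ZONES 5  /* illustrative */

struct pgdat_toy { int kswapd_classzone_idx; };

static int max_int(int a, int b) { return a > b ? a : b; }

/* Snapshot variant of the helper: read the field once into a local so
 * the MAX_NR_ZONES check and the max() are guaranteed to see the same
 * value even if a concurrent wakeup rewrites the field.  In the kernel
 * the read would be READ_ONCE(); a volatile access models it here. */
static int kswapd_classzone_idx_snap(struct pgdat_toy *pgdat, int classzone_idx)
{
    int curr = *(volatile int *)&pgdat->kswapd_classzone_idx;

    if (curr == MAX_NR_ZONES)
        return classzone_idx;
    return max_int(curr, classzone_idx);
}
```

The observable results are identical to the original helper; the only difference is that the compiler can no longer legally re-read the field between the two uses.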
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-20 16:42 ` Vlastimil Babka @ 2017-02-23 15:01 ` Mel Gorman 2017-03-01 9:04 ` Vlastimil Babka 0 siblings, 1 reply; 25+ messages in thread From: Mel Gorman @ 2017-02-23 15:01 UTC (permalink / raw) To: Vlastimil Babka Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, LKML, Linux-MM

On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
> > With this patch on top, all the latencies relative to the baseline are
> > improved, particularly write latencies. The read latencies are still high
> > for the number of threads but it's worth noting that this is mostly due
> > to the IO scheduler and not directly related to reclaim. The vmstats are
> > a bit of a mix but the relevant ones are as follows;
> >
> >                                 4.10.0-rc7     4.10.0-rc7      4.10.0-rc7
> >                             mmots-20170209    clear-v1r25 keepawake-v1r25
> > Swap Ins                                 0              0               0
> > Swap Outs                                0            608               0
> > Direct pages scanned               6910672        3132699         6357298
> > Kswapd pages scanned              57036946       82488665        56986286
> > Kswapd pages reclaimed            55993488       63474329        55939113
> > Direct pages reclaimed             6905990        2964843         6352115
>
> These stats are confusing me. The earlier description suggests that this patch
> should cause less direct reclaim and more kswapd reclaim, but compared to
> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
> considering direct + kswapd combined)
>

The description is referring to the impact relative to baseline. It is
true that relative to that patch direct reclaim is higher but there are
a number of anomalies.

Note that kswapd is scanning very aggressively in "clear-v1" and overall
efficiency is down to 76%. It's also not clear in the stats but in
"clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
due to the patch "mm, vmscan: fix zone balance check in
prepare_kswapd_sleep".
It's also doing a lot of writing of file-backed pages from reclaim context
and some swapping due to the aggressiveness of the scan.

While direct reclaim activity might be lower, it's due to kswapd scanning
aggressively and trying to reclaim the world which is not the right thing
to do. With the patches applied, there is still direct reclaim but the vast
bulk of it occurs when the workload changes phase from "creating work files"
to starting multiple threads that allocate a lot of anonymous memory with
a sudden spike in memory pressure that kswapd does not keep ahead of with
multiple allocating threads.

> > @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  	return sc.order;
> >  }
> >  
> > +/*
> > + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
> > + * allocation request woke kswapd for. When kswapd has not woken recently,
> > + * the value is MAX_NR_ZONES which is not a valid index. This compares a
> > + * given classzone and returns it or the highest classzone index kswapd
> > + * was recently woke for.
> > + */
> > +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
> > +					   enum zone_type classzone_idx)
> > +{
> > +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
> > +		return classzone_idx;
> > +
> > +	return max(pgdat->kswapd_classzone_idx, classzone_idx);
>
> A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx
> to a local variable with READ_ONCE(), otherwise something can set it to
> MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
> Probably not an issue with current callers, but I'd rather future-proof it.
>

I'm a little wary of adding READ_ONCE unless there is a definite
problem. Even if it was an issue, I think it would be better to protect
these kswapd_classzone_idx and kswapd_order with a spinlock that is taken
if an update is required or a read to fully guarantee the ordering.
The consequence as things stand is that kswapd may miss reclaiming at a
higher order or classzone than it should have, although that is very
unlikely, and the update and read are made with a workqueue wake and
scheduler wakeup, which should be sufficient in terms of barriers.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 25+ messages in thread
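For reference, the "Kswapd efficiency" figures quoted in this thread are pages reclaimed as a percentage of pages scanned. A small sketch reproduces them from the vmstat numbers above; the integer truncation is an assumption about how the reporting script rounds.

```c
#include <assert.h>

/* Reclaim efficiency as reported in the thread: reclaimed pages as a
 * (truncated) percentage of scanned pages.  100% when nothing was
 * scanned, since no scanned page was wasted. */
static unsigned long long efficiency_pct(unsigned long long reclaimed,
                                         unsigned long long scanned)
{
    return scanned ? (reclaimed * 100) / scanned : 100;
}
```

Plugging in the "Kswapd pages reclaimed" / "Kswapd pages scanned" pairs gives 98% for mmots-20170209 and keepawake-v1r25 but only 76% for clear-v1r25, which is the over-aggressive scanning Mel describes.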
* Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-02-23 15:01 ` Mel Gorman @ 2017-03-01 9:04 ` Vlastimil Babka 0 siblings, 0 replies; 25+ messages in thread From: Vlastimil Babka @ 2017-03-01 9:04 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Shantanu Goel, Chris Mason, Johannes Weiner, LKML, Linux-MM On 02/23/2017 04:01 PM, Mel Gorman wrote: > On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote: >>> With this patch on top, all the latencies relative to the baseline are >>> improved, particularly write latencies. The read latencies are still high >>> for the number of threads but it's worth noting that this is mostly due >>> to the IO scheduler and not directly related to reclaim. The vmstats are >>> a bit of a mix but the relevant ones are as follows; >>> >>> 4.10.0-rc7 4.10.0-rc7 4.10.0-rc7 >>> mmots-20170209 clear-v1r25keepawake-v1r25 >>> Swap Ins 0 0 0 >>> Swap Outs 0 608 0 >>> Direct pages scanned 6910672 3132699 6357298 >>> Kswapd pages scanned 57036946 82488665 56986286 >>> Kswapd pages reclaimed 55993488 63474329 55939113 >>> Direct pages reclaimed 6905990 2964843 6352115 >> >> These stats are confusing me. The earlier description suggests that this patch >> should cause less direct reclaim and more kswapd reclaim, but compared to >> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when >> considering direct + kswapd combined) >> > > The description is referring to the impact relative to baseline. It is > true that relative to patch that direct reclaim is higher but there are > a number of anomalies. > > Note that kswapd is scanning very aggressively in "clear-v1" and overall > efficiency is down to 76%. It's also not clear in the stats but in > "clear-v1", pgskip_* is active as the wrong zone is being reclaimed for > due to the patch "mm, vmscan: fix zone balance check in > prepare_kswapd_sleep". 
It's also doing a lot of writing of file-backed > pages from reclaim context and some swapping due to the aggressiveness > of the scan. > > While direct reclaim activity might be lower, it's due to kswapd scanning > aggressively and trying to reclaim the world which is not the right thing > to do. With the patches applied, there is still direct reclaim but the fast > bulk of them are when the workload changes phase from "creating work files" > to starting multiple threads that allocate a lot of anonymous memory with > a sudden spike in memory pressure that kswapd does not keep ahead of with > multiple allocating threads. Thanks for the explanation. > >>> @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) >>> return sc.order; >>> } >>> >>> +/* >>> + * pgdat->kswapd_classzone_idx is the highest zone index that a recent >>> + * allocation request woke kswapd for. When kswapd has not woken recently, >>> + * the value is MAX_NR_ZONES which is not a valid index. This compares a >>> + * given classzone and returns it or the highest classzone index kswapd >>> + * was recently woke for. >>> + */ >>> +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat, >>> + enum zone_type classzone_idx) >>> +{ >>> + if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES) >>> + return classzone_idx; >>> + >>> + return max(pgdat->kswapd_classzone_idx, classzone_idx); >> >> A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx to >> a local variable with READ_ONCE(), otherwise something can set it to >> MAX_NR_ZONES between the check and max(), and compiler can decide to reread. >> Probably not an issue with current callers, but I'd rather future-proof it. >> > > I'm a little wary of adding READ_ONCE unless there is a definite > problem. 
Even if it was an issue, I think it would be better to protect
> these kswapd_classzone_idx and kswapd_order with a spinlock that is taken
> if an update is required or a read to fully guarantee the ordering.
>
> The consequence as things stand is that kswapd may miss reclaiming at a
> higher order or classzone than it should have, although that is very
> unlikely, and the update and read are made with a workqueue wake and
> scheduler wakeup, which should be sufficient in terms of barriers.

OK then.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 25+ messages in thread
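Mel's alternative, one lock guarding both fields so that a wakeup's update and kswapd's read are fully ordered against each other, can be sketched in user space. A pthread mutex stands in for the kernel spinlock here, and every name is illustrative rather than kernel API.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_NR_ZONES 5  /* illustrative */

/* One lock protects both fields, so a reader always sees a consistent
 * (order, classzone_idx) pair and no update can be lost between them. */
struct pgdat_toy {
    pthread_mutex_t lock;
    int kswapd_order;
    int kswapd_classzone_idx;
};

static int max_int(int a, int b) { return a > b ? a : b; }

/* A wakeup raises, but never lowers, the pending order/classzone. */
static void wakeup_update(struct pgdat_toy *p, int order, int classzone_idx)
{
    pthread_mutex_lock(&p->lock);
    p->kswapd_order = max_int(p->kswapd_order, order);
    if (p->kswapd_classzone_idx == MAX_NR_ZONES)
        p->kswapd_classzone_idx = classzone_idx;
    else
        p->kswapd_classzone_idx = max_int(p->kswapd_classzone_idx,
                                          classzone_idx);
    pthread_mutex_unlock(&p->lock);
}

/* kswapd consumes both values atomically and resets them. */
static void kswapd_consume(struct pgdat_toy *p, int *order, int *classzone_idx)
{
    pthread_mutex_lock(&p->lock);
    *order = p->kswapd_order;
    *classzone_idx = p->kswapd_classzone_idx;
    p->kswapd_order = 0;
    p->kswapd_classzone_idx = MAX_NR_ZONES;
    pthread_mutex_unlock(&p->lock);
}
```

The trade-off Mel alludes to is that taking a lock on every wakeup is a cost on allocation paths, which is why the thread settles for the existing wakeup barriers instead.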
* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely 2017-02-15 9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman ` (2 preceding siblings ...) 2017-02-15 9:22 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman @ 2017-02-15 20:30 ` Andrew Morton 2017-02-15 21:29 ` Mel Gorman 3 siblings, 1 reply; 25+ messages in thread From: Andrew Morton @ 2017-02-15 20:30 UTC (permalink / raw) To: Mel Gorman Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > This patchset is based on mmots as of Feb 9th, 2016. The baseline is > important as there are a number of kswapd-related fixes in that tree and > a comparison against v4.10-rc7 would be almost meaningless as a result. It's very late to squeeze this into 4.10. We can make it 4.11 material and perhaps tag it for backporting into 4.10.1? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely 2017-02-15 20:30 ` [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Andrew Morton @ 2017-02-15 21:29 ` Mel Gorman 2017-02-15 21:56 ` Andrew Morton 2017-02-15 22:00 ` Vlastimil Babka 0 siblings, 2 replies; 25+ messages in thread From: Mel Gorman @ 2017-02-15 21:29 UTC (permalink / raw) To: Andrew Morton Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM

On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote:
> On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > This patchset is based on mmots as of Feb 9th, 2017. The baseline is
> > important as there are a number of kswapd-related fixes in that tree and
> > a comparison against v4.10-rc7 would be almost meaningless as a result.
>
> It's very late to squeeze this into 4.10. We can make it 4.11 material
> and perhaps tag it for backporting into 4.10.1?

It would be important that Johannes's patches go along with them because
I relied on Johannes' fixes to deal with pages being inappropriately
written back from reclaim context when I was analysing the workload.
I'm thinking specifically about these patches

mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch
mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch
mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch
mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch
mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch
mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch
mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch

This is 4.11 material for sure but I would not automatically try merging
them to 4.10 unless those patches were also included, ideally with a rerun
of just those patches against 4.10 to make sure there are no surprises
lurking in there.
-- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely 2017-02-15 21:29 ` Mel Gorman @ 2017-02-15 21:56 ` Andrew Morton 2017-02-15 22:15 ` Mel Gorman 2017-02-15 22:00 ` Vlastimil Babka 1 sibling, 1 reply; 25+ messages in thread From: Andrew Morton @ 2017-02-15 21:56 UTC (permalink / raw) To: Mel Gorman Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM On Wed, 15 Feb 2017 21:29:06 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote: > > On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > This patchset is based on mmots as of Feb 9th, 2016. The baseline is > > > important as there are a number of kswapd-related fixes in that tree and > > > a comparison against v4.10-rc7 would be almost meaningless as a result. > > > > It's very late to squeeze this into 4.10. We can make it 4.11 material > > and perhaps tag it for backporting into 4.10.1? > > It would be important that Johannes's patches go along with then because > I'm relied on Johannes' fixes to deal with pages being inappropriately > written back from reclaim context when I was analysing the workload. 
> I'm thinking specifically about these patches > > mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch > mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch > mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch > > This is 4.11 material for sure but I would not automatically try merging > them to 4.10 unless those patches were also included, ideally with a rerun > of just those patches against 4.10 to make sure there are no surprises > lurking in there. Head spinning a bit. You're saying that if the three patches in the series "Reduce amount of time kswapd sleeps prematurely" are held off until 4.11 then the above 6 patches from Johannes should also be held off for 4.11? ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely 2017-02-15 21:56 ` Andrew Morton @ 2017-02-15 22:15 ` Mel Gorman 0 siblings, 0 replies; 25+ messages in thread From: Mel Gorman @ 2017-02-15 22:15 UTC (permalink / raw) To: Andrew Morton Cc: Shantanu Goel, Chris Mason, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM On Wed, Feb 15, 2017 at 01:56:54PM -0800, Andrew Morton wrote: > On Wed, 15 Feb 2017 21:29:06 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > > > On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote: > > > On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > > This patchset is based on mmots as of Feb 9th, 2016. The baseline is > > > > important as there are a number of kswapd-related fixes in that tree and > > > > a comparison against v4.10-rc7 would be almost meaningless as a result. > > > > > > It's very late to squeeze this into 4.10. We can make it 4.11 material > > > and perhaps tag it for backporting into 4.10.1? > > > > It would be important that Johannes's patches go along with then because > > I'm relied on Johannes' fixes to deal with pages being inappropriately > > written back from reclaim context when I was analysing the workload. 
> > I'm thinking specifically about these patches > > > > mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch > > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch > > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch > > mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch > > mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch > > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch > > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch > > > > This is 4.11 material for sure but I would not automatically try merging > > them to 4.10 unless those patches were also included, ideally with a rerun > > of just those patches against 4.10 to make sure there are no surprises > > lurking in there. > > Head spinning a bit. You're saying that if the three patches in the > series "Reduce amount of time kswapd sleeps prematurely" are held off > until 4.11 then the above 6 patches from Johannes should also be held > off for 4.11? > Not quite, sorry for the confusion. Johannes's patches stand on their own and they are fine. I evaluated them before against 4.10-rcX and they worked as expected. If they go in first then these patches can go into 4.10-stable on top. If you plan to merge Johannes's patches for 4.10 or 4.10.1 then I have no problem with that whatsoever. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely 2017-02-15 21:29 ` Mel Gorman 2017-02-15 21:56 ` Andrew Morton @ 2017-02-15 22:00 ` Vlastimil Babka 1 sibling, 0 replies; 25+ messages in thread From: Vlastimil Babka @ 2017-02-15 22:00 UTC (permalink / raw) To: Mel Gorman, Andrew Morton Cc: Shantanu Goel, Chris Mason, Johannes Weiner, LKML, Linux-MM On 15.2.2017 22:29, Mel Gorman wrote: > On Wed, Feb 15, 2017 at 12:30:55PM -0800, Andrew Morton wrote: >> On Wed, 15 Feb 2017 09:22:44 +0000 Mel Gorman <mgorman@techsingularity.net> wrote: >> >>> This patchset is based on mmots as of Feb 9th, 2016. The baseline is >>> important as there are a number of kswapd-related fixes in that tree and >>> a comparison against v4.10-rc7 would be almost meaningless as a result. >> >> It's very late to squeeze this into 4.10. We can make it 4.11 material >> and perhaps tag it for backporting into 4.10.1? > > It would be important that Johannes's patches go along with then because > I'm relied on Johannes' fixes to deal with pages being inappropriately > written back from reclaim context when I was analysing the workload. > I'm thinking specifically about these patches > > mm-vmscan-scan-dirty-pages-even-in-laptop-mode.patch > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru.patch > mm-vmscan-kick-flushers-when-we-encounter-dirty-pages-on-the-lru-fix.patch > mm-vmscan-remove-old-flusher-wakeup-from-direct-reclaim-path.patch > mm-vmscan-only-write-dirty-pages-that-the-scanner-has-seen-twice.patch > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed.patch > mm-vmscan-move-dirty-pages-out-of-the-way-until-theyre-flushed-fix.patch > > This is 4.11 material for sure but I would not automatically try merging > them to 4.10 unless those patches were also included, ideally with a rerun > of just those patches against 4.10 to make sure there are no surprises > lurking in there. 
I wonder if we should also care about 4.9 which will be LTS, if we decide to look at stable at all. IIUC at least the problem that patch 1/3 fixes (wrt kcompactd not being woken up) is there since 4.8? ^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2 @ 2017-03-09 7:56 Mel Gorman 2017-03-09 7:56 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman 0 siblings, 1 reply; 25+ messages in thread From: Mel Gorman @ 2017-03-09 7:56 UTC (permalink / raw) To: Andrew Morton Cc: Shantanu Goel, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

Changelog since v1
o Rebase to 4.11-rc1
o Add small clarifying comment based on review

The series is unusual in that the first patch fixes one problem and introduces a number of other issues that are noted in the changelog. Patch 2 makes a minor modification that is worth considering on its own but leaves the kernel in a state where it behaves badly. It's not until patch 3 that there is an improvement against baseline.

This was mostly motivated by examining Chris Mason's "simoop" benchmark which puts the VM under similar pressure to HADOOP. It has been reported that the benchmark has regressed severely during the last number of releases. While I cannot reproduce all the same problems Chris experienced due to hardware limitations, there were a number of problems on a 2-socket machine with a single disk.

simoop latencies
                                         4.11.0-rc1            4.11.0-rc1
                                            vanilla          keepawake-v2
Amean  p50-Read         21670074.18 (  0.00%) 22668332.52 ( -4.61%)
Amean  p95-Read         25456267.64 (  0.00%) 26738688.00 ( -5.04%)
Amean  p99-Read         29369064.73 (  0.00%) 30991404.52 ( -5.52%)
Amean  p50-Write            1390.30 (  0.00%)     924.91 ( 33.47%)
Amean  p95-Write          412901.57 (  0.00%)    1362.62 ( 99.67%)
Amean  p99-Write         6668722.09 (  0.00%)   16854.04 ( 99.75%)
Amean  p50-Allocation      78714.31 (  0.00%)   74729.74 (  5.06%)
Amean  p95-Allocation     175533.51 (  0.00%)  101609.74 ( 42.11%)
Amean  p99-Allocation     247003.02 (  0.00%)  125765.57 ( 49.08%)

These are latencies. Read/write are threads reading fixed-size random blocks from a simulated database. The allocation latency is mmaping and faulting regions of memory.
The p50, p95 and p99 figures report the worst latencies for 50%, 95% and 99% of the samples respectively. For example, the report indicates that while the test was running, 99% of writes completed 99.75% faster. It's worth noting that on a UMA machine no difference in performance with simoop was observed, so mileage will vary. It's noted that there is a slight impact to read latencies but it's mostly due to IO scheduler decisions and offset by the large reduction in other latencies.

 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 136 ++++++++++++++++++++++++++++++----------------------
 2 files changed, 79 insertions(+), 59 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 25+ messages in thread
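For readers unfamiliar with the reporting, a pNN figure can be derived by sorting the samples and taking the value below which NN% of them fall. A sketch follows; the index rounding is an assumption, since reporting tools differ in how they interpolate between samples.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int compare_dbl(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* pNN percentile: sort a copy of the samples and pick the entry at
 * NN% of the way through the sorted array (truncating the index;
 * real reporting tools may interpolate instead). */
static double percentile(const double *samples, size_t n, double pct)
{
    double *sorted = malloc(n * sizeof(*sorted));
    size_t idx;
    double v;

    memcpy(sorted, samples, n * sizeof(*sorted));
    qsort(sorted, n, sizeof(*sorted), compare_dbl);
    idx = (size_t)(pct / 100.0 * (double)n);
    if (idx >= n)
        idx = n - 1;
    v = sorted[idx];
    free(sorted);
    return v;
}
```

With ten latency samples, percentile(s, 10, 50) returns the sixth-smallest value: half the samples completed at or below it, which is what "the worst latency for 50% of the samples" means above.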
* [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx 2017-03-09 7:56 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2 Mel Gorman @ 2017-03-09 7:56 ` Mel Gorman 0 siblings, 0 replies; 25+ messages in thread From: Mel Gorman @ 2017-03-09 7:56 UTC (permalink / raw) To: Andrew Morton Cc: Shantanu Goel, Johannes Weiner, Vlastimil Babka, LKML, Linux-MM, Mel Gorman

kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks what zone it was recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0.

However, the decision on whether to sleep is made on kswapd_classzone_idx which is 0 without a recent wakeup request and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again in the near future there are two consequences -- the pgdat bits that control congestion are cleared prematurely and direct reclaim is more likely as kswapd slept prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there have been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results.
Otherwise this patch looks artificially excellent

                                 4.11.0-rc1             4.11.0-rc1              4.11.0-rc1
                                    vanilla               clear-v2            keepawake-v2
Amean  p50-Read         21670074.18 (  0.00%) 19786774.76 (   8.69%) 22668332.52 (  -4.61%)
Amean  p95-Read         25456267.64 (  0.00%) 24101956.27 (   5.32%) 26738688.00 (  -5.04%)
Amean  p99-Read         29369064.73 (  0.00%) 27691872.71 (   5.71%) 30991404.52 (  -5.52%)
Amean  p50-Write            1390.30 (  0.00%)     1011.91 (  27.22%)     924.91 (  33.47%)
Amean  p95-Write          412901.57 (  0.00%)    34874.98 (  91.55%)    1362.62 (  99.67%)
Amean  p99-Write         6668722.09 (  0.00%)   575449.60 (  91.37%)   16854.04 (  99.75%)
Amean  p50-Allocation      78714.31 (  0.00%)    84246.26 (  -7.03%)   74729.74 (   5.06%)
Amean  p95-Allocation     175533.51 (  0.00%)   400058.43 (-127.91%)  101609.74 (  42.11%)
Amean  p99-Allocation     247003.02 (  0.00%) 10905600.00 (-4315.17%) 125765.57 (  49.08%)

With this patch on top, write and allocation latencies are massively improved. The read latencies are slightly impaired but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim.
The vmstats are a bit of a mix but the relevant ones are as follows;

                                4.10.0-rc7     4.10.0-rc7      4.10.0-rc7
                            mmots-20170209    clear-v1r25 keepawake-v1r25
Swap Ins                                 0              0               0
Swap Outs                                0            608               0
Direct pages scanned               6910672        3132699         6357298
Kswapd pages scanned              57036946       82488665        56986286
Kswapd pages reclaimed            55993488       63474329        55939113
Direct pages reclaimed             6905990        2964843         6352115
Kswapd efficiency                      98%            76%             98%
Kswapd velocity                  12494.375      17597.507       12488.065
Direct efficiency                      99%            94%             99%
Direct velocity                   1513.835        668.306        1393.148
Page writes by reclaim               0.000    4410243.000           0.000
Page writes file                         0        4409635               0
Page writes anon                         0            608               0
Page reclaim immediate             1036792       14175203         1042571

                                4.11.0-rc1     4.11.0-rc1      4.11.0-rc1
                                   vanilla       clear-v2    keepawake-v2
Swap Ins                                 0             12               0
Swap Outs                                0            838               0
Direct pages scanned               6579706        3237270         6256811
Kswapd pages scanned              61853702       79961486        54837791
Kswapd pages reclaimed            60768764       60755788        53849586
Direct pages reclaimed             6579055        2987453         6256151
Kswapd efficiency                      98%            75%             98%
Page writes by reclaim               0.000    4389496.000           0.000
Page writes file                         0        4388658               0
Page writes anon                         0            838               0
Page reclaim immediate             1073573       14473009          982507

Swap-outs are equivalent to baseline. Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when it switches from preparing the files for the actual test itself. It's a lot of file IO followed by a lot of allocs that reclaims heavily for a brief window. While direct reclaim is lower with clear-v2, it is due to kswapd scanning aggressively and trying to reclaim the world which is not the right thing to do. With the patches applied, there is still direct reclaim but the bulk of it occurs at the phase change from "creating work files" to starting multiple threads that allocate a lot of anonymous memory faster than kswapd can reclaim it. Scanning/reclaim efficiency is restored by this patch. Page writes from reclaim context are back at 0 which is ideal.
The number of pages immediately reclaimed after IO completes ("Page reclaim
immediate") is slightly improved, but this figure is expected to vary. On
UMA, there is almost no change, so this is not expected to be a universal
win.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/memory_hotplug.c |   2 +-
 mm/vmscan.c         | 118 +++++++++++++++++++++++++++++-----------------------
 2 files changed, 66 insertions(+), 54 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 295479b792ec..edff09061e32 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1207,7 +1207,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 		/* Reset the nr_zones, order and classzone_idx before reuse */
 		pgdat->nr_zones = 0;
 		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = 0;
+		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	}
 
 	/* we can use NODE_DATA(nid) from here */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17b1afbce88e..6b09ed5e4bda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3084,14 +3084,36 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
+/*
+ * Returns true if there is an eligible zone balanced for the request order
+ * and classzone_idx
+ */
+static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 {
-	unsigned long mark = high_wmark_pages(zone);
+	int i;
+	unsigned long mark = -1;
+	struct zone *zone;
 
-	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
-		return false;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
 
-	return true;
+		if (!managed_zone(zone))
+			continue;
+
+		mark = high_wmark_pages(zone);
+		if (zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+			return true;
+	}
+
+	/*
+	 * If a node has no populated zone within classzone_idx, it does not
+	 * need balancing by definition. This can happen if a zone-restricted
+	 * allocation tries to wake a remote kswapd.
+	 */
+	if (mark == -1)
+		return true;
+
+	return false;
 }
 
 /* Clear pgdat state for congested, dirty or under writeback. */
@@ -3110,8 +3132,6 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
  */
 static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 {
-	int i;
-
 	/*
 	 * The throttled processes are normally woken up in balance_pgdat() as
 	 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
@@ -3128,16 +3148,9 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	if (waitqueue_active(&pgdat->pfmemalloc_wait))
 		wake_up_all(&pgdat->pfmemalloc_wait);
 
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!managed_zone(zone))
-			continue;
-
-		if (zone_balanced(zone, order, classzone_idx)) {
-			clear_pgdat_congested(pgdat);
-			return true;
-		}
+	if (pgdat_balanced(pgdat, order, classzone_idx)) {
+		clear_pgdat_congested(pgdat);
+		return true;
 	}
 
 	return false;
@@ -3243,23 +3256,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Check from
-		 * high to low zone as allocations prefer higher zones.
-		 * Scanning from low to high zone would allow congestion to be
-		 * cleared during a very small window when a small low
-		 * zone was balanced even under extreme pressure when the
-		 * overall node may be congested. Note that sc.reclaim_idx
-		 * is not used as buffer_heads_over_limit may have adjusted
-		 * it.
+		 * Only reclaim if there are no eligible zones. Note that
+		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
+		 * have adjusted it.
 		 */
-		for (i = classzone_idx; i >= 0; i--) {
-			zone = pgdat->node_zones + i;
-			if (!managed_zone(zone))
-				continue;
-
-			if (zone_balanced(zone, sc.order, classzone_idx))
-				goto out;
-		}
+		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+			goto out;
 
 		/*
 		 * Do some background aging of the anon list, to give
@@ -3322,6 +3324,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	return sc.order;
 }
 
+/*
+ * pgdat->kswapd_classzone_idx is the highest zone index that a recent
+ * allocation request woke kswapd for. When kswapd has not woken recently,
+ * the value is MAX_NR_ZONES which is not a valid index. This compares a
+ * given classzone and returns it or the highest classzone index kswapd
+ * was recently woke for.
+ */
+static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
+					   enum zone_type classzone_idx)
+{
+	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
+		return classzone_idx;
+
+	return max(pgdat->kswapd_classzone_idx, classzone_idx);
+}
+
 static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order,
 				unsigned int classzone_idx)
 {
@@ -3363,7 +3381,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
 		 * the previous request that slept prematurely.
 		 */
 		if (remaining) {
-			pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+			pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 			pgdat->kswapd_order = max(pgdat->kswapd_order, reclaim_order);
 		}
 
@@ -3417,7 +3435,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_o
  */
 static int kswapd(void *p)
 {
-	unsigned int alloc_order, reclaim_order, classzone_idx;
+	unsigned int alloc_order, reclaim_order;
+	unsigned int classzone_idx = MAX_NR_ZONES - 1;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;
 
@@ -3447,20 +3466,23 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();
 
-	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
-	pgdat->kswapd_classzone_idx = classzone_idx = 0;
+	pgdat->kswapd_order = 0;
+	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 	for ( ; ; ) {
 		bool ret;
 
+		alloc_order = reclaim_order = pgdat->kswapd_order;
+		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
+
 kswapd_try_sleep:
 		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
 					classzone_idx);
 
 		/* Read the new order and classzone_idx */
 		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = pgdat->kswapd_classzone_idx;
+		classzone_idx = kswapd_classzone_idx(pgdat, 0);
 		pgdat->kswapd_order = 0;
-		pgdat->kswapd_classzone_idx = 0;
+		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
 
 		ret = try_to_freeze();
 		if (kthread_should_stop())
@@ -3486,9 +3508,6 @@ static int kswapd(void *p)
 		reclaim_order = balance_pgdat(pgdat, alloc_order, classzone_idx);
 		if (reclaim_order < alloc_order)
 			goto kswapd_try_sleep;
-
-		alloc_order = reclaim_order = pgdat->kswapd_order;
-		classzone_idx = pgdat->kswapd_classzone_idx;
 	}
 
 	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
@@ -3504,7 +3523,6 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
-	int z;
 
 	if (!managed_zone(zone))
 		return;
@@ -3512,22 +3530,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	if (!cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
 		return;
 	pgdat = zone->zone_pgdat;
-	pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx, classzone_idx);
+	pgdat->kswapd_classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
 	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
 
 	/* Only wake kswapd if all zones are unbalanced */
-	for (z = 0; z <= classzone_idx; z++) {
-		zone = pgdat->node_zones + z;
-		if (!managed_zone(zone))
-			continue;
-
-		if (zone_balanced(zone, order, classzone_idx))
-			return;
-	}
+	if (pgdat_balanced(pgdat, order, classzone_idx))
+		return;
 
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
-- 
2.11.0
Thread overview: 25+ messages

2017-02-15  9:22 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Mel Gorman
2017-02-15  9:22 ` [PATCH 1/3] mm, vmscan: fix zone balance check in prepare_kswapd_sleep Mel Gorman
2017-02-16  2:50   ` Hillf Danton
2017-02-22  7:00   ` Minchan Kim
2017-02-23 15:05     ` Mel Gorman
2017-02-24  1:17       ` Minchan Kim
2017-02-24  9:11         ` Mel Gorman
2017-02-27  6:16           ` Minchan Kim
2017-02-15  9:22 ` [PATCH 2/3] mm, vmscan: Only clear pgdat congested/dirty/writeback state when balanced Mel Gorman
2017-02-15  9:22 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman
2017-02-16  6:23   ` Hillf Danton
2017-02-16  8:10     ` Mel Gorman
2017-02-16  8:21       ` Hillf Danton
2017-02-16  9:32         ` Mel Gorman
2017-02-20 16:34   ` Vlastimil Babka
2017-02-21  4:10     ` Hillf Danton
2017-02-20 16:42   ` Vlastimil Babka
2017-02-23 15:01     ` Mel Gorman
2017-03-01  9:04       ` Vlastimil Babka
2017-02-15 20:30 ` [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely Andrew Morton
2017-02-15 21:29   ` Mel Gorman
2017-02-15 21:56     ` Andrew Morton
2017-02-15 22:15       ` Mel Gorman
2017-02-15 22:00   ` Vlastimil Babka
2017-03-09  7:56 [PATCH 0/3] Reduce amount of time kswapd sleeps prematurely v2 Mel Gorman
2017-03-09  7:56 ` [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx Mel Gorman