* [RFC PATCH 0/5] compaction: changing initial position of scanners
@ 2015-01-19 10:05 Vlastimil Babka
  2015-01-19 10:05 ` [PATCH 1/5] mm, compaction: more robust check for scanners meeting Vlastimil Babka
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:05 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, David Rientjes,
	Christoph Lameter, Joonsoo Kim, Mel Gorman, Michal Nazarewicz,
	Minchan Kim, Naoya Horiguchi, Rik van Riel

Even after all the patches compaction has received in the last several
versions, it turns out that its effectiveness degrades considerably as the
system ages after a reboot. For example, see how the success rates of
stress-highalloc from mmtests degrade when we re-execute it several times,
the first run being right after a fresh reboot:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            4-nothp-1             4-nothp-2             4-nothp-3
Success 1 Min         25.00 (  0.00%)       13.00 ( 48.00%)        9.00 ( 64.00%)
Success 1 Mean        36.20 (  0.00%)       23.40 ( 35.36%)       16.40 ( 54.70%)
Success 1 Max         41.00 (  0.00%)       34.00 ( 17.07%)       25.00 ( 39.02%)
Success 2 Min         25.00 (  0.00%)       15.00 ( 40.00%)       10.00 ( 60.00%)
Success 2 Mean        37.20 (  0.00%)       25.00 ( 32.80%)       17.20 ( 53.76%)
Success 2 Max         44.00 (  0.00%)       36.00 ( 18.18%)       25.00 ( 43.18%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       78.00 (  7.14%)
Success 3 Mean        85.80 (  0.00%)       82.80 (  3.50%)       80.40 (  6.29%)
Success 3 Max         87.00 (  0.00%)       84.00 (  3.45%)       82.00 (  5.75%)

Wouldn't it be much better if it looked like this?

                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

In my humble opinion, it would :) Much lower degradation, and a nice
improvement in the first iteration as a bonus.

So what sorcery is this? Nothing much, just a fundamental change to how the
compaction scanners operate...

As everyone knows [1] the migration scanner starts at the first pageblock
of a zone, and goes towards the end, and the free scanner starts at the
last pageblock and goes towards the beginning. Somewhere in the middle of the
zone, the scanners meet:

   zone_start                                                   zone_end
       |                                                           |
       -------------------------------------------------------------
       MMMMMMMMMMMMM| =>                            <= |FFFFFFFFFFFF
               migrate_pfn                         free_pfn

In my tests, the scanners meet around the middle of the zone on the first
iteration, and around one third of the zone on subsequent iterations, which
means the migration scanner doesn't see the larger part of the zone at all.
For more details on why that is bad, see the Patch 4 description.

To make sure we eventually scan the whole zone with the migration scanner, we
could e.g. reverse the directions after each run. But that would still be
biased, and with only one third of the zone reachable from each side, we would
still omit the middle third of the zone.

Or we could stop terminating compaction when the scanners meet, and let them
continue to scan the whole zone. Mel told me this used to be the case long
ago, but that approach would result in migrating pages back and forth during a
single compaction run, which wouldn't be cool.

So the approach taken by this patchset is to let the scanners start at an
arbitrary pfn within the zone, called the pivot, and keep their directions:

   zone_start                     pivot                         zone_end
       |                            |                              |
       -------------------------------------------------------------
                         <= |FFFFFFFMMMMMM| =>
                        free_pfn     migrate_pfn

Eventually, one of the scanners will reach the zone boundary and wrap around,
e.g. in the case of the free scanner:

   zone_start                     pivot                         zone_end
       |                            |                              |
       -------------------------------------------------------------
       FFFFFFFFFFFFFFFFFFFFFFFFFFFFFMMMMMMMMMMMM| =>           <= |F
                                           migrate_pfn        free_pfn

Compaction will again terminate when the scanners meet.
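
To illustrate the termination logic, here is a minimal standalone sketch of
the wrap-aware meeting check (the struct, SKETCH_PAGEBLOCK_ORDER and the
helper names are made up for this illustration; the real check is
compact_scanners_met() introduced in patch 4):

#include <stdbool.h>

#define SKETCH_PAGEBLOCK_ORDER	9	/* assume 512-page pageblocks */

struct scanner_state {
	unsigned long pivot_pfn;	/* where both scanners initially start */
	unsigned long migrate_pfn;	/* scans upwards, may wrap to zone start */
	unsigned long free_pfn;		/* scans downwards, may wrap to zone end */
};

/* The migration scanner has wrapped once it is below the pivot again. */
static bool migrate_wrapped(const struct scanner_state *s)
{
	return s->migrate_pfn < s->pivot_pfn;
}

/* The free scanner has wrapped once it is at or above the pivot again. */
static bool free_wrapped(const struct scanner_state *s)
{
	return s->free_pfn >= s->pivot_pfn;
}

static bool scanners_met(const struct scanner_state *s)
{
	bool free_below = (s->free_pfn >> SKETCH_PAGEBLOCK_ORDER)
			<= (s->migrate_pfn >> SKETCH_PAGEBLOCK_ORDER);

	if (migrate_wrapped(s) != free_wrapped(s))
		/* Exactly one scanner has wrapped: compare pageblocks. */
		return free_below;
	/*
	 * Neither wrapped: free < pivot <= migrate, they cannot have met yet.
	 * Both wrapped (shouldn't happen): terminate to be safe.
	 */
	return !free_below;
}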


As you can imagine, the required code changes made the termination detection
and the scanners themselves quite hairy. There are lots of corner cases and
the code is often hard to wrap one's head around [puns intended]. The scanner
functions isolate_migratepages() and isolate_freepages() were recently cleaned
up a lot, and this makes them messy again, as they can no longer rely on the
fact that they will meet the other scanner and not the zone boundary.

But the improvements seem to make these complications worth it, and I hope
somebody can suggest more elegant solutions for the various parts of the code.
So here it is as an RFC. Patches 1-3 are cleanups that could be applied in any
case. Patch 4 implements the main changes, but leaves the pivot at the zone's
first pfn, so that the free scanner wraps immediately and there's no actual
change in behavior. Patch 5 updates the pivot in a conservative way, as
explained in its changelog.

Even with the conservative approach, stress-highalloc results are improved,
mainly in the second and later iterations after a fresh reboot. Here are the
results before the patches again, this time including compaction stats:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            4-nothp-1             4-nothp-2             4-nothp-3
Success 1 Min         25.00 (  0.00%)       13.00 ( 48.00%)        9.00 ( 64.00%)
Success 1 Mean        36.20 (  0.00%)       23.40 ( 35.36%)       16.40 ( 54.70%)
Success 1 Max         41.00 (  0.00%)       34.00 ( 17.07%)       25.00 ( 39.02%)
Success 2 Min         25.00 (  0.00%)       15.00 ( 40.00%)       10.00 ( 60.00%)
Success 2 Mean        37.20 (  0.00%)       25.00 ( 32.80%)       17.20 ( 53.76%)
Success 2 Max         44.00 (  0.00%)       36.00 ( 18.18%)       25.00 ( 43.18%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       78.00 (  7.14%)
Success 3 Mean        85.80 (  0.00%)       82.80 (  3.50%)       80.40 (  6.29%)
Success 3 Max         87.00 (  0.00%)       84.00 (  3.45%)       82.00 (  5.75%)

            3.19-rc4    3.19-rc4    3.19-rc4
           4-nothp-1   4-nothp-2   4-nothp-3
User         6781.16     6931.44     6905.44
System       1073.97     1071.20     1067.92                                                                                                                                                                   
Elapsed      2349.71     2290.40     2255.59                                                                                                                                                                   
                                                                                                                                                                                                               
                              3.19-rc4    3.19-rc4    3.19-rc4                                                                                                                                                 
                             4-nothp-1   4-nothp-2   4-nothp-3                                                                                                                                                 
Minor Faults                 198270153   197020929   197146656                                                                                                                                                 
Major Faults                       498         470         527                                                                                                                                                 
Swap Ins                            60          34         105                                                                                                                                                 
Swap Outs                         2780        1011         425                                                                                                                                                 
Allocation stalls                 8325        4615        4769                                                                                                                                                 
DMA allocs                         161         177         141                                                                                                                                                 
DMA32 allocs                 124477754   123412713   123385291                                                                                                                                                 
Normal allocs                 59366406    59040066    58909035                                                                                                                                                 
Movable allocs                       0           0           0                                                                                                                                                 
Direct pages scanned            413858      280997      267341                                                                                                                                                 
Kswapd pages scanned           2204723     2184556     2147546                                                                                                                                                 
Kswapd pages reclaimed         2199430     2181305     2144395                                                                                                                                                 
Direct pages reclaimed          413062      280323      266557                                                                                                                                                 
Kswapd efficiency                  99%         99%         99%                                                                                                                                                 
Kswapd velocity                914.497     977.868     943.877                                                                                                                                                 
Direct efficiency                  99%         99%         99%                                                                                                                                                 
Direct velocity                171.664     125.782     117.500                                                                                                                                                 
Percentage direct scans            15%         11%         11%
Zone normal velocity           359.993     356.888     339.577
Zone dma32 velocity            726.152     746.745     721.784
Zone dma velocity                0.016       0.017       0.016
Page writes by reclaim        2893.000    1119.800     507.600
Page writes file                   112         108          81
Page writes anon                  2780        1011         425
Page reclaim immediate            1134        1197        1412
Sector Reads                   4747006     4606605     4524025
Sector Writes                 12759301    12562316    12629243
Page rescued immediate               0           0           0
Slabs scanned                  1807821     1528751     1498929
Direct inode steals              24428       12649       13346
Kswapd inode steals              33213       29804       30194
Kswapd skipped wait                  0           0           0
THP fault alloc                    217         207         222
THP collapse alloc                 500         539         373
THP splits                          13          14          13
THP fault fallback                  50           9          16
THP collapse fail                   16          17          22
Compaction stalls                 3136        2310        2072
Compaction success                1123         897         828
Compaction failures               2012        1413        1244
Page migrate success           4319697     3682666     3133699
Page migrate failure             19012        9964        7488
Compaction pages isolated      8974417     7634911     6528981
Compaction migrate scanned   182447073   122423031   127292737
Compaction free scanned      389193883   291257587   269516820
Compaction cost                   5931        4824        4267

As the allocation success rates degrade, so do the compaction success rates,
and so does the scanner activity.

Now after the series:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

            3.19-rc4    3.19-rc4    3.19-rc4
           5-nothp-1   5-nothp-2   5-nothp-3
User         6675.75     6742.26     6707.92
System       1069.78     1070.77     1070.25
Elapsed      2450.31     2363.98     2442.21

                              3.19-rc4    3.19-rc4    3.19-rc4
                             5-nothp-1   5-nothp-2   5-nothp-3
Minor Faults                 197652452   197164571   197946824
Major Faults                       882         743         934
Swap Ins                           113          96         144
Swap Outs                         2144        1340        1799
Allocation stalls                10304        9060       10261
DMA allocs                         142         227         181
DMA32 allocs                 124497911   123947671   124010151                                                                                                                                                 
Normal allocs                 59233600    59160910    59669421                                                                                                                                                 
Movable allocs                       0           0           0                                                                                                                                                 
Direct pages scanned            591368      481119      525158                                                                                                                                                 
Kswapd pages scanned           2217381     2164036     2170388                                                                                                                                                 
Kswapd pages reclaimed         2212285     2160162     2166794                                                                                                                                                 
Direct pages reclaimed          590064      480297      524013                                                                                                                                                 
Kswapd efficiency                  99%         99%         99%                                                                                                                                                 
Kswapd velocity                921.119     937.864     936.558                                                                                                                                                 
Direct efficiency                  99%         99%         99%                                                                                                                                                 
Direct velocity                245.659     208.511     226.615                                                                                                                                                 
Percentage direct scans            21%         18%         19%                                                                                                                                                 
Zone normal velocity           378.957     376.819     383.604                                                                                                                                                 
Zone dma32 velocity            787.805     769.530     779.552                                                                                                                                                 
Zone dma velocity                0.015       0.025       0.016                                                                                                                                                 
Page writes by reclaim        2284.600    1432.400    1972.400                                                                                                                                                 
Page writes file                   140          92         173                                                                                                                                                 
Page writes anon                  2144        1340        1799                                                                                                                                                 
Page reclaim immediate            1689        1315        1440                                                                                                                                                 
Sector Reads                   4920699     4830343     4830263                                                                                                                                                 
Sector Writes                 12643658    12588410    12713518                                                                                                                                                 
Page rescued immediate               0           0           0                                                                                                                                                 
Slabs scanned                  2084358     1922421     1962867
Direct inode steals              35881       37170       45512
Kswapd inode steals              35973       35741       30339
Kswapd skipped wait                  0           0           0
THP fault alloc                    191         224         170
THP collapse alloc                 554         533         516
THP splits                          15          15          11
THP fault fallback                  45           0          76
THP collapse fail                   16          18          18
Compaction stalls                 4069        3746        3951
Compaction success                1507        1388        1366
Compaction failures               2562        2357        2585
Page migrate success           4533617     4912832     5278583
Page migrate failure             20127       15273       17862
Compaction pages isolated      9495626    10235697    10993942
Compaction migrate scanned    93585708    99204656    92052138
Compaction free scanned      510786018   503529022   541298357
Compaction cost                   5541        5988        6332

Less degradation in success rates and no degradation in activity.
For comparison, here are the compaction stats without the series again:

Compaction stalls                 3136        2310        2072
Compaction success                1123         897         828
Compaction failures               2012        1413        1244
Page migrate success           4319697     3682666     3133699
Page migrate failure             19012        9964        7488
Compaction pages isolated      8974417     7634911     6528981
Compaction migrate scanned   182447073   122423031   127292737
Compaction free scanned      389193883   291257587   269516820
Compaction cost                   5931        4824        4267

Interestingly, the number of migrate-scanned pages was higher before the
series, nearly twice as high in the first iteration (182M vs 94M). That
suggests the scanner was indeed scanning a limited set of pageblocks over and
over, despite their compaction potential being "depleted". After the patches,
it achieves better results with fewer scanned blocks, as it has a better
chance of encountering non-depleted blocks. The free scanner activity has
increased after the series, which just means that higher success rates lead
to less deferred compaction. There is also some increase in page migrations,
but that is likely also a consequence of the better success rates, and not of
useless back-and-forth migrations caused by pivot changes.

Preliminary testing with THP-like allocations has shown similar improvements,
which is somewhat surprising, because THP allocations do not use sync
compaction and thus do not defer compaction, while changing the pivot is
currently tied to restarting after deferred compaction. Possibly this is due
to allocation activity other than the stress-highalloc test itself. I will
post these results when full measurements are done.

[1] http://lwn.net/Articles/368869/

Vlastimil Babka (5):
  mm, compaction: more robust check for scanners meeting
  mm, compaction: simplify handling restart position in free pages
    scanner
  mm, compaction: encapsulate resetting cached scanner positions
  mm, compaction: allow scanners to start at any pfn within the zone
  mm, compaction: set pivot pfn to the pfn when scanners met last time

 include/linux/mmzone.h |   4 +
 mm/compaction.c        | 276 ++++++++++++++++++++++++++++++++++++++++---------
 mm/internal.h          |   1 +
 3 files changed, 234 insertions(+), 47 deletions(-)

-- 
2.1.2


* [PATCH 1/5] mm, compaction: more robust check for scanners meeting
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
@ 2015-01-19 10:05 ` Vlastimil Babka
  2015-01-20 13:26   ` Zhang Yanfei
  2015-01-19 10:05 ` [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner Vlastimil Babka
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:05 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

Compaction should finish when the migration and free scanners meet, i.e. when
they reach the same pageblock. Currently, however, the test in
compact_finished() simply compares the exact pfns, which may yield a false
negative when the free scanner position is in the middle of a pageblock and
the migration scanner reaches the beginning of the same pageblock.

This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
remember position within pageblock in free pages scanner") allowed the free
scanner position to be in the middle of a pageblock between invocations.
The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in
compact_zone") prevented the issue by adding a special check in the migration
scanner to satisfy the current detection of scanners meeting.

However, the proper fix is to make the detection more robust. This patch
introduces the compact_scanners_met() function that returns true when the free
scanner position is in the same or lower pageblock than the migration scanner.
The special case in isolate_migratepages() introduced by 1d5bfe1ffb5b is
removed.
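
As an illustration, consider this small standalone sketch with hypothetical
pfns (assuming pageblock_order == 9, i.e. 512-page pageblocks):

#include <assert.h>
#include <stdbool.h>

int main(void)
{
	unsigned long migrate_pfn = 10240;	/* start of pageblock 20  */
	unsigned long free_pfn    = 10496;	/* middle of pageblock 20 */

	/* Old check: false, so compaction keeps running (false negative). */
	bool met_old = free_pfn <= migrate_pfn;

	/* New check: both pfns are in pageblock 20, the scanners have met. */
	bool met_new = (free_pfn >> 9) <= (migrate_pfn >> 9);

	assert(!met_old && met_new);
	return 0;
}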

Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 546e571..5fdbdb8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -803,6 +803,16 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
 /*
+ * Test whether the free scanner has reached the same or lower pageblock than
+ * the migration scanner, and compaction should thus terminate.
+ */
+static inline bool compact_scanners_met(struct compact_control *cc)
+{
+	return (cc->free_pfn >> pageblock_order)
+		<= (cc->migrate_pfn >> pageblock_order);
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -1027,12 +1037,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	}
 
 	acct_isolated(zone, cc);
-	/*
-	 * Record where migration scanner will be restarted. If we end up in
-	 * the same pageblock as the free scanner, make the scanners fully
-	 * meet so that compact_finished() terminates compaction.
-	 */
-	cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
+	/* Record where migration scanner will be restarted. */
+	cc->migrate_pfn = low_pfn;
 
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
@@ -1047,7 +1053,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 		return COMPACT_PARTIAL;
 
 	/* Compaction run completes if the migrate and free scanner meet */
-	if (cc->free_pfn <= cc->migrate_pfn) {
+	if (compact_scanners_met(cc)) {
 		/* Let the next compaction start anew. */
 		zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
 		zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
@@ -1238,7 +1244,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 			 * migrate_pages() may return -ENOMEM when scanners meet
 			 * and we want compact_finished() to detect it
 			 */
-			if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
+			if (err == -ENOMEM && !compact_scanners_met(cc)) {
 				ret = COMPACT_PARTIAL;
 				goto out;
 			}
-- 
2.1.2


* [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
  2015-01-19 10:05 ` [PATCH 1/5] mm, compaction: more robust check for scanners meeting Vlastimil Babka
@ 2015-01-19 10:05 ` Vlastimil Babka
  2015-01-20 13:27   ` Zhang Yanfei
  2015-01-19 10:05 ` [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions Vlastimil Babka
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:05 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

Handling the position where the compaction free scanner should restart
(stored in cc->free_pfn) got more complex with commit e14c720efdd7 ("mm,
compaction: remember position within pageblock in free pages scanner").
Currently the position is updated in each loop iteration of
isolate_freepages(), although it's enough to update it only when exiting the
loop, either because we have isolated enough free pages or detected
contention in async compaction. An extra check outside the loop then updates
the position in case we have met the migration scanner.

This can be simplified by moving the test for having isolated enough from the
for-loop header next to the test for contention, and determining the restart
position only in these cases. We can reuse the isolate_start_pfn variable for
this instead of setting cc->free_pfn directly. Outside the loop, we can then
simply set cc->free_pfn to the value of isolate_start_pfn without an extra
check.

We also add a VM_BUG_ON to future-proof the code, in case somebody adds a new
condition that terminates isolate_freepages_block() prematurely without also
being considered in isolate_freepages().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 5fdbdb8..45799a4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -849,7 +849,7 @@ static void isolate_freepages(struct compact_control *cc)
 	 * pages on cc->migratepages. We stop searching if the migrate
 	 * and free page scanners meet or enough free pages are isolated.
 	 */
-	for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
+	for (; block_start_pfn >= low_pfn;
 				block_end_pfn = block_start_pfn,
 				block_start_pfn -= pageblock_nr_pages,
 				isolate_start_pfn = block_start_pfn) {
@@ -883,6 +883,8 @@ static void isolate_freepages(struct compact_control *cc)
 		nr_freepages += isolated;
 
 		/*
+		 * If we isolated enough freepages, or aborted due to async
+		 * compaction being contended, terminate the loop.
 		 * Remember where the free scanner should restart next time,
 		 * which is where isolate_freepages_block() left off.
 		 * But if it scanned the whole pageblock, isolate_start_pfn
@@ -891,28 +893,30 @@ static void isolate_freepages(struct compact_control *cc)
 		 * In that case we will however want to restart at the start
 		 * of the previous pageblock.
 		 */
-		cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
-				isolate_start_pfn :
-				block_start_pfn - pageblock_nr_pages;
-
-		/*
-		 * isolate_freepages_block() might have aborted due to async
-		 * compaction being contended
-		 */
-		if (cc->contended)
+		if ((nr_freepages > cc->nr_migratepages) || cc->contended) {
+			if (isolate_start_pfn >= block_end_pfn)
+				isolate_start_pfn =
+					block_start_pfn - pageblock_nr_pages;
 			break;
+		} else {
+			/*
+			 * isolate_freepages_block() should not terminate
+			 * prematurely unless contended, or isolated enough
+			 */
+			VM_BUG_ON(isolate_start_pfn < block_end_pfn);
+		}
 	}
 
 	/* split_free_page does not map the pages */
 	map_pages(freelist);
 
 	/*
-	 * If we crossed the migrate scanner, we want to keep it that way
-	 * so that compact_finished() may detect this
+	 * Record where the free scanner will restart next time. Either we
+	 * broke from the loop and set isolate_start_pfn based on the last
+	 * call to isolate_freepages_block(), or we met the migration scanner
+	 * and the loop terminated due to isolate_start_pfn < low_pfn
 	 */
-	if (block_start_pfn < low_pfn)
-		cc->free_pfn = cc->migrate_pfn;
-
+	cc->free_pfn = isolate_start_pfn;
 	cc->nr_freepages = nr_freepages;
 }
 
-- 
2.1.2


* [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
  2015-01-19 10:05 ` [PATCH 1/5] mm, compaction: more robust check for scanners meeting Vlastimil Babka
  2015-01-19 10:05 ` [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner Vlastimil Babka
@ 2015-01-19 10:05 ` Vlastimil Babka
  2015-01-20 13:30   ` Zhang Yanfei
  2015-01-19 10:05 ` [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone Vlastimil Babka
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:05 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

Resetting the cached compaction scanner positions is currently done
implicitly in __reset_isolation_suitable() and compact_finished().
Encapsulate the functionality in a new function reset_cached_positions() and
call it explicitly where needed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 45799a4..5626220 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -123,6 +123,13 @@ static inline bool isolation_suitable(struct compact_control *cc,
 	return !get_pageblock_skip(page);
 }
 
+static void reset_cached_positions(struct zone *zone)
+{
+	zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
+	zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
+	zone->compact_cached_free_pfn = zone_end_pfn(zone);
+}
+
 /*
  * This function is called to clear all cached information on pageblocks that
  * should be skipped for page isolation when the migrate and free page scanner
@@ -134,9 +141,6 @@ static void __reset_isolation_suitable(struct zone *zone)
 	unsigned long end_pfn = zone_end_pfn(zone);
 	unsigned long pfn;
 
-	zone->compact_cached_migrate_pfn[0] = start_pfn;
-	zone->compact_cached_migrate_pfn[1] = start_pfn;
-	zone->compact_cached_free_pfn = end_pfn;
 	zone->compact_blockskip_flush = false;
 
 	/* Walk the zone and mark every pageblock as suitable for isolation */
@@ -166,8 +170,10 @@ void reset_isolation_suitable(pg_data_t *pgdat)
 			continue;
 
 		/* Only flush if a full compaction finished recently */
-		if (zone->compact_blockskip_flush)
+		if (zone->compact_blockskip_flush) {
 			__reset_isolation_suitable(zone);
+			reset_cached_positions(zone);
+		}
 	}
 }
 
@@ -1059,9 +1065,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (compact_scanners_met(cc)) {
 		/* Let the next compaction start anew. */
-		zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
-		zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
-		zone->compact_cached_free_pfn = zone_end_pfn(zone);
+		reset_cached_positions(zone);
 
 		/*
 		 * Mark that the PG_migrate_skip information should be cleared
@@ -1187,8 +1191,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	 * is about to be retried after being deferred. kswapd does not do
 	 * this reset as it'll reset the cached information when going to sleep.
 	 */
-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+	if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) {
 		__reset_isolation_suitable(zone);
+		reset_cached_positions(zone);
+	}
 
 	/*
 	 * Setup to move all movable pages to the end of the zone. Used cached
-- 
2.1.2


* [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
                   ` (2 preceding siblings ...)
  2015-01-19 10:05 ` [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions Vlastimil Babka
@ 2015-01-19 10:05 ` Vlastimil Babka
  2015-01-20 15:18   ` Zhang Yanfei
  2015-01-19 10:05 ` [RFC PATCH 5/5] mm, compaction: set pivot pfn to the pfn when scanners met last time Vlastimil Babka
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:05 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

Compaction employs two page scanners - the migration scanner isolates pages
to be the source of migration, and the free page scanner isolates pages to be
the target of migration. Currently, the migration scanner starts at the
zone's first pageblock and progresses towards the last one, while the free
scanner starts at the last pageblock and progresses towards the first one.
Within a pageblock, each scanner scans pages from the first to the last one.
When the scanners meet within the same pageblock, compaction terminates.

One consequence of the current scheme, which turns out to be unfortunate, is
that the migration scanner does not encounter the pageblocks which were
scanned by the free scanner. In a test with stress-highalloc from mmtests,
the scanners were observed to meet around the middle of the zone in the first
two phases of the test (the ones with background memory pressure) when
executed after a fresh reboot. On further executions without a reboot, the
meeting point shifts to roughly a third of the zone, and compaction activity
as well as allocation success rates deteriorate compared to the run after a
fresh reboot.

It turns out that the deterioration is indeed due to the migration scanner
processing only a small part of the zone. Compaction also keeps making this
bias worse through its own activity - by moving all migratable pages towards
the end of the zone, the free scanner has to scan a lot of full pageblocks to
find more free pages. The beginning of the zone contains pageblocks that have
been compacted as much as possible, but the free pages there cannot be
further merged into larger orders due to unmovable pages. The rest of the
zone might contain more suitable pageblocks, but the migration scanner will
not reach them. It also isn't able to move movable pages out of unmovable
pageblocks there, which affects fragmentation.

This patch is the first step to removing this bias. It allows the compaction
scanners to start at an arbitrary pfn (aligned to a pageblock boundary for
practical purposes), called the pivot, within the zone. The migration scanner
starts at the exact pivot pfn, while the free scanner starts at the pageblock
preceding the pivot. The direction of scanning is unaffected, but when the
migration scanner reaches the last pageblock of the zone, or the free scanner
reaches the first pageblock, they wrap around and continue with the first or
last pageblock, respectively. Compaction terminates once either scanner has
wrapped and both meet within the same pageblock.

For easier bisection of potential regressions, this patch always uses the
zone's first pfn as the pivot. That means the free scanner immediately wraps
to the last pageblock and the operation of the scanners is thus unchanged.
The actual pivot changing is done by the next patch.
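
As a small standalone illustration of the initial free scanner placement (the
function, the BLOCK_PAGES value and the parameter names are made up for this
sketch; the real logic is in the compact_zone() changes below), the free
scanner starts one pageblock below the pivot and wraps to the zone's last
pageblock if that would fall below the zone:

#define BLOCK_PAGES	512UL	/* stand-in for pageblock_nr_pages */

static unsigned long free_start_from_pivot(unsigned long pivot_pfn,
					   unsigned long zone_start_pfn,
					   unsigned long zone_end_pfn)
{
	unsigned long free_pfn;

	if (pivot_pfn < BLOCK_PAGES) {
		/* Avoid underflow for zones starting at e.g. pfn 1. */
		free_pfn = (zone_end_pfn - 1) & ~(BLOCK_PAGES - 1);
	} else {
		/* One pageblock below the pivot, aligned down... */
		free_pfn = (pivot_pfn - BLOCK_PAGES) & ~(BLOCK_PAGES - 1);
		/* ...wrapping to the last pageblock if below the zone. */
		if (free_pfn < zone_start_pfn)
			free_pfn = (zone_end_pfn - 1) & ~(BLOCK_PAGES - 1);
	}

	return free_pfn;
}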

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/mmzone.h |   2 +
 mm/compaction.c        | 204 +++++++++++++++++++++++++++++++++++++++++++------
 mm/internal.h          |   1 +
 3 files changed, 182 insertions(+), 25 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2f0856d..47aa181 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -503,6 +503,8 @@ struct zone {
 	unsigned long percpu_drift_mark;
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* pfn where compaction scanners have initially started last time */
+	unsigned long		compact_cached_pivot_pfn;
 	/* pfn where compaction free scanner should start */
 	unsigned long		compact_cached_free_pfn;
 	/* pfn where async and sync compaction migration scanner should start */
diff --git a/mm/compaction.c b/mm/compaction.c
index 5626220..abae89a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -123,11 +123,16 @@ static inline bool isolation_suitable(struct compact_control *cc,
 	return !get_pageblock_skip(page);
 }
 
+/*
+ * Invalidate cached compaction scanner positions, so that compact_zone()
+ * will reinitialize them on the next compaction.
+ */
 static void reset_cached_positions(struct zone *zone)
 {
-	zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
-	zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
-	zone->compact_cached_free_pfn = zone_end_pfn(zone);
+	/* Invalid values are re-initialized in compact_zone */
+	zone->compact_cached_migrate_pfn[0] = 0;
+	zone->compact_cached_migrate_pfn[1] = 0;
+	zone->compact_cached_free_pfn = 0;
 }
 
 /*
@@ -172,11 +177,35 @@ void reset_isolation_suitable(pg_data_t *pgdat)
 		/* Only flush if a full compaction finished recently */
 		if (zone->compact_blockskip_flush) {
 			__reset_isolation_suitable(zone);
-			reset_cached_positions(zone);
+			reset_cached_positions(zone, false);
 		}
 	}
 }
 
+static void update_cached_migrate_pfn(unsigned long pfn,
+		unsigned long pivot_pfn, unsigned long *old_pfn)
+{
+	/* Both old and new pfn either wrapped or not, and new is higher */
+	if (((*old_pfn >= pivot_pfn) == (pfn >= pivot_pfn))
+	    && (pfn > *old_pfn))
+		*old_pfn = pfn;
+	/* New pfn has wrapped and the old didn't yet */
+	else if ((*old_pfn >= pivot_pfn) && (pfn < pivot_pfn))
+		*old_pfn = pfn;
+}
+
+static void update_cached_free_pfn(unsigned long pfn,
+		unsigned long pivot_pfn, unsigned long *old_pfn)
+{
+	/* Both old and new either pfn wrapped or not, and new is lower */
+	if (((*old_pfn < pivot_pfn) == (pfn < pivot_pfn))
+	    && (pfn < *old_pfn))
+		*old_pfn = pfn;
+	/* New pfn has wrapped and the old didn't yet */
+	else if ((*old_pfn < pivot_pfn) && (pfn >= pivot_pfn))
+		*old_pfn = pfn;
+}
+
 /*
  * If no pages were isolated then mark this pageblock to be skipped in the
  * future. The information is later cleared by __reset_isolation_suitable().
@@ -186,6 +215,7 @@ static void update_pageblock_skip(struct compact_control *cc,
 			bool migrate_scanner)
 {
 	struct zone *zone = cc->zone;
+	unsigned long pivot_pfn = cc->pivot_pfn;
 	unsigned long pfn;
 
 	if (cc->ignore_skip_hint)
@@ -203,14 +233,14 @@ static void update_pageblock_skip(struct compact_control *cc,
 
 	/* Update where async and sync compaction should restart */
 	if (migrate_scanner) {
-		if (pfn > zone->compact_cached_migrate_pfn[0])
-			zone->compact_cached_migrate_pfn[0] = pfn;
-		if (cc->mode != MIGRATE_ASYNC &&
-		    pfn > zone->compact_cached_migrate_pfn[1])
-			zone->compact_cached_migrate_pfn[1] = pfn;
+		update_cached_migrate_pfn(pfn, pivot_pfn,
+					&zone->compact_cached_migrate_pfn[0]);
+		if (cc->mode != MIGRATE_ASYNC)
+			update_cached_migrate_pfn(pfn, pivot_pfn,
+					&zone->compact_cached_migrate_pfn[1]);
 	} else {
-		if (pfn < zone->compact_cached_free_pfn)
-			zone->compact_cached_free_pfn = pfn;
+		update_cached_free_pfn(pfn, pivot_pfn,
+					&zone->compact_cached_free_pfn);
 	}
 }
 #else
@@ -808,14 +838,41 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
+static inline bool migrate_scanner_wrapped(struct compact_control *cc)
+{
+	return cc->migrate_pfn < cc->pivot_pfn;
+}
+
+static inline bool free_scanner_wrapped(struct compact_control *cc)
+{
+	return cc->free_pfn >= cc->pivot_pfn;
+}
+
 /*
  * Test whether the free scanner has reached the same or lower pageblock than
  * the migration scanner, and compaction should thus terminate.
  */
 static inline bool compact_scanners_met(struct compact_control *cc)
 {
-	return (cc->free_pfn >> pageblock_order)
-		<= (cc->migrate_pfn >> pageblock_order);
+	bool free_below_migrate = (cc->free_pfn >> pageblock_order)
+		                <= (cc->migrate_pfn >> pageblock_order);
+
+	if (migrate_scanner_wrapped(cc) != free_scanner_wrapped(cc))
+		/*
+		 * Only one of the scanners have wrapped. Terminate if free
+		 * scanner is in the same or lower pageblock than migration
+		 * scanner.
+		*/
+		return free_below_migrate;
+	else
+		/*
+		 * If neither scanner has wrapped, then free < start <=
+		 * migration and we return false by definition.
+		 * It shouldn't happen that both have wrapped, but even if it
+		 * does due to e.g. reading mismatched zone cached pfn's, then
+		 * migration < start <= free, so we return true and terminate.
+		 */
+		return !free_below_migrate;
 }
 
 /*
@@ -832,7 +889,10 @@ static void isolate_freepages(struct compact_control *cc)
 	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
 	int nr_freepages = cc->nr_freepages;
 	struct list_head *freelist = &cc->freepages;
+	bool wrapping; /* set to true in the first pageblock of the zone */
+	bool wrapped; /* set to true when either scanner has wrapped */
 
+wrap:
 	/*
 	 * Initialise the free scanner. The starting point is where we last
 	 * successfully isolated from, zone-cached value, or the end of the
@@ -848,14 +908,25 @@ static void isolate_freepages(struct compact_control *cc)
 	block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
 	block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
 						zone_end_pfn(zone));
-	low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
+
+	wrapping = false;
+	wrapped = free_scanner_wrapped(cc) || migrate_scanner_wrapped(cc);
+	if (!wrapped)
+		/* 
+		 * If neither scanner wrapped yet, we are limited by zone's
+		 * beginning. Here we pretend that the zone starts pageblock
+		 * aligned to make the for-loop condition simpler.
+		 */
+		low_pfn = zone->zone_start_pfn & ~(pageblock_nr_pages-1);
+	else
+		low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
 
 	/*
 	 * Isolate free pages until enough are available to migrate the
 	 * pages on cc->migratepages. We stop searching if the migrate
 	 * and free page scanners meet or enough free pages are isolated.
 	 */
-	for (; block_start_pfn >= low_pfn;
+	for (; !wrapping && block_start_pfn >= low_pfn;
 				block_end_pfn = block_start_pfn,
 				block_start_pfn -= pageblock_nr_pages,
 				isolate_start_pfn = block_start_pfn) {
@@ -870,6 +941,24 @@ static void isolate_freepages(struct compact_control *cc)
 						&& compact_should_abort(cc))
 			break;
 
+		/*
+		 * When we are limited by zone boundary, this means we have
+		 * reached its first pageblock.
+		 */
+		if (!wrapped && block_start_pfn <= zone->zone_start_pfn) {
+			/* The zone might start in the middle of the pageblock */
+			block_start_pfn = zone->zone_start_pfn;
+			if (isolate_start_pfn <= zone->zone_start_pfn)
+				isolate_start_pfn = zone->zone_start_pfn;
+			/*
+			 * For e.g. DMA zone with zone_start_pfn == 1, we will
+			 * underflow block_start_pfn in the next loop
+			 * iteration. We have to terminate the loop with other
+			 * means.
+			 */
+			wrapping = true;
+		}
+
 		page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
 									zone);
 		if (!page)
@@ -903,6 +992,12 @@ static void isolate_freepages(struct compact_control *cc)
 			if (isolate_start_pfn >= block_end_pfn)
 				isolate_start_pfn =
 					block_start_pfn - pageblock_nr_pages;
+			else if (wrapping)
+				/*
+				 * We have been in the first pageblock of the
+				 * zone, but have not finished it yet.
+				 */
+				wrapping = false;
 			break;
 		} else {
 			/*
@@ -913,6 +1008,20 @@ static void isolate_freepages(struct compact_control *cc)
 		}
 	}
 
+	/* Did we reach the beginning of the zone? Wrap to the end. */
+	if (!wrapped && wrapping) {
+		isolate_start_pfn = (zone_end_pfn(zone)-1) &
+						~(pageblock_nr_pages-1);
+		/*
+		 * If we haven't isolated anything, we have to continue
+		 * immediately, otherwise page migration will fail.
+		 */
+		if (!nr_freepages && !cc->contended) {
+			cc->free_pfn = isolate_start_pfn;
+			goto wrap;
+		}
+	}
+
 	/* split_free_page does not map the pages */
 	map_pages(freelist);
 
@@ -984,10 +1093,11 @@ typedef enum {
 static isolate_migrate_t isolate_migratepages(struct zone *zone,
 					struct compact_control *cc)
 {
-	unsigned long low_pfn, end_pfn;
+	unsigned long low_pfn, end_pfn, max_pfn;
 	struct page *page;
 	const isolate_mode_t isolate_mode =
 		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+	bool wrapped = migrate_scanner_wrapped(cc) || free_scanner_wrapped(cc);
 
 	/*
 	 * Start at where we last stopped, or beginning of the zone as
@@ -998,13 +1108,27 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	/* Only scan within a pageblock boundary */
 	end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
 
+	if (!wrapped) {
+		/* 
+		 * Neither of the scanners has wrapped yet, we are limited by
+		 * zone end. Here we pretend it's aligned to pageblock
+		 * boundary to make the for-loop condition simpler
+		 */
+		max_pfn = ALIGN(zone_end_pfn(zone), pageblock_nr_pages);
+	} else {
+		/* If any of the scanners wrapped, we will meet free scanner */
+		max_pfn = cc->free_pfn;
+	}
+
 	/*
 	 * Iterate over whole pageblocks until we find the first suitable.
-	 * Do not cross the free scanner.
+	 * Do not cross the free scanner or the end of the zone.
 	 */
-	for (; end_pfn <= cc->free_pfn;
+	for (; end_pfn <= max_pfn;
 			low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
 
+		if (!wrapped && end_pfn > zone_end_pfn(zone))
+			end_pfn = zone_end_pfn(zone);
 		/*
 		 * This can potentially iterate a massively long zone with
 		 * many pageblocks unsuitable, so periodically check if we
@@ -1047,6 +1171,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	}
 
 	acct_isolated(zone, cc);
+	/* Did we reach the end of the zone? Wrap to the beginning */
+	if (!wrapped && low_pfn >= zone_end_pfn(zone))
+		low_pfn = zone->zone_start_pfn;
+
 	/* Record where migration scanner will be restarted. */
 	cc->migrate_pfn = low_pfn;
 
@@ -1197,22 +1325,48 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	}
 
 	/*
-	 * Setup to move all movable pages to the end of the zone. Used cached
+	 * Setup the scanner positions according to pivot pfn. Use cached
 	 * information on where the scanners should start but check that it
 	 * is initialised by ensuring the values are within zone boundaries.
 	 */
-	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
-	cc->free_pfn = zone->compact_cached_free_pfn;
-	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
-		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
-		zone->compact_cached_free_pfn = cc->free_pfn;
+	cc->pivot_pfn = zone->compact_cached_pivot_pfn;
+	if (cc->pivot_pfn < start_pfn || cc->pivot_pfn > end_pfn) {
+		cc->pivot_pfn = start_pfn;
+		zone->compact_cached_pivot_pfn = cc->pivot_pfn;
+		/* When starting position was invalid, reset the rest */
+		reset_cached_positions(zone);
 	}
+
+	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
 	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
-		cc->migrate_pfn = start_pfn;
+		cc->migrate_pfn = cc->pivot_pfn;
 		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
 		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
 	}
 
+	cc->free_pfn = zone->compact_cached_free_pfn;
+	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn)
+		cc->free_pfn = cc->pivot_pfn;
+
+	/*
+	 * Free scanner should start on the beginning of the pageblock below
+	 * the cc->pivot_pfn. If that's below the zone boundary, wrap to the
+	 * last pageblock of the zone.
+	 */
+	if (cc->free_pfn == cc->pivot_pfn) {
+		/* Don't underflow in zones starting with e.g. pfn 1 */
+		if (cc->pivot_pfn < pageblock_nr_pages) {
+			cc->free_pfn = (end_pfn-1) & ~(pageblock_nr_pages-1);
+		} else {
+			cc->free_pfn = (cc->pivot_pfn - pageblock_nr_pages);
+			cc->free_pfn &= ~(pageblock_nr_pages-1);
+			if (cc->free_pfn < start_pfn)
+				cc->free_pfn = (end_pfn-1) &
+					~(pageblock_nr_pages-1);
+		}
+		zone->compact_cached_free_pfn = cc->free_pfn;
+	}
+
 	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, cc->free_pfn, end_pfn);
 
 	migrate_prep_local();
diff --git a/mm/internal.h b/mm/internal.h
index efad241..cb7b297 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,6 +157,7 @@ struct compact_control {
 	struct list_head migratepages;	/* List of pages being migrated */
 	unsigned long nr_freepages;	/* Number of isolated free pages */
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long pivot_pfn;	/* Where the scanners initially start */
 	unsigned long free_pfn;		/* isolate_freepages search base */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	enum migrate_mode mode;		/* Async or sync migration mode */
-- 
2.1.2


* [RFC PATCH 5/5] mm, compaction: set pivot pfn to the pfn when scanners met last time
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
                   ` (3 preceding siblings ...)
  2015-01-19 10:05 ` [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone Vlastimil Babka
@ 2015-01-19 10:05 ` Vlastimil Babka
  2015-01-19 10:10 ` [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:05 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Vlastimil Babka, Minchan Kim,
	Mel Gorman, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

The previous patch has prepared compaction scanners to start at arbitrary
pivot pfn within the zone, but left the pivot at the first pfn of the zone.
This patch introduces actual changing of the pivot pfn.

Our goal is to remove the bias in compaction under memory pressure, where
the migration scanner scans only the first half (or less) of the zone where
it cannot succeed anymore. At the same time we want to avoid frequent changes
of the pivot which would result in migrating pages back and forth without much
benefit. So the question is how often to change the pivot, and to which pfn
it should be set.

Another thing to consider is that the scanners mark pageblocks as unsuitable
for scanning via update_pageblock_skip(), which uses a single bit per
pageblock. However, a pageblock being unsuitable as a source of free pages is
a completely different condition from a pageblock being unsuitable as a
source of migratable pages. Thus, changing the pivot should be accompanied by
resetting the skip bits. The resetting is currently done either when kswapd
goes to sleep, or when compaction is being restarted after the longest
possible deferred compaction period.

Thus as a conservative first step, this patch does not increase the frequency
of skip bits resetting, and ties changing the pivot only to the situations
where compaction is restarted from being deferred. This happens when
compaction has failed a lot with the previous pivot, and most pageblocks were
already marked as unsuitable. Thus, most migrations occurred relatively long
ago and we are not going to frequently migrate back and forth.

The pivot position is simply set to the pageblock where the scanners have met
during the last finished compaction. This means that the migration scanner will
immediately scan pageblocks that it couldn't reach with the previous pivot.
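
For illustration, the pivot ends up rounded down to the start of its pageblock;
a trivial sketch (512 pages per pageblock is just an assumed example value, the
real pageblock_nr_pages depends on the configuration):

	/* sketch only, not the patch: scanners met e.g. at pfn 123456 */
	unsigned long last_met = 123456 & ~(512UL - 1);	/* 123392, its pageblock start */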

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/mmzone.h |  2 ++
 mm/compaction.c        | 22 +++++++++++++++++-----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 47aa181..7801886 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -505,6 +505,8 @@ struct zone {
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 	/* pfn where compaction scanners have initially started last time */
 	unsigned long		compact_cached_pivot_pfn;
+	/* pfn where compaction scanners have met last time */
+	unsigned long		compact_cached_last_met_pfn;
 	/* pfn where compaction free scanner should start */
 	unsigned long		compact_cached_free_pfn;
 	/* pfn where async and sync compaction migration scanner should start */
diff --git a/mm/compaction.c b/mm/compaction.c
index abae89a..70792c5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -125,10 +125,16 @@ static inline bool isolation_suitable(struct compact_control *cc,
 
 /*
  * Invalidate cached compaction scanner positions, so that compact_zone()
- * will reinitialize them on the next compaction.
+ * will reinitialize them on the next compaction. Optionally reset the
+ * initial pivot position for the scanners to the position where the scanners
+ * have met the last time.
  */
-static void reset_cached_positions(struct zone *zone)
+static void reset_cached_positions(struct zone *zone, bool update_pivot)
 {
+	if (update_pivot)
+		zone->compact_cached_pivot_pfn =
+					zone->compact_cached_last_met_pfn;
+
 	/* Invalid values are re-initialized in compact_zone */
 	zone->compact_cached_migrate_pfn[0] = 0;
 	zone->compact_cached_migrate_pfn[1] = 0;
@@ -1193,7 +1199,13 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (compact_scanners_met(cc)) {
 		/* Let the next compaction start anew. */
-		reset_cached_positions(zone);
+		reset_cached_positions(zone, false);
+		/* 
+		 * Remember where compaction scanners met for the next time
+		 * the pivot pfn is changed.
+		 */
+		zone->compact_cached_last_met_pfn =
+				cc->migrate_pfn & ~(pageblock_nr_pages-1);
 
 		/*
 		 * Mark that the PG_migrate_skip information should be cleared
@@ -1321,7 +1333,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	 */
 	if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) {
 		__reset_isolation_suitable(zone);
-		reset_cached_positions(zone);
+		reset_cached_positions(zone, true);
 	}
 
 	/*
@@ -1334,7 +1346,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		cc->pivot_pfn = start_pfn;
 		zone->compact_cached_pivot_pfn = cc->pivot_pfn;
 		/* When starting position was invalid, reset the rest */
-		reset_cached_positions(zone);
+		reset_cached_positions(zone, false);
 	}
 
 	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
-- 
2.1.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
                   ` (4 preceding siblings ...)
  2015-01-19 10:05 ` [RFC PATCH 5/5] mm, compaction: set pivot pfn to the pfn when scanners met last time Vlastimil Babka
@ 2015-01-19 10:10 ` Vlastimil Babka
  2015-01-20  9:02 ` Vlastimil Babka
  2015-02-03  6:49 ` Joonsoo Kim
  7 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-19 10:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, David Rientjes, Christoph Lameter,
	Joonsoo Kim, Mel Gorman, Michal Nazarewicz, Minchan Kim,
	Naoya Horiguchi, Rik van Riel

Looks like wrapping got busted in the results part. Let's try again:
-----

Even after all the patches compaction received in last several versions, it
turns out that its effectiveness degrades considerably as the system ages
after reboot. For example, see how the success rates of stress-highalloc from
mmtests degrade when we re-execute it several times, the first time being after
a fresh reboot:
                             3.19-rc4              3.19-rc4              3.19-rc4
                            4-nothp-1             4-nothp-2             4-nothp-3
Success 1 Min         25.00 (  0.00%)       13.00 ( 48.00%)        9.00 ( 64.00%)
Success 1 Mean        36.20 (  0.00%)       23.40 ( 35.36%)       16.40 ( 54.70%)
Success 1 Max         41.00 (  0.00%)       34.00 ( 17.07%)       25.00 ( 39.02%)
Success 2 Min         25.00 (  0.00%)       15.00 ( 40.00%)       10.00 ( 60.00%)
Success 2 Mean        37.20 (  0.00%)       25.00 ( 32.80%)       17.20 ( 53.76%)
Success 2 Max         44.00 (  0.00%)       36.00 ( 18.18%)       25.00 ( 43.18%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       78.00 (  7.14%)
Success 3 Mean        85.80 (  0.00%)       82.80 (  3.50%)       80.40 (  6.29%)
Success 3 Max         87.00 (  0.00%)       84.00 (  3.45%)       82.00 (  5.75%)

Wouldn't it be much better, if it looked like this?

                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

In my humble opinion, it would :) Much lower degradation, and a nice
improvement in the first iteration as a bonus.

So what sorcery is this? Nothing much, just a fundamental change of the
compaction scanners operation...

As everyone knows [1] the migration scanner starts at the first pageblock
of a zone, and goes towards the end, and the free scanner starts at the
last pageblock and goes towards the beginning. Somewhere in the middle of the
zone, the scanners meet:

   zone_start                                                   zone_end
       |                                                           |
       -------------------------------------------------------------
       MMMMMMMMMMMMM| =>                            <= |FFFFFFFFFFFF
               migrate_pfn                         free_pfn

In my tests, the scanners meet around the middle of the zone on the first
iteration, and around the first third on subsequent iterations. This means the
migration scanner doesn't see the larger part of the zone at all. For more
details on why that's bad, see the Patch 4 description.

To make sure we eventually scan the whole zone with the migration scanner, we
could e.g. reverse the directions after each run. But that would still be
biased, and with 1/3 of the zone reachable from each side, we would still omit
the middle 1/3 of the zone.

Or we could stop terminating compaction when the scanners meet, and let them
continue to scan the whole zone. Mel told me it used to be the case long ago,
but that approach would result in migrating pages back and forth during a single
compaction run, which wouldn't be cool.

So the approach taken by this patchset is to let scanners start at any
so-called pivot pfn within the zone, and keep their direction:

   zone_start                     pivot                         zone_end
       |                            |                              |
       -------------------------------------------------------------
                         <= |FFFFFFFMMMMMM| =>
                        free_pfn     migrate_pfn

Eventually, one of the scanners will reach the zone boundary and wrap around,
e.g. in the case of the free scanner:

   zone_start                     pivot                         zone_end
       |                            |                              |
       -------------------------------------------------------------
       FFFFFFFFFFFFFFFFFFFFFFFFFFFFFMMMMMMMMMMMM| =>           <= |F
                                           migrate_pfn        free_pfn

Compaction will again terminate when the scanners meet.
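
To make the wrap-around easier to follow, here is a minimal standalone sketch
(plain userspace C, not the kernel code itself; the zone boundaries, the pivot
and the 512-page pageblock size are made-up example values) of how the initial
scanner positions are derived from the pivot, mirroring the logic in Patch 4:

#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 512UL

static unsigned long pageblock_start(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

int main(void)
{
	unsigned long zone_start = 4096, zone_end = 262144;
	unsigned long pivot = 131072;	/* example pivot near the middle */
	unsigned long migrate_pfn, free_pfn;

	/* The migration scanner starts at the pivot and scans upwards. */
	migrate_pfn = pivot;

	/*
	 * The free scanner starts at the pageblock below the pivot and scans
	 * downwards; if no such pageblock exists within the zone, it wraps
	 * immediately to the last pageblock of the zone.
	 */
	if (pivot < PAGEBLOCK_NR_PAGES ||
	    pageblock_start(pivot - PAGEBLOCK_NR_PAGES) < zone_start)
		free_pfn = pageblock_start(zone_end - 1);
	else
		free_pfn = pageblock_start(pivot - PAGEBLOCK_NR_PAGES);

	printf("migrate_pfn=%lu free_pfn=%lu\n", migrate_pfn, free_pfn);
	return 0;
}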


As you can imagine, the required code changes made the termination detection
and the scanners themselves quite hairy. There are lots of corner cases and
the code is often hard to wrap one's head around [puns intended]. The scanner
functions isolate_migratepages() and isolate_freepages() were recently cleaned
up a lot, and this makes them messy again, as they can no longer rely on the
fact that they will meet the other scanner and not the zone boundary.

But the improvements seem to make these complications worthwhile, and I hope
somebody can suggest more elegant solutions to the various parts of the code.
So here it is as an RFC. Patches 1-3 are cleanups that could be applied in any
case. Patch 4 implements the main changes, but leaves the pivot at the zone's
first pfn, so that the free scanner wraps immediately and there's no
actual change. Patch 5 updates the pivot in a conservative way, as explained
in the changelog.

Even with the conservative approach, stress-highalloc results are improved,
mainly in the second and later iterations after a fresh reboot. Here are again
the results before the patches, including compaction stats:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            4-nothp-1             4-nothp-2             4-nothp-3
Success 1 Min         25.00 (  0.00%)       13.00 ( 48.00%)        9.00 ( 64.00%)
Success 1 Mean        36.20 (  0.00%)       23.40 ( 35.36%)       16.40 ( 54.70%)
Success 1 Max         41.00 (  0.00%)       34.00 ( 17.07%)       25.00 ( 39.02%)
Success 2 Min         25.00 (  0.00%)       15.00 ( 40.00%)       10.00 ( 60.00%)
Success 2 Mean        37.20 (  0.00%)       25.00 ( 32.80%)       17.20 ( 53.76%)
Success 2 Max         44.00 (  0.00%)       36.00 ( 18.18%)       25.00 ( 43.18%)
Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       78.00 (  7.14%)
Success 3 Mean        85.80 (  0.00%)       82.80 (  3.50%)       80.40 (  6.29%)
Success 3 Max         87.00 (  0.00%)       84.00 (  3.45%)       82.00 (  5.75%)

            3.19-rc4    3.19-rc4    3.19-rc4
           4-nothp-1   4-nothp-2   4-nothp-3
User         6781.16     6931.44     6905.44
System       1073.97     1071.20     1067.92                                                                                                                                                                   
Elapsed      2349.71     2290.40     2255.59                                                                                                                                                                   
                                                                                                                                                                                                               
                              3.19-rc4    3.19-rc4    3.19-rc4                                                                                                                                                 
                             4-nothp-1   4-nothp-2   4-nothp-3                                                                                                                                                 
Minor Faults                 198270153   197020929   197146656                                                                                                                                                 
Major Faults                       498         470         527                                                                                                                                                 
Swap Ins                            60          34         105                                                                                                                                                 
Swap Outs                         2780        1011         425                                                                                                                                                 
Allocation stalls                 8325        4615        4769                                                                                                                                                 
DMA allocs                         161         177         141                                                                                                                                                 
DMA32 allocs                 124477754   123412713   123385291                                                                                                                                                 
Normal allocs                 59366406    59040066    58909035                                                                                                                                                 
Movable allocs                       0           0           0                                                                                                                                                 
Direct pages scanned            413858      280997      267341                                                                                                                                                 
Kswapd pages scanned           2204723     2184556     2147546                                                                                                                                                 
Kswapd pages reclaimed         2199430     2181305     2144395                                                                                                                                                 
Direct pages reclaimed          413062      280323      266557                                                                                                                                                 
Kswapd efficiency                  99%         99%         99%                                                                                                                                                 
Kswapd velocity                914.497     977.868     943.877                                                                                                                                                 
Direct efficiency                  99%         99%         99%                                                                                                                                                 
Direct velocity                171.664     125.782     117.500                                                                                                                                                 
Percentage direct scans            15%         11%         11%
Zone normal velocity           359.993     356.888     339.577
Zone dma32 velocity            726.152     746.745     721.784
Zone dma velocity                0.016       0.017       0.016
Page writes by reclaim        2893.000    1119.800     507.600
Page writes file                   112         108          81
Page writes anon                  2780        1011         425
Page reclaim immediate            1134        1197        1412
Sector Reads                   4747006     4606605     4524025
Sector Writes                 12759301    12562316    12629243
Page rescued immediate               0           0           0
Slabs scanned                  1807821     1528751     1498929
Direct inode steals              24428       12649       13346
Kswapd inode steals              33213       29804       30194
Kswapd skipped wait                  0           0           0
THP fault alloc                    217         207         222
THP collapse alloc                 500         539         373
THP splits                          13          14          13
THP fault fallback                  50           9          16
THP collapse fail                   16          17          22
Compaction stalls                 3136        2310        2072
Compaction success                1123         897         828
Compaction failures               2012        1413        1244
Page migrate success           4319697     3682666     3133699
Page migrate failure             19012        9964        7488
Compaction pages isolated      8974417     7634911     6528981
Compaction migrate scanned   182447073   122423031   127292737
Compaction free scanned      389193883   291257587   269516820
Compaction cost                   5931        4824        4267

As the allocation success rates degrade, so do the compaction success rates,
and so does the scanner activity.

Now after the series:

                             3.19-rc4              3.19-rc4              3.19-rc4
                            5-nothp-1             5-nothp-2             5-nothp-3
Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)

            3.19-rc4    3.19-rc4    3.19-rc4
           5-nothp-1   5-nothp-2   5-nothp-3
User         6675.75     6742.26     6707.92
System       1069.78     1070.77     1070.25
Elapsed      2450.31     2363.98     2442.21

                              3.19-rc4    3.19-rc4    3.19-rc4
                             5-nothp-1   5-nothp-2   5-nothp-3
Minor Faults                 197652452   197164571   197946824
Major Faults                       882         743         934
Swap Ins                           113          96         144
Swap Outs                         2144        1340        1799
Allocation stalls                10304        9060       10261
DMA allocs                         142         227         181
DMA32 allocs                 124497911   123947671   124010151                                                                                                                                                 
Normal allocs                 59233600    59160910    59669421                                                                                                                                                 
Movable allocs                       0           0           0                                                                                                                                                 
Direct pages scanned            591368      481119      525158                                                                                                                                                 
Kswapd pages scanned           2217381     2164036     2170388                                                                                                                                                 
Kswapd pages reclaimed         2212285     2160162     2166794                                                                                                                                                 
Direct pages reclaimed          590064      480297      524013                                                                                                                                                 
Kswapd efficiency                  99%         99%         99%                                                                                                                                                 
Kswapd velocity                921.119     937.864     936.558                                                                                                                                                 
Direct efficiency                  99%         99%         99%                                                                                                                                                 
Direct velocity                245.659     208.511     226.615                                                                                                                                                 
Percentage direct scans            21%         18%         19%                                                                                                                                                 
Zone normal velocity           378.957     376.819     383.604                                                                                                                                                 
Zone dma32 velocity            787.805     769.530     779.552                                                                                                                                                 
Zone dma velocity                0.015       0.025       0.016                                                                                                                                                 
Page writes by reclaim        2284.600    1432.400    1972.400                                                                                                                                                 
Page writes file                   140          92         173                                                                                                                                                 
Page writes anon                  2144        1340        1799                                                                                                                                                 
Page reclaim immediate            1689        1315        1440                                                                                                                                                 
Sector Reads                   4920699     4830343     4830263                                                                                                                                                 
Sector Writes                 12643658    12588410    12713518                                                                                                                                                 
Page rescued immediate               0           0           0                                                                                                                                                 
Slabs scanned                  2084358     1922421     1962867
Direct inode steals              35881       37170       45512
Kswapd inode steals              35973       35741       30339
Kswapd skipped wait                  0           0           0
THP fault alloc                    191         224         170
THP collapse alloc                 554         533         516
THP splits                          15          15          11
THP fault fallback                  45           0          76
THP collapse fail                   16          18          18
Compaction stalls                 4069        3746        3951
Compaction success                1507        1388        1366
Compaction failures               2562        2357        2585
Page migrate success           4533617     4912832     5278583
Page migrate failure             20127       15273       17862
Compaction pages isolated      9495626    10235697    10993942
Compaction migrate scanned    93585708    99204656    92052138
Compaction free scanned      510786018   503529022   541298357
Compaction cost                   5541        5988        6332

Less degradation in success rates and no degradation in activity.
Let's look again at the compaction stats without the series:

Compaction stalls                 3136        2310        2072
Compaction success                1123         897         828
Compaction failures               2012        1413        1244
Page migrate success           4319697     3682666     3133699
Page migrate failure             19012        9964        7488
Compaction pages isolated      8974417     7634911     6528981
Compaction migrate scanned   182447073   122423031   127292737
Compaction free scanned      389193883   291257587   269516820
Compaction cost                   5931        4824        4267

Interestingly, the number of migrate scanned pages was higher before the
series, even twice as high in the first iteration. That suggests that the scanner
was indeed scanning a limited number of pageblocks over and over, despite
their compaction potential being "depleted". After the patches, it achieves better
results with fewer scanned blocks, as it has a better chance of encountering
non-depleted blocks. The free scanner activity has increased after the series,
which just means that higher success rates lead to less deferred compaction.
There is also some increase in page migrations, but that is likely also a
consequence of better success rates and not due to useless back-and-forth
migrations caused by pivot changes.

Preliminary testing with THP-like allocations has shown similar improvements,
which is somewhat surprising, because THP allocations do not use sync compaction
and thus do not defer compaction; but changing the pivot is currently tied
to restarting from deferred compaction. Possibly this is due to allocation
activity other than the stress-highalloc test itself. I will post these
results when full measurements are done.

[1] http://lwn.net/Articles/368869/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
                   ` (5 preceding siblings ...)
  2015-01-19 10:10 ` [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
@ 2015-01-20  9:02 ` Vlastimil Babka
  2015-02-03  6:49 ` Joonsoo Kim
  7 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-20  9:02 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, David Rientjes, Christoph Lameter,
	Joonsoo Kim, Mel Gorman, Michal Nazarewicz, Minchan Kim,
	Naoya Horiguchi, Rik van Riel

On 01/19/2015 11:05 AM, Vlastimil Babka wrote:
> Preliminary testing with THP-like allocations has shown similar improvements,
> which is somewhat surprising, because THP allocations do not use sync
> and thus do not defer compaction; but changing the pivot is currently tied
> to restarting from deferred compaction. Possibly this is due to the other
> allocation activity than the stress-highalloc test itself. I will post these
> results when full measurements are done.

So here are the full results, before patch 5:

                             3.19-rc4              3.19-rc4              3.19-rc4
                              4-thp-1               4-thp-2               4-thp-3
Success 1 Min         24.00 (  0.00%)       14.00 ( 41.67%)       16.00 ( 33.33%)
Success 1 Mean        29.80 (  0.00%)       17.00 ( 42.95%)       19.80 ( 33.56%)
Success 1 Max         33.00 (  0.00%)       25.00 ( 24.24%)       29.00 ( 12.12%)
Success 2 Min         26.00 (  0.00%)       15.00 ( 42.31%)       17.00 ( 34.62%)
Success 2 Mean        32.60 (  0.00%)       19.20 ( 41.10%)       21.20 ( 34.97%)
Success 2 Max         37.00 (  0.00%)       28.00 ( 24.32%)       30.00 ( 18.92%)
Success 3 Min         85.00 (  0.00%)       82.00 (  3.53%)       80.00 (  5.88%)
Success 3 Mean        86.20 (  0.00%)       83.20 (  3.48%)       81.20 (  5.80%)
Success 3 Max         87.00 (  0.00%)       84.00 (  3.45%)       82.00 (  5.75%)

            3.19-rc4    3.19-rc4    3.19-rc4
             4-thp-1     4-thp-2     4-thp-3
User         6798.70     6905.43     6941.09
System       1064.04     1062.94     1062.76
Elapsed      2108.98     2026.84     2039.40

                              3.19-rc4    3.19-rc4    3.19-rc4
                               4-thp-1     4-thp-2     4-thp-3
Minor Faults                 198099852   197531505   197503750
Major Faults                       483         422         490
Swap Ins                            86          55         113
Swap Outs                         3138         887         399
Allocation stalls                 6523        4619        4816
DMA allocs                          20          21          29
DMA32 allocs                 124645618   123937328   123927500
Normal allocs                 58904757    58753982    58818313                                                                                                                                                 
Movable allocs                       0           0           0                                                                                                                                                 
Direct pages scanned            428919      361735      402238                                                                                                                                                 
Kswapd pages scanned           2110062     2070576     2042000                                                                                                                                                 
Kswapd pages reclaimed         2104207     2067640     2026147                                                                                                                                                 
Direct pages reclaimed          427965      361045      401398                                                                                                                                                 
Kswapd efficiency                  99%         99%         99%                                                                                                                                                 
Kswapd velocity               1019.443    1032.145    1015.507                                                                                                                                                 
Direct efficiency                  99%         99%         99%                                                                                                                                                 
Direct velocity                207.225     180.319     200.037                                                                                                                                                 
Percentage direct scans            16%         14%         16%                                                                                                                                                 
Zone normal velocity           408.313     391.138     385.933                                                                                                                                                 
Zone dma32 velocity            818.355     821.320     829.605                                                                                                                                                 
Zone dma velocity                0.000       0.006       0.006                                                                                                                                                 
Page writes by reclaim        3352.000     963.000     517.000                                                                                                                                                 
Page writes file                   213          75         117                                                                                                                                                 
Page writes anon                  3138         887         399                                                                                                                                                 
Page reclaim immediate            1144        1066       14181                                                                                                                                                 
Sector Reads                   4628816     4522758     4532221                                                                                                                                                 
Sector Writes                 12767080    12671744    12651826                                                                                                                                                 
Page rescued immediate               0           0           0                                                                                                                                                 
Slabs scanned                  1697125     1504150     1497662                                                                                                                                                 
Direct inode steals              18505       12670       12166
Kswapd inode steals              34595       31472       30069
Kswapd skipped wait                  0           0           0
THP fault alloc                    239         220         235
THP collapse alloc                 507         406         478
THP splits                          14          13          12
THP fault fallback                  28          25           6
THP collapse fail                   16          21          19
Compaction stalls                 2661        1953        2062
Compaction success                1005         738         832
Compaction failures               1656        1215        1229
Page migrate success           2137343     1941288     1984697
Page migrate failure              9345        5923        7109
Compaction pages isolated      4630988     4169422     4271015
Compaction migrate scanned    42109445    35090957    37926766
Compaction free scanned      150533970   131933800   126682354
Compaction cost                   2601        2339        2406

After patch 5:

                             3.19-rc4              3.19-rc4              3.19-rc4
                              5-thp-1               5-thp-2               5-thp-3
Success 1 Min         43.00 (  0.00%)       35.00 ( 18.60%)       33.00 ( 23.26%)
Success 1 Mean        44.80 (  0.00%)       37.20 ( 16.96%)       36.40 ( 18.75%)
Success 1 Max         46.00 (  0.00%)       38.00 ( 17.39%)       41.00 ( 10.87%)
Success 2 Min         53.00 (  0.00%)       44.00 ( 16.98%)       40.00 ( 24.53%)
Success 2 Mean        55.60 (  0.00%)       47.60 ( 14.39%)       43.60 ( 21.58%)
Success 2 Max         58.00 (  0.00%)       50.00 ( 13.79%)       46.00 ( 20.69%)
Success 3 Min         85.00 (  0.00%)       81.00 (  4.71%)       79.00 (  7.06%)
Success 3 Mean        86.40 (  0.00%)       83.40 (  3.47%)       80.20 (  7.18%)
Success 3 Max         88.00 (  0.00%)       85.00 (  3.41%)       82.00 (  6.82%)

            3.19-rc4    3.19-rc4    3.19-rc4
             5-thp-1     5-thp-2     5-thp-3
User         6690.38     6734.92     6772.31
System       1065.54     1063.26     1064.48
Elapsed      2172.84     2153.99     2162.77

                              3.19-rc4    3.19-rc4    3.19-rc4
                               5-thp-1     5-thp-2     5-thp-3
Minor Faults                 196642124   197072025   196458816
Major Faults                       712         629         663
Swap Ins                           112          89         119
Swap Outs                         2425        1615         955
Allocation stalls                10987        8689        8072
DMA allocs                          13          13          12
DMA32 allocs                 124048790   124176073   123636557
Normal allocs                 58641381    58758990    58548760
Movable allocs                       0           0           0
Direct pages scanned            607943      519167      491800
Kswapd pages scanned           2091239     2093530     2080230
Kswapd pages reclaimed         2085935     2090069     2076784
Direct pages reclaimed          607103      518178      490870
Kswapd efficiency                  99%         99%         99%
Kswapd velocity                953.698     980.452     978.692
Direct efficiency                  99%         99%         99%
Direct velocity                277.249     243.139     231.379
Percentage direct scans            22%         19%         19%
Zone normal velocity           411.137     400.519     393.191
Zone dma32 velocity            819.809     823.068     816.877
Zone dma velocity                0.000       0.004       0.004
Page writes by reclaim        2568.800    1735.400    1112.200
Page writes file                   143         120         156
Page writes anon                  2425        1615         955
Page reclaim immediate            1316        1240        1460
Sector Reads                   4801026     4810421     4766392
Sector Writes                 12521168    12599312    12490746
Page rescued immediate               0           0           0
Slabs scanned                  1998998     1873533     1826614
Direct inode steals              39894       44540       29447
Kswapd inode steals              34313       24112       32393
Kswapd skipped wait                  0           0           0
THP fault alloc                    204         227         190
THP collapse alloc                 554         555         463
THP splits                          14          16          11
THP fault fallback                   3           3           0
THP collapse fail                   15          16          20
Compaction stalls                 3742        3350        3257
Compaction success                1427        1298        1262
Compaction failures               2315        2051        1995
Page migrate success           2557969     2615986     2766293
Page migrate failure             11532        7297        8779
Compaction pages isolated      5601557     5698249     6044777
Compaction migrate scanned    16776380    20331554    22910553
Compaction free scanned      213184847   172781034   169740956
Compaction cost                   2879        2966        3146

Success rates are still worse on the second iteration compared to the first, but much better overall.

Let's repeat compaction stats before patch 5, for easier comparison:

Compaction stalls                 2661        1953        2062
Compaction success                1005         738         832
Compaction failures               1656        1215        1229
Page migrate success           2137343     1941288     1984697
Page migrate failure              9345        5923        7109
Compaction pages isolated      4630988     4169422     4271015
Compaction migrate scanned    42109445    35090957    37926766
Compaction free scanned      150533970   131933800   126682354
Compaction cost                   2601        2339        2406

Again, the number of migrate scanned pages actually went down after patch 5; the
rest went up as compaction was more successful and less often deferred.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] mm, compaction: more robust check for scanners meeting
  2015-01-19 10:05 ` [PATCH 1/5] mm, compaction: more robust check for scanners meeting Vlastimil Babka
@ 2015-01-20 13:26   ` Zhang Yanfei
  0 siblings, 0 replies; 22+ messages in thread
From: Zhang Yanfei @ 2015-01-20 13:26 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

On 2015/1/19 18:05, Vlastimil Babka wrote:
> Compaction should finish when the migration and free scanner meet, i.e. they
> reach the same pageblock. Currently however, the test in compact_finished()
> simply just compares the exact pfns, which may yield a false negative when the
> free scanner position is in the middle of a pageblock and the migration
> scanner reaches the beginning of the same pageblock.
> 
> This hasn't been a problem until commit e14c720efdd7 ("mm, compaction:
> remember position within pageblock in free pages scanner") allowed the free
> scanner position to be in the middle of a pageblock between invocations.
> The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in
> compact_zone") prevented the issue by adding a special check in the migration
> scanner to satisfy the current detection of scanners meeting.
> 
> However, the proper fix is to make the detection more robust. This patch
> introduces the compact_scanners_met() function that returns true when the free
> scanner position is in the same or lower pageblock than the migration scanner.
> The special case in isolate_migratepages() introduced by 1d5bfe1ffb5b is
> removed.
> 
> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/compaction.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 546e571..5fdbdb8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -803,6 +803,16 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
>  #endif /* CONFIG_COMPACTION || CONFIG_CMA */
>  #ifdef CONFIG_COMPACTION
>  /*
> + * Test whether the free scanner has reached the same or lower pageblock than
> + * the migration scanner, and compaction should thus terminate.
> + */
> +static inline bool compact_scanners_met(struct compact_control *cc)
> +{
> +	return (cc->free_pfn >> pageblock_order)
> +		<= (cc->migrate_pfn >> pageblock_order);
> +}
> +
> +/*
>   * Based on information in the current compact_control, find blocks
>   * suitable for isolating free pages from and then isolate them.
>   */
> @@ -1027,12 +1037,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	}
>  
>  	acct_isolated(zone, cc);
> -	/*
> -	 * Record where migration scanner will be restarted. If we end up in
> -	 * the same pageblock as the free scanner, make the scanners fully
> -	 * meet so that compact_finished() terminates compaction.
> -	 */
> -	cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;
> +	/* Record where migration scanner will be restarted. */
> +	cc->migrate_pfn = low_pfn;
>  
>  	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
>  }
> @@ -1047,7 +1053,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
>  		return COMPACT_PARTIAL;
>  
>  	/* Compaction run completes if the migrate and free scanner meet */
> -	if (cc->free_pfn <= cc->migrate_pfn) {
> +	if (compact_scanners_met(cc)) {
>  		/* Let the next compaction start anew. */
>  		zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
>  		zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> @@ -1238,7 +1244,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  			 * migrate_pages() may return -ENOMEM when scanners meet
>  			 * and we want compact_finished() to detect it
>  			 */
> -			if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
> +			if (err == -ENOMEM && !compact_scanners_met(cc)) {
>  				ret = COMPACT_PARTIAL;
>  				goto out;
>  			}
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner
  2015-01-19 10:05 ` [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner Vlastimil Babka
@ 2015-01-20 13:27   ` Zhang Yanfei
  2015-01-20 13:31     ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Zhang Yanfei @ 2015-01-20 13:27 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

Hello,

On 2015/1/19 18:05, Vlastimil Babka wrote:
> Handling the position where compaction free scanner should restart (stored in
> cc->free_pfn) got more complex with commit e14c720efdd7 ("mm, compaction:
> remember position within pageblock in free pages scanner"). Currently the
> position is updated in each loop iteration of isolate_freepages(), although it's
> enough to update it only when exiting the loop when we have found enough free
> pages, or detected contention in async compaction. Then an extra check outside
> the loop updates the position in case we have met the migration scanner.
> 
> This can be simplified if we move the test for having isolated enough from
> the for loop header next to the test for contention, and determine the restart
> position only in these cases. We can reuse the isolate_start_pfn variable for
> this instead of setting cc->free_pfn directly. Outside the loop, we can simply
> set cc->free_pfn to value of isolate_start_pfn without extra check.
> 
> We also add VM_BUG_ON to future-proof the code, in case somebody adds a new
> condition that terminates isolate_freepages_block() prematurely, which
> wouldn't also be considered in isolate_freepages().
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/compaction.c | 34 +++++++++++++++++++---------------
>  1 file changed, 19 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5fdbdb8..45799a4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -849,7 +849,7 @@ static void isolate_freepages(struct compact_control *cc)
>  	 * pages on cc->migratepages. We stop searching if the migrate
>  	 * and free page scanners meet or enough free pages are isolated.
>  	 */
> -	for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
> +	for (; block_start_pfn >= low_pfn;
>  				block_end_pfn = block_start_pfn,
>  				block_start_pfn -= pageblock_nr_pages,
>  				isolate_start_pfn = block_start_pfn) {
> @@ -883,6 +883,8 @@ static void isolate_freepages(struct compact_control *cc)
>  		nr_freepages += isolated;
>  
>  		/*
> +		 * If we isolated enough freepages, or aborted due to async
> +		 * compaction being contended, terminate the loop.
>  		 * Remember where the free scanner should restart next time,
>  		 * which is where isolate_freepages_block() left off.
>  		 * But if it scanned the whole pageblock, isolate_start_pfn
> @@ -891,28 +893,30 @@ static void isolate_freepages(struct compact_control *cc)
>  		 * In that case we will however want to restart at the start
>  		 * of the previous pageblock.
>  		 */
> -		cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
> -				isolate_start_pfn :
> -				block_start_pfn - pageblock_nr_pages;
> -
> -		/*
> -		 * isolate_freepages_block() might have aborted due to async
> -		 * compaction being contended
> -		 */
> -		if (cc->contended)
> +		if ((nr_freepages > cc->nr_migratepages) || cc->contended) {

Shouldn't this be nr_freepages >= cc->nr_migratepages?

Thanks

> +			if (isolate_start_pfn >= block_end_pfn)
> +				isolate_start_pfn =
> +					block_start_pfn - pageblock_nr_pages;
>  			break;
> +		} else {
> +			/*
> +			 * isolate_freepages_block() should not terminate
> +			 * prematurely unless contended, or isolated enough
> +			 */
> +			VM_BUG_ON(isolate_start_pfn < block_end_pfn);
> +		}
>  	}
>  
>  	/* split_free_page does not map the pages */
>  	map_pages(freelist);
>  
>  	/*
> -	 * If we crossed the migrate scanner, we want to keep it that way
> -	 * so that compact_finished() may detect this
> +	 * Record where the free scanner will restart next time. Either we
> +	 * broke from the loop and set isolate_start_pfn based on the last
> +	 * call to isolate_freepages_block(), or we met the migration scanner
> +	 * and the loop terminated due to isolate_start_pfn < low_pfn
>  	 */
> -	if (block_start_pfn < low_pfn)
> -		cc->free_pfn = cc->migrate_pfn;
> -
> +	cc->free_pfn = isolate_start_pfn;
>  	cc->nr_freepages = nr_freepages;
>  }
>  
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions
  2015-01-19 10:05 ` [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions Vlastimil Babka
@ 2015-01-20 13:30   ` Zhang Yanfei
  2015-01-22 15:22     ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Zhang Yanfei @ 2015-01-20 13:30 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

On 2015/1/19 18:05, Vlastimil Babka wrote:
> Reseting the cached compaction scanner positions is now done implicitly in
> __reset_isolation_suitable() and compact_finished(). Encapsulate the
> functionality in a new function reset_cached_positions() and call it
> explicitly where needed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Should the new function be inline?

Thanks.

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/compaction.c | 22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 45799a4..5626220 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -123,6 +123,13 @@ static inline bool isolation_suitable(struct compact_control *cc,
>  	return !get_pageblock_skip(page);
>  }
>  
> +static void reset_cached_positions(struct zone *zone)
> +{
> +	zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> +	zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> +	zone->compact_cached_free_pfn = zone_end_pfn(zone);
> +}
> +
>  /*
>   * This function is called to clear all cached information on pageblocks that
>   * should be skipped for page isolation when the migrate and free page scanner
> @@ -134,9 +141,6 @@ static void __reset_isolation_suitable(struct zone *zone)
>  	unsigned long end_pfn = zone_end_pfn(zone);
>  	unsigned long pfn;
>  
> -	zone->compact_cached_migrate_pfn[0] = start_pfn;
> -	zone->compact_cached_migrate_pfn[1] = start_pfn;
> -	zone->compact_cached_free_pfn = end_pfn;
>  	zone->compact_blockskip_flush = false;
>  
>  	/* Walk the zone and mark every pageblock as suitable for isolation */
> @@ -166,8 +170,10 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>  			continue;
>  
>  		/* Only flush if a full compaction finished recently */
> -		if (zone->compact_blockskip_flush)
> +		if (zone->compact_blockskip_flush) {
>  			__reset_isolation_suitable(zone);
> +			reset_cached_positions(zone);
> +		}
>  	}
>  }
>  
> @@ -1059,9 +1065,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
>  	/* Compaction run completes if the migrate and free scanner meet */
>  	if (compact_scanners_met(cc)) {
>  		/* Let the next compaction start anew. */
> -		zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> -		zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> -		zone->compact_cached_free_pfn = zone_end_pfn(zone);
> +		reset_cached_positions(zone);
>  
>  		/*
>  		 * Mark that the PG_migrate_skip information should be cleared
> @@ -1187,8 +1191,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  	 * is about to be retried after being deferred. kswapd does not do
>  	 * this reset as it'll reset the cached information when going to sleep.
>  	 */
> -	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
> +	if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) {
>  		__reset_isolation_suitable(zone);
> +		reset_cached_positions(zone);
> +	}
>  
>  	/*
>  	 * Setup to move all movable pages to the end of the zone. Used cached
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner
  2015-01-20 13:27   ` Zhang Yanfei
@ 2015-01-20 13:31     ` Vlastimil Babka
  0 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-20 13:31 UTC (permalink / raw)
  To: Zhang Yanfei, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

On 01/20/2015 02:27 PM, Zhang Yanfei wrote:
> Hello,
> 
> On 2015/1/19 18:05, Vlastimil Babka wrote:
>> @@ -883,6 +883,8 @@ static void isolate_freepages(struct compact_control *cc)
>>  		nr_freepages += isolated;
>>  
>>  		/*
>> +		 * If we isolated enough freepages, or aborted due to async
>> +		 * compaction being contended, terminate the loop.
>>  		 * Remember where the free scanner should restart next time,
>>  		 * which is where isolate_freepages_block() left off.
>>  		 * But if it scanned the whole pageblock, isolate_start_pfn
>> @@ -891,28 +893,30 @@ static void isolate_freepages(struct compact_control *cc)
>>  		 * In that case we will however want to restart at the start
>>  		 * of the previous pageblock.
>>  		 */
>> -		cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
>> -				isolate_start_pfn :
>> -				block_start_pfn - pageblock_nr_pages;
>> -
>> -		/*
>> -		 * isolate_freepages_block() might have aborted due to async
>> -		 * compaction being contended
>> -		 */
>> -		if (cc->contended)
>> +		if ((nr_freepages > cc->nr_migratepages) || cc->contended) {
> 
> Shouldn't this be nr_freepages >= cc->nr_migratepages?

Ah, of course. Thanks for catching that!
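
I.e. the exit test would then read:

		if ((nr_freepages >= cc->nr_migratepages) || cc->contended) {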


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone
  2015-01-19 10:05 ` [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone Vlastimil Babka
@ 2015-01-20 15:18   ` Zhang Yanfei
  2015-01-22 15:24     ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Zhang Yanfei @ 2015-01-20 15:18 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

Hello Vlastimil

On 2015/1/19 18:05, Vlastimil Babka wrote:
> Compaction employs two page scanners - migration scanner isolates pages to be
> the source of migration, free page scanner isolates pages to be the target of
> migration. Currently, migration scanner starts at the zone's first pageblock
> and progresses towards the last one. Free scanner starts at the last pageblock
> and progresses towards the first one. Within a pageblock, each scanner scans
> pages from the first to the last one. When the scanners meet within the same
> pageblock, compaction terminates.
> 
> One consequence of the current scheme, that turns out to be unfortunate, is
> that the migration scanner does not encounter the pageblocks which were
> scanned by the free scanner. In a test with stress-highalloc from mmtests,
> the scanners were observed to meet around the middle of the zone in first two
> phases (with background memory pressure) of the test when executed after fresh
> reboot. On further executions without reboot, the meeting point shifts to
> roughly third of the zone, and compaction activity as well as allocation
> success rates deteriorates compared to the run after fresh reboot.
> 
> It turns out that the deterioration is indeed due to the migration scanner
> processing only a small part of the zone. Compaction also keeps making this
> bias worse by its activity - by moving all migratable pages towards end of the
> zone, the free scanner has to scan a lot of full pageblocks to find more free
> pages. The beginning of the zone contains pageblocks that have been compacted
> as much as possible, but the free pages there cannot be further merged into
> larger orders due to unmovable pages. The rest of the zone might contain more
> suitable pageblocks, but the migration scanner will not reach them. It also
> isn't able to move movable pages out of unmovable pageblocks there, which
> affects fragmentation.
> 
> This patch is the first step to remove this bias. It allows the compaction
> scanners to start at arbitrary pfn (aligned to pageblock for practical
> purposes), called pivot, within the zone. The migration scanner starts at the
> exact pfn, the free scanner starts at the pageblock preceding the pivot. The
> direction of scanning is unaffected, but when the migration scanner reaches
> the last pageblock of the zone, or the free scanner reaches the first
> pageblock, they wrap and continue with the first or last pageblock,
> respectively. Compaction terminates when any of the scanners wrap and both
> meet within the same pageblock.
> 
> For easier bisection of potential regressions, this patch always uses the
> zone's first pfn as the pivot. That means the free scanner immediately wraps
> to the last pageblock and the operation of scanners is thus unchanged. The
> actual pivot changing is done by the next patch.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I read through the whole patch, and you can feel free to add:

Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

I agree with you and the approach to improve the current scheme. One thing
that should be treated carefully is how to avoid migrating pages back and forth,
since the pivot pfn can change. I see patch 5 has introduced a policy for
changing the pivot, so we can observe that carefully.

(The changes in the patch make the code more difficult to understand now...
and I just found a tiny mistake, please see below)

> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  include/linux/mmzone.h |   2 +
>  mm/compaction.c        | 204 +++++++++++++++++++++++++++++++++++++++++++------
>  mm/internal.h          |   1 +
>  3 files changed, 182 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2f0856d..47aa181 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -503,6 +503,8 @@ struct zone {
>  	unsigned long percpu_drift_mark;
>  
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> +	/* pfn where compaction scanners have initially started last time */
> +	unsigned long		compact_cached_pivot_pfn;
>  	/* pfn where compaction free scanner should start */
>  	unsigned long		compact_cached_free_pfn;
>  	/* pfn where async and sync compaction migration scanner should start */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 5626220..abae89a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -123,11 +123,16 @@ static inline bool isolation_suitable(struct compact_control *cc,
>  	return !get_pageblock_skip(page);
>  }
>  
> +/*
> + * Invalidate cached compaction scanner positions, so that compact_zone()
> + * will reinitialize them on the next compaction.
> + */
>  static void reset_cached_positions(struct zone *zone)
>  {
> -	zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
> -	zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
> -	zone->compact_cached_free_pfn = zone_end_pfn(zone);
> +	/* Invalid values are re-initialized in compact_zone */
> +	zone->compact_cached_migrate_pfn[0] = 0;
> +	zone->compact_cached_migrate_pfn[1] = 0;
> +	zone->compact_cached_free_pfn = 0;
>  }
>  
>  /*
> @@ -172,11 +177,35 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>  		/* Only flush if a full compaction finished recently */
>  		if (zone->compact_blockskip_flush) {
>  			__reset_isolation_suitable(zone);
> -			reset_cached_positions(zone);
> +			reset_cached_positions(zone, false);

The second argument should be in patch 5 instead of here, right? ^-^

Thanks.

>  		}
>  	}
>  }
>  
> +static void update_cached_migrate_pfn(unsigned long pfn,
> +		unsigned long pivot_pfn, unsigned long *old_pfn)
> +{
> +	/* Both old and new pfn either wrapped or not, and new is higher */
> +	if (((*old_pfn >= pivot_pfn) == (pfn >= pivot_pfn))
> +	    && (pfn > *old_pfn))
> +		*old_pfn = pfn;
> +	/* New pfn has wrapped and the old didn't yet */
> +	else if ((*old_pfn >= pivot_pfn) && (pfn < pivot_pfn))
> +		*old_pfn = pfn;
> +}
> +
> +static void update_cached_free_pfn(unsigned long pfn,
> +		unsigned long pivot_pfn, unsigned long *old_pfn)
> +{
> +	/* Both old and new either pfn wrapped or not, and new is lower */
> +	if (((*old_pfn < pivot_pfn) == (pfn < pivot_pfn))
> +	    && (pfn < *old_pfn))
> +		*old_pfn = pfn;
> +	/* New pfn has wrapped and the old didn't yet */
> +	else if ((*old_pfn < pivot_pfn) && (pfn >= pivot_pfn))
> +		*old_pfn = pfn;
> +}
> +
>  /*
>   * If no pages were isolated then mark this pageblock to be skipped in the
>   * future. The information is later cleared by __reset_isolation_suitable().
> @@ -186,6 +215,7 @@ static void update_pageblock_skip(struct compact_control *cc,
>  			bool migrate_scanner)
>  {
>  	struct zone *zone = cc->zone;
> +	unsigned long pivot_pfn = cc->pivot_pfn;
>  	unsigned long pfn;
>  
>  	if (cc->ignore_skip_hint)
> @@ -203,14 +233,14 @@ static void update_pageblock_skip(struct compact_control *cc,
>  
>  	/* Update where async and sync compaction should restart */
>  	if (migrate_scanner) {
> -		if (pfn > zone->compact_cached_migrate_pfn[0])
> -			zone->compact_cached_migrate_pfn[0] = pfn;
> -		if (cc->mode != MIGRATE_ASYNC &&
> -		    pfn > zone->compact_cached_migrate_pfn[1])
> -			zone->compact_cached_migrate_pfn[1] = pfn;
> +		update_cached_migrate_pfn(pfn, pivot_pfn,
> +					&zone->compact_cached_migrate_pfn[0]);
> +		if (cc->mode != MIGRATE_ASYNC)
> +			update_cached_migrate_pfn(pfn, pivot_pfn,
> +					&zone->compact_cached_migrate_pfn[1]);
>  	} else {
> -		if (pfn < zone->compact_cached_free_pfn)
> -			zone->compact_cached_free_pfn = pfn;
> +		update_cached_free_pfn(pfn, pivot_pfn,
> +					&zone->compact_cached_free_pfn);
>  	}
>  }
>  #else
> @@ -808,14 +838,41 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
>  
>  #endif /* CONFIG_COMPACTION || CONFIG_CMA */
>  #ifdef CONFIG_COMPACTION
> +static inline bool migrate_scanner_wrapped(struct compact_control *cc)
> +{
> +	return cc->migrate_pfn < cc->pivot_pfn;
> +}
> +
> +static inline bool free_scanner_wrapped(struct compact_control *cc)
> +{
> +	return cc->free_pfn >= cc->pivot_pfn;
> +}
> +
>  /*
>   * Test whether the free scanner has reached the same or lower pageblock than
>   * the migration scanner, and compaction should thus terminate.
>   */
>  static inline bool compact_scanners_met(struct compact_control *cc)
>  {
> -	return (cc->free_pfn >> pageblock_order)
> -		<= (cc->migrate_pfn >> pageblock_order);
> +	bool free_below_migrate = (cc->free_pfn >> pageblock_order)
> +		                <= (cc->migrate_pfn >> pageblock_order);
> +
> +	if (migrate_scanner_wrapped(cc) != free_scanner_wrapped(cc))
> +		/*
> +		 * Only one of the scanners have wrapped. Terminate if free
> +		 * scanner is in the same or lower pageblock than migration
> +		 * scanner.
> +		*/
> +		return free_below_migrate;
> +	else
> +		/*
> +		 * If neither scanner has wrapped, then free < start <=
> +		 * migration and we return false by definition.
> +		 * It shouldn't happen that both have wrapped, but even if it
> +		 * does due to e.g. reading mismatched zone cached pfn's, then
> +		 * migration < start <= free, so we return true and terminate.
> +		 */
> +		return !free_below_migrate;
>  }
>  
>  /*
> @@ -832,7 +889,10 @@ static void isolate_freepages(struct compact_control *cc)
>  	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
>  	int nr_freepages = cc->nr_freepages;
>  	struct list_head *freelist = &cc->freepages;
> +	bool wrapping; /* set to true in the first pageblock of the zone */
> +	bool wrapped; /* set to true when either scanner has wrapped */
>  
> +wrap:
>  	/*
>  	 * Initialise the free scanner. The starting point is where we last
>  	 * successfully isolated from, zone-cached value, or the end of the
> @@ -848,14 +908,25 @@ static void isolate_freepages(struct compact_control *cc)
>  	block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
>  	block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
>  						zone_end_pfn(zone));
> -	low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
> +
> +	wrapping = false;
> +	wrapped = free_scanner_wrapped(cc) || migrate_scanner_wrapped(cc);
> +	if (!wrapped)
> +		/* 
> +		 * If neither scanner wrapped yet, we are limited by zone's
> +		 * beginning. Here we pretend that the zone starts pageblock
> +		 * aligned to make the for-loop condition simpler.
> +		 */
> +		low_pfn = zone->zone_start_pfn & ~(pageblock_nr_pages-1);
> +	else
> +		low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
>  
>  	/*
>  	 * Isolate free pages until enough are available to migrate the
>  	 * pages on cc->migratepages. We stop searching if the migrate
>  	 * and free page scanners meet or enough free pages are isolated.
>  	 */
> -	for (; block_start_pfn >= low_pfn;
> +	for (; !wrapping && block_start_pfn >= low_pfn;
>  				block_end_pfn = block_start_pfn,
>  				block_start_pfn -= pageblock_nr_pages,
>  				isolate_start_pfn = block_start_pfn) {
> @@ -870,6 +941,24 @@ static void isolate_freepages(struct compact_control *cc)
>  						&& compact_should_abort(cc))
>  			break;
>  
> +		/*
> +		 * When we are limited by zone boundary, this means we have
> +		 * reached its first pageblock.
> +		 */
> +		if (!wrapped && block_start_pfn <= zone->zone_start_pfn) {
> +			/* The zone might start in the middle of the pageblock */
> +			block_start_pfn = zone->zone_start_pfn;
> +			if (isolate_start_pfn <= zone->zone_start_pfn)
> +				isolate_start_pfn = zone->zone_start_pfn;
> +			/*
> +			 * For e.g. DMA zone with zone_start_pfn == 1, we will
> +			 * underflow block_start_pfn in the next loop
> +			 * iteration. We have to terminate the loop with other
> +			 * means.
> +			 */
> +			wrapping = true;
> +		}
> +
>  		page = pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
>  									zone);
>  		if (!page)
> @@ -903,6 +992,12 @@ static void isolate_freepages(struct compact_control *cc)
>  			if (isolate_start_pfn >= block_end_pfn)
>  				isolate_start_pfn =
>  					block_start_pfn - pageblock_nr_pages;
> +			else if (wrapping)
> +				/*
> +				 * We have been in the first pageblock of the
> +				 * zone, but have not finished it yet.
> +				 */
> +				wrapping = false;
>  			break;
>  		} else {
>  			/*
> @@ -913,6 +1008,20 @@ static void isolate_freepages(struct compact_control *cc)
>  		}
>  	}
>  
> +	/* Did we reach the beginning of the zone? Wrap to the end. */
> +	if (!wrapped && wrapping) {
> +		isolate_start_pfn = (zone_end_pfn(zone)-1) &
> +						~(pageblock_nr_pages-1);
> +		/*
> +		 * If we haven't isolated anything, we have to continue
> +		 * immediately, otherwise page migration will fail.
> +		 */
> +		if (!nr_freepages && !cc->contended) {
> +			cc->free_pfn = isolate_start_pfn;
> +			goto wrap;
> +		}
> +	}
> +
>  	/* split_free_page does not map the pages */
>  	map_pages(freelist);
>  
> @@ -984,10 +1093,11 @@ typedef enum {
>  static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  					struct compact_control *cc)
>  {
> -	unsigned long low_pfn, end_pfn;
> +	unsigned long low_pfn, end_pfn, max_pfn;
>  	struct page *page;
>  	const isolate_mode_t isolate_mode =
>  		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> +	bool wrapped = migrate_scanner_wrapped(cc) || free_scanner_wrapped(cc);
>  
>  	/*
>  	 * Start at where we last stopped, or beginning of the zone as
> @@ -998,13 +1108,27 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	/* Only scan within a pageblock boundary */
>  	end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
>  
> +	if (!wrapped) {
> +		/* 
> +		 * Neither of the scanners has wrapped yet, we are limited by
> +		 * zone end. Here we pretend it's aligned to pageblock
> +		 * boundary to make the for-loop condition simpler
> +		 */
> +		max_pfn = ALIGN(zone_end_pfn(zone), pageblock_nr_pages);
> +	} else {
> +		/* If any of the scanners wrapped, we will meet free scanner */
> +		max_pfn = cc->free_pfn;
> +	}
> +
>  	/*
>  	 * Iterate over whole pageblocks until we find the first suitable.
> -	 * Do not cross the free scanner.
> +	 * Do not cross the free scanner or the end of the zone.
>  	 */
> -	for (; end_pfn <= cc->free_pfn;
> +	for (; end_pfn <= max_pfn;
>  			low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
>  
> +		if (!wrapped && end_pfn > zone_end_pfn(zone))
> +			end_pfn = zone_end_pfn(zone);
>  		/*
>  		 * This can potentially iterate a massively long zone with
>  		 * many pageblocks unsuitable, so periodically check if we
> @@ -1047,6 +1171,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	}
>  
>  	acct_isolated(zone, cc);
> +	/* Did we reach the end of the zone? Wrap to the beginning */
> +	if (!wrapped && low_pfn >= zone_end_pfn(zone))
> +		low_pfn = zone->zone_start_pfn;
> +
>  	/* Record where migration scanner will be restarted. */
>  	cc->migrate_pfn = low_pfn;
>  
> @@ -1197,22 +1325,48 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  	}
>  
>  	/*
> -	 * Setup to move all movable pages to the end of the zone. Used cached
> +	 * Setup the scanner positions according to pivot pfn. Use cached
>  	 * information on where the scanners should start but check that it
>  	 * is initialised by ensuring the values are within zone boundaries.
>  	 */
> -	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
> -	cc->free_pfn = zone->compact_cached_free_pfn;
> -	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn) {
> -		cc->free_pfn = end_pfn & ~(pageblock_nr_pages-1);
> -		zone->compact_cached_free_pfn = cc->free_pfn;
> +	cc->pivot_pfn = zone->compact_cached_pivot_pfn;
> +	if (cc->pivot_pfn < start_pfn || cc->pivot_pfn > end_pfn) {
> +		cc->pivot_pfn = start_pfn;
> +		zone->compact_cached_pivot_pfn = cc->pivot_pfn;
> +		/* When starting position was invalid, reset the rest */
> +		reset_cached_positions(zone);
>  	}
> +
> +	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
>  	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn > end_pfn) {
> -		cc->migrate_pfn = start_pfn;
> +		cc->migrate_pfn = cc->pivot_pfn;
>  		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
>  		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
>  	}
>  
> +	cc->free_pfn = zone->compact_cached_free_pfn;
> +	if (cc->free_pfn < start_pfn || cc->free_pfn > end_pfn)
> +		cc->free_pfn = cc->pivot_pfn;
> +
> +	/*
> +	 * Free scanner should start on the beginning of the pageblock below
> +	 * the cc->pivot_pfn. If that's below the zone boundary, wrap to the
> +	 * last pageblock of the zone.
> +	 */
> +	if (cc->free_pfn == cc->pivot_pfn) {
> +		/* Don't underflow in zones starting with e.g. pfn 1 */
> +		if (cc->pivot_pfn < pageblock_nr_pages) {
> +			cc->free_pfn = (end_pfn-1) & ~(pageblock_nr_pages-1);
> +		} else {
> +			cc->free_pfn = (cc->pivot_pfn - pageblock_nr_pages);
> +			cc->free_pfn &= ~(pageblock_nr_pages-1);
> +			if (cc->free_pfn < start_pfn)
> +				cc->free_pfn = (end_pfn-1) &
> +					~(pageblock_nr_pages-1);
> +		}
> +		zone->compact_cached_free_pfn = cc->free_pfn;
> +	}
> +
>  	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, cc->free_pfn, end_pfn);
>  
>  	migrate_prep_local();
> diff --git a/mm/internal.h b/mm/internal.h
> index efad241..cb7b297 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,6 +157,7 @@ struct compact_control {
>  	struct list_head migratepages;	/* List of pages being migrated */
>  	unsigned long nr_freepages;	/* Number of isolated free pages */
>  	unsigned long nr_migratepages;	/* Number of pages to migrate */
> +	unsigned long pivot_pfn;	/* Where the scanners initially start */
>  	unsigned long free_pfn;		/* isolate_freepages search base */
>  	unsigned long migrate_pfn;	/* isolate_migratepages search base */
>  	enum migrate_mode mode;		/* Async or sync migration mode */
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread
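
The wrap-aware bookkeeping reviewed above is easier to follow outside the
kernel context. The sketch below restates the migrate-scanner update rule from
the quoted hunk as a self-contained C program: positions at or above the pivot
have not wrapped yet, positions below it have, and the cached value only moves
"forward" in that wrapped sense. The function body mirrors the patch; the pivot
and pfn values in main() are made up for illustration.

#include <assert.h>

static void update_cached_migrate_pfn(unsigned long pfn,
		unsigned long pivot_pfn, unsigned long *old_pfn)
{
	/* Both old and new pfn either wrapped or not, and new is higher */
	if (((*old_pfn >= pivot_pfn) == (pfn >= pivot_pfn)) && (pfn > *old_pfn))
		*old_pfn = pfn;
	/* New pfn has wrapped and the old one did not yet */
	else if ((*old_pfn >= pivot_pfn) && (pfn < pivot_pfn))
		*old_pfn = pfn;
}

int main(void)
{
	unsigned long pivot = 1000, cached = 1000;

	update_cached_migrate_pfn(1500, pivot, &cached);  /* advance past pivot */
	assert(cached == 1500);
	update_cached_migrate_pfn(1200, pivot, &cached);  /* going backwards: ignored */
	assert(cached == 1500);
	update_cached_migrate_pfn(100, pivot, &cached);   /* scanner wrapped to zone start */
	assert(cached == 100);
	update_cached_migrate_pfn(1600, pivot, &cached);  /* stale un-wrapped pfn: ignored */
	assert(cached == 100);
	return 0;
}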

* Re: [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions
  2015-01-20 13:30   ` Zhang Yanfei
@ 2015-01-22 15:22     ` Vlastimil Babka
  0 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-22 15:22 UTC (permalink / raw)
  To: Zhang Yanfei, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

On 01/20/2015 02:30 PM, Zhang Yanfei wrote:
> On 2015/1/19 18:05, Vlastimil Babka wrote:
>> Resetting the cached compaction scanner positions is now done implicitly in
>> __reset_isolation_suitable() and compact_finished(). Encapsulate the
>> functionality in a new function reset_cached_positions() and call it
>> explicitly where needed.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Thanks.

> Should the new function be inline?

I'll try comparing with bloat-o-meter before next submission, if there's 
any difference.

> Thanks.
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone
  2015-01-20 15:18   ` Zhang Yanfei
@ 2015-01-22 15:24     ` Vlastimil Babka
  0 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2015-01-22 15:24 UTC (permalink / raw)
  To: Zhang Yanfei, linux-mm
  Cc: linux-kernel, Andrew Morton, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, David Rientjes

On 01/20/2015 04:18 PM, Zhang Yanfei wrote:
> Hello Vlastimil
>
> On 2015/1/19 18:05, Vlastimil Babka wrote:
>> Compaction employs two page scanners - migration scanner isolates pages to be
>> the source of migration, free page scanner isolates pages to be the target of
>> migration. Currently, migration scanner starts at the zone's first pageblock
>> and progresses towards the last one. Free scanner starts at the last pageblock
>> and progresses towards the first one. Within a pageblock, each scanner scans
>> pages from the first to the last one. When the scanners meet within the same
>> pageblock, compaction terminates.
>>
>> One consequence of the current scheme, that turns out to be unfortunate, is
>> that the migration scanner does not encounter the pageblocks which were
>> scanned by the free scanner. In a test with stress-highalloc from mmtests,
>> the scanners were observed to meet around the middle of the zone in first two
>> phases (with background memory pressure) of the test when executed after fresh
>> reboot. On further executions without reboot, the meeting point shifts to
>> roughly third of the zone, and compaction activity as well as allocation
>> success rates deteriorates compared to the run after fresh reboot.
>>
>> It turns out that the deterioration is indeed due to the migration scanner
>> processing only a small part of the zone. Compaction also keeps making this
>> bias worse by its activity - by moving all migratable pages towards end of the
>> zone, the free scanner has to scan a lot of full pageblocks to find more free
>> pages. The beginning of the zone contains pageblocks that have been compacted
>> as much as possible, but the free pages there cannot be further merged into
>> larger orders due to unmovable pages. The rest of the zone might contain more
>> suitable pageblocks, but the migration scanner will not reach them. It also
>> isn't able to move movable pages out of unmovable pageblocks there, which
>> affects fragmentation.
>>
>> This patch is the first step to remove this bias. It allows the compaction
>> scanners to start at arbitrary pfn (aligned to pageblock for practical
>> purposes), called pivot, within the zone. The migration scanner starts at the
>> exact pfn, the free scanner starts at the pageblock preceding the pivot. The
>> direction of scanning is unaffected, but when the migration scanner reaches
>> the last pageblock of the zone, or the free scanner reaches the first
>> pageblock, they wrap and continue with the first or last pageblock,
>> respectively. Compaction terminates when any of the scanners wrap and both
>> meet within the same pageblock.
>>
>> For easier bisection of potential regressions, this patch always uses the
>> zone's first pfn as the pivot. That means the free scanner immediately wraps
>> to the last pageblock and the operation of scanners is thus unchanged. The
>> actual pivot changing is done by the next patch.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> I read through the whole patch, and you can feel free to add:
>
> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

Thanks.

> I agree with you and the approach to improve the current scheme. One thing
> that should be treated carefully is how to avoid migrating pages back and forth,
> since the pivot pfn can change. I see patch 5 has introduced a policy for
> changing the pivot, so we can observe that carefully.
>
> (The changes in the patch make the code more difficult to understand now...
> and I just found a tiny mistake, please see below)
>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: David Rientjes <rientjes@google.com>
>> ---
>>   include/linux/mmzone.h |   2 +
>>   mm/compaction.c        | 204 +++++++++++++++++++++++++++++++++++++++++++------
>>   mm/internal.h          |   1 +
>>   3 files changed, 182 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 2f0856d..47aa181 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -503,6 +503,8 @@ struct zone {
>>   	unsigned long percpu_drift_mark;
>>
>>   #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>> +	/* pfn where compaction scanners have initially started last time */
>> +	unsigned long		compact_cached_pivot_pfn;
>>   	/* pfn where compaction free scanner should start */
>>   	unsigned long		compact_cached_free_pfn;
>>   	/* pfn where async and sync compaction migration scanner should start */
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 5626220..abae89a 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -123,11 +123,16 @@ static inline bool isolation_suitable(struct compact_control *cc,
>>   	return !get_pageblock_skip(page);
>>   }
>>
>> +/*
>> + * Invalidate cached compaction scanner positions, so that compact_zone()
>> + * will reinitialize them on the next compaction.
>> + */
>>   static void reset_cached_positions(struct zone *zone)
>>   {
>> -	zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
>> -	zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
>> -	zone->compact_cached_free_pfn = zone_end_pfn(zone);
>> +	/* Invalid values are re-initialized in compact_zone */
>> +	zone->compact_cached_migrate_pfn[0] = 0;
>> +	zone->compact_cached_migrate_pfn[1] = 0;
>> +	zone->compact_cached_free_pfn = 0;
>>   }
>>
>>   /*
>> @@ -172,11 +177,35 @@ void reset_isolation_suitable(pg_data_t *pgdat)
>>   		/* Only flush if a full compaction finished recently */
>>   		if (zone->compact_blockskip_flush) {
>>   			__reset_isolation_suitable(zone);
>> -			reset_cached_positions(zone);
>> +			reset_cached_positions(zone, false);
>
> The second argument should be in patch 5 instead of here, right? ^-^

Yes, a mistake with some last-minute rebasing :)

> Thanks.
>


^ permalink raw reply	[flat|nested] 22+ messages in thread
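
The other piece of patch 4 reviewed in this sub-thread is where the free
scanner starts relative to the pivot (the compact_zone() hunk quoted in the
review above). Below is a compressed, self-contained C model of that setup:
the free scanner begins in the pageblock just below the pivot and wraps to the
zone's last pageblock when that would underflow or fall below the zone start.
PAGEBLOCK_NR_PAGES and the pfn values in main() are placeholders, not the
kernel's.

#include <assert.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* placeholder for pageblock_nr_pages */

static unsigned long initial_free_pfn(unsigned long zone_start_pfn,
				      unsigned long zone_end_pfn,
				      unsigned long pivot_pfn)
{
	unsigned long last_block = (zone_end_pfn - 1) & ~(PAGEBLOCK_NR_PAGES - 1);
	unsigned long free_pfn;

	/* Don't underflow in zones starting with e.g. pfn 1 */
	if (pivot_pfn < PAGEBLOCK_NR_PAGES)
		return last_block;

	free_pfn = (pivot_pfn - PAGEBLOCK_NR_PAGES) & ~(PAGEBLOCK_NR_PAGES - 1);
	if (free_pfn < zone_start_pfn)
		free_pfn = last_block;
	return free_pfn;
}

int main(void)
{
	/* Pivot at the zone start: the free scanner wraps straight to the
	 * last pageblock, matching the pre-pivot behaviour. */
	assert(initial_free_pfn(1, 262144, 1) == 261632);
	/* Pivot in the middle: the free scanner starts one pageblock below. */
	assert(initial_free_pfn(1, 262144, 131072) == 130560);
	return 0;
}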

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
                   ` (6 preceding siblings ...)
  2015-01-20  9:02 ` Vlastimil Babka
@ 2015-02-03  6:49 ` Joonsoo Kim
  2015-02-03  9:05   ` Vlastimil Babka
  7 siblings, 1 reply; 22+ messages in thread
From: Joonsoo Kim @ 2015-02-03  6:49 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	Christoph Lameter, Mel Gorman, Michal Nazarewicz, Minchan Kim,
	Naoya Horiguchi, Rik van Riel

On Mon, Jan 19, 2015 at 11:05:15AM +0100, Vlastimil Babka wrote:
> Even after all the patches compaction received in last several versions, it
> turns out that its effectivneess degrades considerably as the system ages
> after reboot. For example, see how success rates of stress-highalloc from
> mmtests degrades when we re-execute it several times, first time being after
> fresh reboot:
>                              3.19-rc4              3.19-rc4              3.19-rc4
>                             4-nothp-1             4-nothp-2             4-nothp-3
> Success 1 Min         25.00 (  0.00%)       13.00 ( 48.00%)        9.00 ( 64.00%)
> Success 1 Mean        36.20 (  0.00%)       23.40 ( 35.36%)       16.40 ( 54.70%)
> Success 1 Max         41.00 (  0.00%)       34.00 ( 17.07%)       25.00 ( 39.02%)
> Success 2 Min         25.00 (  0.00%)       15.00 ( 40.00%)       10.00 ( 60.00%)
> Success 2 Mean        37.20 (  0.00%)       25.00 ( 32.80%)       17.20 ( 53.76%)
> Success 2 Max         44.00 (  0.00%)       36.00 ( 18.18%)       25.00 ( 43.18%)
> Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       78.00 (  7.14%)
> Success 3 Mean        85.80 (  0.00%)       82.80 (  3.50%)       80.40 (  6.29%)
> Success 3 Max         87.00 (  0.00%)       84.00 (  3.45%)       82.00 (  5.75%)
> 
> Wouldn't it be much better, if it looked like this?
> 
>                            3.18                  3.18                  3.18
>                              3.19-rc4              3.19-rc4              3.19-rc4
>                             5-nothp-1             5-nothp-2             5-nothp-3
> Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
> Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
> Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
> Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
> Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
> Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
> Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
> Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
> Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
> 
> In my humble opinion, it would :) Much lower degradation, and a nice
> improvement in the first iteration as a bonus.
> 
> So what sorcery is this? Nothing much, just a fundamental change of the
> compaction scanners operation...
> 
> As everyone knows [1] the migration scanner starts at the first pageblock
> of a zone, and goes towards the end, and the free scanner starts at the
> last pageblock and goes towards the beginning. Somewhere in the middle of the
> zone, the scanners meet:
> 
>    zone_start                                                   zone_end
>        |                                                           |
>        -------------------------------------------------------------
>        MMMMMMMMMMMMM| =>                            <= |FFFFFFFFFFFF
>                migrate_pfn                         free_pfn
> 
> In my tests, the scanners meet around the middle of the zone on the first
> iteration, and around the 1/3 on subsequent iterations. Which means the
> migration scanner doesn't see the larger part of the zone at all. For more
> details why it's bad, see Patch 4 description.
> 
> To make sure we eventually scan the whole zone with the migration scanner, we
> could e.g. reverse the directions after each run. But that would still be
> biased, and with 1/3 of zone reachable from each side, we would still omit the
> middle 1/3 of a zone.
> 
> Or we could stop terminating compaction when the scanners meet, and let them
> continue to scan the whole zone. Mel told me it used to be the case long ago,
> but that approach would result in migrating pages back and forth during single
> compaction run, which wouldn't be cool.
> 
> So the approach taken by this patchset is to let scanners start at any
> so-called pivot pfn within the zone, and keep their direction:
> 
>    zone_start                     pivot                         zone_end
>        |                            |                              |
>        -------------------------------------------------------------
>                          <= |FFFFFFFMMMMMM| =>
>                         free_pfn     migrate_pfn
> 
> Eventually, one of the scanners will reach the zone boundary and wrap around,
> e.g. the in the case of the free scanner:
> 
>    zone_start                     pivot                         zone_end
>        |                            |                              |
>        -------------------------------------------------------------
>        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFMMMMMMMMMMMM| =>           <= |F
>                                            migrate_pfn        free_pfn
> 
> Compaction will again terminate when the scanners meet.
> 
> 
> As you can imagine, the required code changes made the termination detection
> and the scanners themselves quite hairy. There are lots of corner cases and
> the code is often hard to wrap one's head around [puns intended]. The scanner
> functions isolate_migratepages() and isolate_freepages() were recently cleaned
> up a lot, and this makes them messy again, as they can no longer rely on the
> fact that they will meet the other scanner and not the zone boundary.
> 
> But the improvements seem to make these complications worth it, and I hope
> somebody can suggest more elegant solutions to the various parts of the code.
> So here it is as a RFC. Patches 1-3 are cleanups that could be applied in any
> case. Patch 4 implements the main changes, but leaves the pivot to be the
> first zone's pfn, so that free scanner wraps immediately and there's no
> actual change. Patch 5 updates the pivot in a conservative way, as explained
> in the changelog.

Hello,

I don't have an elegant idea, but I have some humble opinions.

The point is that the migrate scanner should scan the whole zone.
Although your pivot approach makes some sense and it can scan the whole zone,
it could cause back and forth migration in a very short term whenever
both scanners move toward and pass each other. I think that if we permit
overlap of the scanners, we don't need to adhere to reverse linear scanning
in the freepage scanner, since a reverse linear scan doesn't prevent back and
forth migration any more.

There are two possible solutions to this problem.
One is that the free scanner scans pfns in the same direction as the migrate
scanner, keeping a proper interval between them.

|=========================|
MMM==>  <Interval>  FFF==>

A large enough interval guarantees that back and forth migration is prevented,
at least over a very short period.

Or, we could make the free scanner something totally different from a linear
scan. Linear scanning for free pages wastes much time if system memory
is really big and most of it is used. If we take free pages from the
buddy allocator, we can eliminate this scanning overhead. With additional
logic, that is, comparing the position of a free page with the migrate scanner
and taking it selectively, we can avoid back and forth migration
in a very short period.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread
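
To make the first alternative above a bit more concrete, here is a rough,
purely hypothetical C sketch of the "same direction with an interval" idea:
both scanners advance towards higher pfns, and the free scanner may only
isolate target pages that lie at least 'interval' pfns ahead of the migration
scanner, so that the region being filled with migrated pages stays well ahead
of the region being drained. None of these names exist in the kernel; this is
only an illustration of the suggestion.

#include <stdbool.h>

/* Hypothetical check for the suggested scheme: keep the free scanner at
 * least 'interval' pfns ahead of the migration scanner. */
static bool free_block_usable(unsigned long migrate_pfn,
			      unsigned long free_block_pfn,
			      unsigned long interval)
{
	return free_block_pfn >= migrate_pfn + interval;
}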

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-02-03  6:49 ` Joonsoo Kim
@ 2015-02-03  9:05   ` Vlastimil Babka
  2015-02-03 15:00     ` Joonsoo Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-02-03  9:05 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: linux-mm, linux-kernel, Andrew Morton, David Rientjes,
	Christoph Lameter, Mel Gorman, Michal Nazarewicz, Minchan Kim,
	Naoya Horiguchi, Rik van Riel

On 02/03/2015 07:49 AM, Joonsoo Kim wrote:
> On Mon, Jan 19, 2015 at 11:05:15AM +0100, Vlastimil Babka wrote:
> 
> Hello,
> 
> I don't have any elegant idea, but, have some humble opinion.
> 
> The point is that migrate scanner should scan whole zone.
> Although your pivot approach makes some sense and it can scan whole zone,
> it could cause back and forth migration in a very short term whenever
> both scanners get toward and passed each other.

I don't understand the scenario you suggest? The scanners don't overlap in any
single run, that doesn't change. If they meet, compaction terminates. They can
"overlap" if you compare the current run with previous run, after pivot change.
The it's true that e.g. migration scanner will operate on pageblocks where the
free scanner has operated on previously. But pivot changes are only done after
the full defer cycle, which is not short term.

> I think that if we permit
> overlap of scanner, we don't need to adhere to reverse linear scanning
> in freepage scanner since reverse liner scan doesn't prevent back and
> forth migration from now on.

I believe that we still don't permit overlap, but anyway...

> There are two solutions on this problem.
> One is that free scanner scans pfn in same direction where migrate scanner
> goes with having proper interval.
> 
> |=========================|
> MMM==>  <Interval>  FFF==>
> 
> Enough interval guarantees to prevent back and forth migration,
> at least, in a very short period.

That would depend on the termination criteria and what to do after restart.
You would have to terminate as soon as one scanner approaches the position where
the other started. Otherwise you overlap and migrate back in a single run. So
you terminate and that will typically mean one of the scanners did not finish
its part fully, so there are pageblocks scanned by neither one. You could adjust
the interval to find the optimal one. But you shouldn't do it immediately next
run, as that would overlap the previous run too soon. Or maybe adjust it only a
little... I don't know if that's simpler than my approach, it seems more quirky.

> Or, we could make free scanner totally different with linear scan.
> Linear scanning to get freepage wastes much time if system memory
> is really big and most of it is used. If we takes freepage from the
> buddy, we can eliminate this scanning overhead. With additional
> logic, that is, comparing position of freepage with migrate scanner
> and selectively taking it, we can avoid back and forth migration
> in a very short period.

I think the metric we should be looking at is the ratio between free pages scanned
and migrate pages scanned. It's true that after this series, in the
allocate-as-thp scenario, it was more than 10 in the first run after reboot.
So maybe it could be more efficient to search the buddy lists. But then again,
when to terminate in this case? The free lists are changing continuously. And to
compare position, you also need to predict how much the migrate scanner will
progress in the single run, because you don't want to take pages from there.

> Thanks.
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread
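
For anyone who wants to watch that ratio on a running system, the relevant
counters are exported in /proc/vmstat as compact_migrate_scanned and
compact_free_scanned. A minimal reader, kept deliberately simple and with only
basic error handling, could look like this:

#include <stdio.h>
#include <string.h>

/* Print the free-scanned / migrate-scanned ratio from /proc/vmstat. */
int main(void)
{
	char name[64];
	unsigned long long val, migrate = 0, freesc = 0;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "compact_migrate_scanned"))
			migrate = val;
		else if (!strcmp(name, "compact_free_scanned"))
			freesc = val;
	}
	fclose(f);
	if (migrate)
		printf("free/migrate scanned ratio: %.2f\n",
		       (double)freesc / migrate);
	return 0;
}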

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-02-03  9:05   ` Vlastimil Babka
@ 2015-02-03 15:00     ` Joonsoo Kim
  2015-02-03 15:51       ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Joonsoo Kim @ 2015-02-03 15:00 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Linux Memory Management List, LKML, Andrew Morton,
	David Rientjes, Christoph Lameter, Mel Gorman, Michal Nazarewicz,
	Minchan Kim, Naoya Horiguchi, Rik van Riel

2015-02-03 18:05 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 02/03/2015 07:49 AM, Joonsoo Kim wrote:
>> On Mon, Jan 19, 2015 at 11:05:15AM +0100, Vlastimil Babka wrote:
>>
>> Hello,
>>
>> I don't have any elegant idea, but, have some humble opinion.
>>
>> The point is that migrate scanner should scan whole zone.
>> Although your pivot approach makes some sense and it can scan whole zone,
>> it could cause back and forth migration in a very short term whenever
>> both scanners get toward and passed each other.
>
> I don't understand the scenario you suggest? The scanners don't overlap in any
> single run, that doesn't change. If they meet, compaction terminates. They can
> "overlap" if you compare the current run with previous run, after pivot change.

Yeah, I mean this case.

I think that we should regard the whole zone scan as the single run, rather than
just the termination criteria we have artificially defined, and try to avoid the
back and forth problem as much as possible at that scale. The fact that the
scanners don't overlap within a single run, as you mentioned, doesn't solve the
problem at that scale.

> The it's true that e.g. migration scanner will operate on pageblocks where the
> free scanner has operated on previously. But pivot changes are only done after
> the full defer cycle, which is not short term.

I don't think that's necessarily long term. After a successful run, if the next
high order request comes immediately, the migrate scanner will immediately
restart at the position where the free scanner operated previously.

>
>> I think that if we permit
>> overlap of scanner, we don't need to adhere to reverse linear scanning
>> in freepage scanner since reverse liner scan doesn't prevent back and
>> forth migration from now on.
>
> I believe that we still don't permit overlap, but anyway...
>
>> There are two solutions on this problem.
>> One is that free scanner scans pfn in same direction where migrate scanner
>> goes with having proper interval.
>>
>> |=========================|
>> MMM==>  <Interval>  FFF==>
>>
>> Enough interval guarantees to prevent back and forth migration,
>> at least, in a very short period.
>
> That would depend on the termination criteria and what to do after restart.
> You would have to terminate as soon as one scanner approaches the position where
> the other started. Otherwise you overlap and migrate back in a single run. So
> you terminate and that will typically mean one of the scanners did not finish
> its part fully, so there are pageblocks scanned by neither one. You could adjust
> the interval to find the optimal one. But you shouldn't do it immediately next
> run, as that would overlap the previous run too soon. Or maybe adjust it only a
> little... I don't know if that's simpler than my approach, it seems more quirky.

Yeah, the idea comes from a quick thought, so it's not perfect.
In fact, if we regard the whole zone scan as the single run, the back and forth
problem is inevitable. The best we can do is reduce the bad effects of that
problem. With an interval, we don't try to migrate a page which we will
immediately reuse as a free page in a very short period.

I think that we can break the relationship between the free scanner and the
migrate scanner. It's not necessary that the scanned ranges of both scanners
sum to the whole zone. The point is that the migrate scanner should scan the
whole zone. The free scanner would adjust its scanning position based on where
the migrate scanner is, whenever necessary.

>> Or, we could make free scanner totally different with linear scan.
>> Linear scanning to get freepage wastes much time if system memory
>> is really big and most of it is used. If we takes freepage from the
>> buddy, we can eliminate this scanning overhead. With additional
>> logic, that is, comparing position of freepage with migrate scanner
>> and selectively taking it, we can avoid back and forth migration
>> in a very short period.
>
> I think the metric we should be looking is the ration between free pages scanned
> and migrate pages scanned. It's true that after this series in the
> allocate-as-thp scenario, it was more than 10 in the first run after reboot.
> So maybe it could be more efficient to search the buddy lists. But then again,
> when to terminate in this case? The free lists are changing continuously. And to
> compare position, you also need to predict how much the migrate scanner will
> progress in the single run, because you don't want to take pages from there.

We can terminate when the whole zone is scanned. As I mentioned above, we can't
avoid the back and forth problem over a whole zone range, so I'd like us to do
our best to avoid back and forth migration in the very short term. Maybe taking
free pages positioned zone_range/2 away from the migrate scanner at that time
would prevent the problem from occurring in a very short term.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-02-03 15:00     ` Joonsoo Kim
@ 2015-02-03 15:51       ` Vlastimil Babka
  2015-02-03 17:07         ` Joonsoo Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-02-03 15:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Linux Memory Management List, LKML, Andrew Morton,
	David Rientjes, Christoph Lameter, Mel Gorman, Michal Nazarewicz,
	Minchan Kim, Naoya Horiguchi, Rik van Riel

On 02/03/2015 04:00 PM, Joonsoo Kim wrote:
> 2015-02-03 18:05 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>> On 02/03/2015 07:49 AM, Joonsoo Kim wrote:
>>> On Mon, Jan 19, 2015 at 11:05:15AM +0100, Vlastimil Babka wrote:
>>>
>>> Hello,
>>>
>>> I don't have any elegant idea, but, have some humble opinion.
>>>
>>> The point is that migrate scanner should scan whole zone.
>>> Although your pivot approach makes some sense and it can scan whole zone,
>>> it could cause back and forth migration in a very short term whenever
>>> both scanners get toward and passed each other.
>>
>> I don't understand the scenario you suggest? The scanners don't overlap in any
>> single run, that doesn't change. If they meet, compaction terminates. They can
>> "overlap" if you compare the current run with previous run, after pivot change.
> 
> Yeah, I mean this case.
> 
> I think that we should regard single run as whole zone scan rather than just
> terminating criteria we have artificially defined and try to avoid
> back and forth
> problem as much as possible in this scale. Not overlapping in a single run you
> mentioned doesn't solve this problem in this scale.
> 
>> The it's true that e.g. migration scanner will operate on pageblocks where the
>> free scanner has operated on previously. But pivot changes are only done after
>> the full defer cycle, which is not short term.
> 
> I don't think it's not short term. After successful run, if next high
> order request
> comes immediately, migrate scanner will immediately restart at the position
> where previous free scanner has operated.

Ah, I think I see where the misunderstanding comes from now. So to clarify,
let's consider

1. single compaction run - single invocation of compact_zone(). It can start
from cached pfn's from previous run, or zone boundaries (or pivot, after this
series), and terminate with scanners meeting or not meeting.

2. full zone compaction - consists of one or more compaction runs, where the first
run starts at the boundaries (pivot). It ends when the scanners meet -
compact_finished() returns COMPACT_COMPLETE

3. compaction after full defer cycle - this is full zone compaction, where
compaction_restarting() returns true in its first run

My understanding is that you think pivot changing occurs after each full zone
compaction (definition 2), but in fact it occurs only after each defer cycle
(definition 3). See patch 5 for detailed reasoning. I don't think it's short
term. It means full zone compaction (def 2) already failed many times and was
then deferred for a further period, using the same unchanged pivot.

I think any of the alternatives you suggested, where the migrate scanner
processes the whole zone during a full zone compaction (2), would necessarily
result in shorter-term back and forth migration than this scheme. On the other hand,
the pivot changing proposed here might be too long-term. But it's a first
attempt, and the frequency can be further tuned.


^ permalink raw reply	[flat|nested] 22+ messages in thread
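
Since the difference between definitions 2 and 3 hinges on the defer
bookkeeping, a compressed paraphrase of that logic may help. This is a
simplified restatement of the defer/restart helpers in mm/compaction.c of this
era, with the zone fields pulled into a small struct and names shortened; it is
not the verbatim kernel code.

#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

struct zone_defer {
	unsigned long considered;	/* attempts skipped since last failure */
	unsigned int defer_shift;	/* exponential backoff, capped below */
	int order_failed;		/* lowest order that failed */
};

/* Called when a full zone compaction (definition 2) finishes without
 * satisfying the allocation: back off for up to 1 << defer_shift attempts. */
static void defer_compaction(struct zone_defer *z, int order)
{
	z->considered = 0;
	z->defer_shift++;
	if (order < z->order_failed)
		z->order_failed = order;
	if (z->defer_shift > COMPACT_MAX_DEFER_SHIFT)
		z->defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* True once a full defer cycle (definition 3) has elapsed: the backoff is
 * maxed out and all allowed skipped attempts have been consumed, so the next
 * run resets the pageblock skip bits (and, with this series, is also the
 * point where the pivot may move). */
static bool compaction_restarting(const struct zone_defer *z, int order)
{
	if (order < z->order_failed)
		return false;
	return z->defer_shift == COMPACT_MAX_DEFER_SHIFT &&
	       z->considered >= 1UL << z->defer_shift;
}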

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-02-03 15:51       ` Vlastimil Babka
@ 2015-02-03 17:07         ` Joonsoo Kim
  2015-02-04 14:39           ` Vlastimil Babka
  0 siblings, 1 reply; 22+ messages in thread
From: Joonsoo Kim @ 2015-02-03 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Joonsoo Kim, Linux Memory Management List, LKML, Andrew Morton,
	David Rientjes, Christoph Lameter, Mel Gorman, Michal Nazarewicz,
	Minchan Kim, Naoya Horiguchi, Rik van Riel

2015-02-04 0:51 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 02/03/2015 04:00 PM, Joonsoo Kim wrote:
>> 2015-02-03 18:05 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>>> On 02/03/2015 07:49 AM, Joonsoo Kim wrote:
>>>> On Mon, Jan 19, 2015 at 11:05:15AM +0100, Vlastimil Babka wrote:
>>>>
>>>> Hello,
>>>>
>>>> I don't have any elegant idea, but, have some humble opinion.
>>>>
>>>> The point is that migrate scanner should scan whole zone.
>>>> Although your pivot approach makes some sense and it can scan whole zone,
>>>> it could cause back and forth migration in a very short term whenever
>>>> both scanners get toward and passed each other.
>>>
>>> I don't understand the scenario you suggest? The scanners don't overlap in any
>>> single run, that doesn't change. If they meet, compaction terminates. They can
>>> "overlap" if you compare the current run with previous run, after pivot change.
>>
>> Yeah, I mean this case.
>>
>> I think that we should regard single run as whole zone scan rather than just
>> terminating criteria we have artificially defined and try to avoid
>> back and forth
>> problem as much as possible in this scale. Not overlapping in a single run you
>> mentioned doesn't solve this problem in this scale.
>>
>>> The it's true that e.g. migration scanner will operate on pageblocks where the
>>> free scanner has operated on previously. But pivot changes are only done after
>>> the full defer cycle, which is not short term.
>>
>> I don't think it's not short term. After successful run, if next high
>> order request
>> comes immediately, migrate scanner will immediately restart at the position
>> where previous free scanner has operated.
>
> Ah, I think I see where the misunderstanding comes from now. So to clarify,
> let's consider
>
> 1. single compaction run - single invocation of compact_zone(). It can start
> from cached pfn's from previous run, or zone boundaries (or pivot, after this
> series), and terminate with scanners meeting or not meeting.
>
> 2. full zone compaction - consists one or more compaction runs, where the first
> run starts at boundaries (pivot). It ends when scanners meet -
> compact_finished() returns COMPACT_COMPLETE
>
> 3. compaction after full defer cycle - this is full zone compaction, where
> compaction_restarting() returns true in its first run
>
> My understanding is that you think pivot changing occurs after each full zone
> compaction (definition 2), but in fact it occurs only each defer cycle
> (definition 3). See patch 5 for detailed reasoning. I don't think it's short
> term. It means full zone compactions (def 2) already failed many times and then
> was deferred for further time, using the same unchanged pivot.

Ah... thanks for clarifying. I actually thought pivot changing occurs at
definition 2, as you guessed. :)

> I think any of the alternatives you suggested below where migrate scanner
> processes whole zone during full zone compaction (2), would necessarily result
> in shorter-term back and forth migration than this scheme. On the other hand,
> the pivot changing proposed here might be too long-term. But it's a first
> attempt, and the frequency can be further tuned.

Yes, your proposal would be less problematic with regard to the back and forth
problem than my suggestion.

Hmm... nevertheless, I can't completely agree with the pivot approach.

I'd like to take this chance to remove the dependency between the migrate
scanner and the free scanner, such as the termination criteria. The meeting
position of the two scanners is roughly determined by the amount of free memory
in the zone. If 200 MB is free in the zone, the migrate scanner can scan at most
200 MB past the start pfn of the pivot. Without changing the pivot quickly, we
can scan only this region regardless of zone size, so it has a bad effect on
high order allocation for a long time.

In the stress-highalloc test, it doesn't matter much, since we attempt a lot of
allocations, so this bad effect would not appear easily. Although the attempts
in the middle would fail, later attempts would succeed, since the pivot would be
changed in the middle of the attempts.

But, in a real world scenario, every allocation attempt is precious and it'd be
better for the first high order allocation request to succeed; this is a
different problem from the allocation success rate in the stress-highalloc test.
To accomplish it, we need to change the pivot as soon as possible. Without that,
we could miss some precious allocation attempts until the pivot is changed. For
this purpose, we should remove the defer logic or make it looser, and then
resetting the pivot would occur sooner, so we could encounter the back and forth
problem frequently.

Therefore, it's better to change compaction logic more fundamentally.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread
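
The 200 MB figure above is worth putting next to a typical zone size. A
back-of-the-envelope C snippet with a hypothetical 8 GiB zone shows how small
the reachable region can be when the pivot never moves:

#include <stdio.h>

int main(void)
{
	unsigned long zone_mib = 8192;	/* hypothetical 8 GiB zone */
	unsigned long free_mib = 200;	/* free memory, per the example above */

	/* Roughly the fraction of the zone the migration scanner can cover
	 * before the scanners meet: ~2.4% here. */
	printf("reachable fraction ~ %.1f%%\n", 100.0 * free_mib / zone_mib);
	return 0;
}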

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-02-03 17:07         ` Joonsoo Kim
@ 2015-02-04 14:39           ` Vlastimil Babka
  2015-02-10  8:54             ` Joonsoo Kim
  0 siblings, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2015-02-04 14:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Linux Memory Management List, LKML, Andrew Morton,
	David Rientjes, Christoph Lameter, Mel Gorman, Michal Nazarewicz,
	Minchan Kim, Naoya Horiguchi, Rik van Riel

On 02/03/2015 06:07 PM, Joonsoo Kim wrote:
> 2015-02-04 0:51 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
>> Ah, I think I see where the misunderstanding comes from now. So to clarify,
>> let's consider
>>
>> 1. single compaction run - single invocation of compact_zone(). It can start
>> from cached pfn's from previous run, or zone boundaries (or pivot, after this
>> series), and terminate with scanners meeting or not meeting.
>>
>> 2. full zone compaction - consists one or more compaction runs, where the first
>> run starts at boundaries (pivot). It ends when scanners meet -
>> compact_finished() returns COMPACT_COMPLETE
>>
>> 3. compaction after full defer cycle - this is full zone compaction, where
>> compaction_restarting() returns true in its first run
>>
>> My understanding is that you think pivot changing occurs after each full zone
>> compaction (definition 2), but in fact it occurs only each defer cycle
>> (definition 3). See patch 5 for detailed reasoning. I don't think it's short
>> term. It means full zone compactions (def 2) already failed many times and then
>> was deferred for further time, using the same unchanged pivot.
>
> Ah... thanks for clarifying. I actually think pivot changing occurs at
> definition 2
> as you guess. :)

Great it's clarified :)

>> I think any of the alternatives you suggested below where migrate scanner
>> processes whole zone during full zone compaction (2), would necessarily result
>> in shorter-term back and forth migration than this scheme. On the other hand,
>> the pivot changing proposed here might be too long-term. But it's a first
>> attempt, and the frequency can be further tuned.
>
> Yes, your proposal would be less problematic on back and forth problem than
> my suggestion.
>
> Hmm...nevertheless, I can't completely agree with pivot approach.
>
> I'd like to remove dependency of migrate scanner and free scanner such as
> termination criteria at this chance. Meeting position of both scanner is roughly

Well at some point compaction should terminate if it scanned the whole 
zone, and failed. How else to check that than using the scanner positions?

> determined by on amount of free memory in the zone. If 200 MB is free in
> the zone, migrate scanner can scan at maximum 200 MB from the start pfn
> of the pivot. Without changing pivot quickly, we can scan only
> this region regardless zone size so it cause bad effect to high order
> allocation for a long time.
>
> In stress-highalloc test, it doesn't matter since we try to attempt a lot of
> allocations. This bad effect would not appear easily. Although middle of
> allocation attempts are failed, latter attempts would succeed
> since pivot would be changed in the middle of attempts.

OK, that might be true. It's not a perfect benchmark.

> But, in real world scenario, all allocation attempts are precise and
> it'd be better
> first come high order allocation request to succeed and this is another problem
> than allocation success rate in stress-highalloc test. To accomplish it, we
> need to change pivot as soon as possible. Without it, we could miss some
> precise allocation attempt until pivot is changed. For this purpose, we should
> remove defer logic or change it more loosely and then, resetting pivot would
> occur soon so we could encounter back and forth problem frequently.

It seems to me that you can't have both the "migration scanner should 
try scanning whole zone during single compaction (or during relatively 
few attempts)" and "we shouldn't migrate pages that we have just 
(relatively recently) migrated", in any scheme including the two you 
proposed in previous mail. These features just go against each other.

In any scheme you should divide the zone between the part that's scanned for 
pages to migrate from, and the part that's scanned for free pages as migration 
targets. If you don't divide, then you end up migrating back and forth 
instantly, which would be bad.

So what happens once you don't have any free pages left in the part that was 
for the free scanner (this is what happens in the current scheme when the 
scanners meet)? If you wanted to continue with the migration scanner, 
the only free pages are in the part which the migration scanner just 
processed. And funnily enough, the pivot changing scheme will put the 
free scanner just in the position to scan this part. But doing that 
immediately could mean excessive migration.

Your alternative scheme, where the free scanner follows the migration
scanner at some distance, is not very different in this sense. If you
manage to tune the distance properly, you will also be scanning for
free pages in the part that was just processed by the migration
scanner. It might be more efficient in that you don't rescan the part
that the migration scanner hasn't reached either before or after the
pivot change. But fundamentally, it means migrating pages that were
recently migrated.

> Therefore, it's better to change compaction logic more fundamentally.

Maybe it's indeed better to migrate excessively than to keep rescanning
the same pageblocks and then defer compaction. But we shouldn't forget
that the immediate success rate is not the only criterion. We should
also keep the overhead sane. That's why there's the pageblock
suitability bitfield, deferred compaction etc., and I'm not sure how
those would fit into a "continuously progressing migration scanner"
scheme.

So here's what I think should precede such an increase in compaction
aggressivity:
- on direct compaction, only try to migrate when we have successfully
isolated all the pages needed for merging the desired order page (see
the rough sketch after this list). I've had such a patch in one series
last year already, but it affected the anti-fragmentation effects of
compaction.
- no more THP page faults (also for other good reasons); leave
collapsing to khugepaged, or rather task_work, leaving only the
expensive sync compaction to dedicated per-node daemons. These should
hopefully solve the anti-fragmentation issue as well.
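To illustrate the first point, a very rough sketch of the idea (just an
illustration, not the actual patch from last year; is_direct_compaction
and whole_block_isolated are placeholders, not existing code):

	/*
	 * If some in-use page within the order-aligned block could not
	 * be isolated, migrating the pages we did isolate can never
	 * produce the free page of the requested order, so give them
	 * back instead of migrating them.
	 */
	if (is_direct_compaction && !whole_block_isolated) {
		putback_movable_pages(&cc->migratepages);
		cc->nr_migratepages = 0;
	}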

> Thanks.
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [RFC PATCH 0/5] compaction: changing initial position of scanners
  2015-02-04 14:39           ` Vlastimil Babka
@ 2015-02-10  8:54             ` Joonsoo Kim
  0 siblings, 0 replies; 22+ messages in thread
From: Joonsoo Kim @ 2015-02-10  8:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linux Memory Management List, LKML, Andrew Morton,
	David Rientjes, Christoph Lameter, Mel Gorman, Michal Nazarewicz,
	Minchan Kim, Naoya Horiguchi, Rik van Riel

On Wed, Feb 04, 2015 at 03:39:25PM +0100, Vlastimil Babka wrote:
> On 02/03/2015 06:07 PM, Joonsoo Kim wrote:
> >2015-02-04 0:51 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> >>Ah, I think I see where the misunderstanding comes from now. So to clarify,
> >>let's consider
> >>
> >>1. single compaction run - single invocation of compact_zone(). It can start
> >>from cached pfn's from previous run, or zone boundaries (or pivot, after this
> >>series), and terminate with scanners meeting or not meeting.
> >>
> >>2. full zone compaction - consists one or more compaction runs, where the first
> >>run starts at boundaries (pivot). It ends when scanners meet -
> >>compact_finished() returns COMPACT_COMPLETE
> >>
> >>3. compaction after full defer cycle - this is full zone compaction, where
> >>compaction_restarting() returns true in its first run
> >>
> >>My understanding is that you think pivot changing occurs after each full zone
> >>compaction (definition 2), but in fact it occurs only each defer cycle
> >>(definition 3). See patch 5 for detailed reasoning. I don't think it's short
> >>term. It means full zone compactions (def 2) already failed many times and then
> >>was deferred for further time, using the same unchanged pivot.
> >
> >Ah... thanks for clarifying. I actually think pivot changing occurs at
> >definition 2
> >as you guess. :)
> 
> Great it's clarified :)
> 
> >>I think any of the alternatives you suggested below where migrate scanner
> >>processes whole zone during full zone compaction (2), would necessarily result
> >>in shorter-term back and forth migration than this scheme. On the other hand,
> >>the pivot changing proposed here might be too long-term. But it's a first
> >>attempt, and the frequency can be further tuned.
> >
> >Yes, your proposal would be less problematic on back and forth problem than
> >my suggestion.
> >
> >Hmm...nevertheless, I can't completely agree with pivot approach.
> >
> >I'd like to remove dependency of migrate scanner and free scanner such as
> >termination criteria at this chance. Meeting position of both scanner is roughly
> 
> Well, at some point compaction should terminate once it has scanned
> the whole zone and failed. How else can we check that than by using
> the scanner positions?

We could count the number of pageblocks we have scanned and terminate
once it matches the zone span.
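Something like the following rough sketch, just to illustrate
(nr_scanned_blocks would be a new compact_control field, the name is
made up):

	/*
	 * Hypothetical termination check: stop once the migrate scanner
	 * has covered as many pageblocks as the zone spans, regardless
	 * of where the free scanner currently is.
	 */
	unsigned long zone_blocks = DIV_ROUND_UP(zone->spanned_pages,
						 pageblock_nr_pages);

	if (cc->nr_scanned_blocks >= zone_blocks)
		return COMPACT_COMPLETE;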

> >determined by on amount of free memory in the zone. If 200 MB is free in
> >the zone, migrate scanner can scan at maximum 200 MB from the start pfn
> >of the pivot. Without changing pivot quickly, we can scan only
> >this region regardless zone size so it cause bad effect to high order
> >allocation for a long time.
> >
> >In stress-highalloc test, it doesn't matter since we try to attempt a lot of
> >allocations. This bad effect would not appear easily. Although middle of
> >allocation attempts are failed, latter attempts would succeed
> >since pivot would be changed in the middle of attempts.
> 
> OK, that might be true. It's not a perfect benchmark.
> 
> >But, in real world scenario, all allocation attempts are precise and
> >it'd be better
> >first come high order allocation request to succeed and this is another problem
> >than allocation success rate in stress-highalloc test. To accomplish it, we
> >need to change pivot as soon as possible. Without it, we could miss some
> >precise allocation attempt until pivot is changed. For this purpose, we should
> >remove defer logic or change it more loosely and then, resetting pivot would
> >occur soon so we could encounter back and forth problem frequently.
> 
> It seems to me that you can't have both "the migration scanner should
> try scanning the whole zone during a single compaction (or during
> relatively few attempts)" and "we shouldn't migrate pages that we have
> just (relatively recently) migrated", in any scheme, including the two
> you proposed in the previous mail. These features simply go against
> each other.
> 
> In any scheme you have to divide the zone between a part that's
> scanned for pages to migrate from, and a part that's scanned for free
> pages to use as migration targets. If you don't divide it, you end up
> migrating back and forth instantly, which would be bad.

Okay. My proposal isn't perfect, it was just a quick thought. :)
I hope it is a seed for a better idea, not the final solution.
It may be true that we need to divide the zone into a region for each
scanner.

> 
> So what happens once there are no free pages left in the part that
> belonged to the free scanner (which is what happens in the current
> scheme when the scanners meet)? If you wanted to continue with the
> migration scanner, the only free pages are in the part which the
> migration scanner has just processed. And funnily enough, the pivot
> changing scheme will put the free scanner in just the right position
> to scan this part. But doing that immediately could mean excessive
> migration.
> 
> Your alternative scheme, where the free scanner follows the migration
> scanner at some distance, is not very different in this sense. If you
> manage to tune the distance properly, you will also be scanning for
> free pages in the part that was just processed by the migration
> scanner. It might be more efficient in that you don't rescan the part
> that the migration scanner hasn't reached either before or after the
> pivot change. But fundamentally, it means migrating pages that were
> recently migrated.

What worries me about the pivot approach is that if the pivot is changed
frequently, due to a burst of failed high order allocations or because
we decide to make compaction more aggressive, immediate back and forth
migration could happen. Isn't that the problem we need to consider here?

One of the main differences of my approach is the direction of the free
scanner: if it scans in the same direction as the migration scanner, we
can guarantee that recently migrated pages are migrated again as late
as possible. With the reverse direction, it's not possible to keep this
order. See the following example.

PIVOT (REVERSE DIRECTION)
|---1---2---3-----|------------------|
=> after compaction
|-----------------|---3---2---1------|
=> after the pivot changes, page #3, which was migrated most recently,
is migrated first.

SAME DIRECTION
|---1---2---3-----|------------------|
=> after compaction
|-----------------|---1---2---3------|
=> we can keep the order of migratable pages.

I know that if we want to scan the whole zone range, we can't avoid
back and forth migration. The best we can do is something like this
that preserves the order of migration, and that is what my proposal
aims at.

Thanks.

> >Therefore, it's better to change compaction logic more fundamentally.
> 
> Maybe it's indeed better to migrate excessively than to keep
> rescanning the same pageblocks and then defer compaction. But we
> shouldn't forget that the immediate success rate is not the only
> criterion. We should also keep the overhead sane. That's why there's
> the pageblock suitability bitfield, deferred compaction etc., and I'm
> not sure how those would fit into a "continuously progressing
> migration scanner" scheme.
> 
> So here's what I think should precede such an increase in compaction
> aggressivity:
> - on direct compaction, only try to migrate when we have successfully
> isolated all the pages needed for merging the desired order page.
> I've had such a patch in one series last year already, but it
> affected the anti-fragmentation effects of compaction.
> - no more THP page faults (also for other good reasons); leave
> collapsing to khugepaged, or rather task_work, leaving only the
> expensive sync compaction to dedicated per-node daemons. These should
> hopefully solve the anti-fragmentation issue as well.
> 
> >Thanks.
> >
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-02-10  8:52 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-19 10:05 [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
2015-01-19 10:05 ` [PATCH 1/5] mm, compaction: more robust check for scanners meeting Vlastimil Babka
2015-01-20 13:26   ` Zhang Yanfei
2015-01-19 10:05 ` [PATCH 2/5] mm, compaction: simplify handling restart position in free pages scanner Vlastimil Babka
2015-01-20 13:27   ` Zhang Yanfei
2015-01-20 13:31     ` Vlastimil Babka
2015-01-19 10:05 ` [PATCH 3/5] mm, compaction: encapsulate resetting cached scanner positions Vlastimil Babka
2015-01-20 13:30   ` Zhang Yanfei
2015-01-22 15:22     ` Vlastimil Babka
2015-01-19 10:05 ` [RFC PATCH 4/5] mm, compaction: allow scanners to start at any pfn within the zone Vlastimil Babka
2015-01-20 15:18   ` Zhang Yanfei
2015-01-22 15:24     ` Vlastimil Babka
2015-01-19 10:05 ` [RFC PATCH 5/5] mm, compaction: set pivot pfn to the pfn when scanners met last time Vlastimil Babka
2015-01-19 10:10 ` [RFC PATCH 0/5] compaction: changing initial position of scanners Vlastimil Babka
2015-01-20  9:02 ` Vlastimil Babka
2015-02-03  6:49 ` Joonsoo Kim
2015-02-03  9:05   ` Vlastimil Babka
2015-02-03 15:00     ` Joonsoo Kim
2015-02-03 15:51       ` Vlastimil Babka
2015-02-03 17:07         ` Joonsoo Kim
2015-02-04 14:39           ` Vlastimil Babka
2015-02-10  8:54             ` Joonsoo Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).