* [RFC PATCH 0/5] Memory compaction efficiency improvements
@ 2013-11-25 14:26 Vlastimil Babka
  2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

The broad goal of the series is to improve allocation success rates for huge
pages through memory compaction, while trying not to increase the compaction
overhead. The original objective was to reintroduce capturing of high-order
pages freed by compaction before they are split by concurrent activity.
However, several bugs and opportunities for simple improvements were found in
the current implementation, mostly through extra tracepoints (which are,
however, currently too ugly to be considered for submission).

The patches mostly deal with two mechanisms that reduce compaction overhead:
caching the progress of the migrate and free scanners, and marking pageblocks
where isolation failed so that they are skipped during further scans.
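
For orientation, the per-zone state both mechanisms rely on looks roughly like
this (a simplified, kernel-style sketch; the field names are the ones used in
the patches below, everything else is omitted):

struct zone {
	/* ... */
	/* cached positions where the migrate/free scanners should restart */
	unsigned long	compact_cached_migrate_pfn;
	unsigned long	compact_cached_free_pfn;
	/* set when scanners meet; kswapd then clears pageblock skip bits */
	bool		compact_blockskip_flush;
	/* deferred compaction bookkeeping (see patch 1) */
	unsigned int	compact_considered;
	unsigned int	compact_defer_shift;
	int		compact_order_failed;
	/* ... */
};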

Patch 1 encapsulates some of the functionality for handling deferred compaction
        for better maintainability, without a functional change.

Patch 2 fixes a bug where cached scanner pfn's are sometimes reset only after
        they have been read to initialize a compaction run.

Patch 3 fixes a bug where scanners meeting is sometimes not properly detected
        and can lead to multiple compaction attempts quitting early without
        doing any work.

Patch 4 improves the chances that sync compaction will process pageblocks that
        async compaction has skipped due to being !MIGRATE_MOVABLE.

Patch 5 improves the chances that sync direct compaction will actually do
        anything when called after async compaction fails in the allocation
        slowpath (a rough sketch of this ordering follows below).
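
A rough standalone model (userspace C, not the actual mainline code) of the
async-then-sync ordering in the allocation slowpath that patch 5 targets:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for a direct compaction attempt; 'sync' mirrors cc->sync */
static bool compact_attempt(bool sync)
{
	/* pretend async compaction fails and sync compaction succeeds */
	return sync;
}

int main(void)
{
	/* first attempt: async compaction, cheap, allowed to skip pageblocks */
	if (compact_attempt(false)) {
		puts("allocated after async compaction");
		return 0;
	}

	/* direct reclaim would run here */

	/* second attempt: sync compaction, expected to try harder */
	if (compact_attempt(true)) {
		puts("allocated after sync compaction");
		return 0;
	}

	puts("allocation failed");
	return 1;
}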


Some preliminary results with the mmtests stress-highalloc benchmark on an
x86_64 machine with 4GB of memory. First, the default GFP_HIGHUSER_MOVABLE
allocations, with the patches stacked on top of mainline master as of Friday
(commit a5d6e633, merging fixes from Andrew). Patch 1 can serve as the baseline
since it makes no functional change. Comments below.

stress-highalloc
                         master                master                master                master                master
                        1-nothp               2-nothp               3-nothp               4-nothp               5-nothp
Success 1       34.00 (  0.00%)       20.00 ( 41.18%)       44.00 (-29.41%)       45.00 (-32.35%)       25.00 ( 26.47%)
Success 2       31.00 (  0.00%)       21.00 ( 32.26%)       47.00 (-51.61%)       47.00 (-51.61%)       28.00 (  9.68%)
Success 3       68.00 (  0.00%)       88.00 (-29.41%)       86.00 (-26.47%)       87.00 (-27.94%)       88.00 (-29.41%)

              master      master      master      master      master
             1-nothp     2-nothp     3-nothp     4-nothp     5-nothp
User         6334.04     6343.09     5938.15     5860.00     6674.38
System       1044.15     1035.84     1022.68     1021.11     1055.76
Elapsed      1787.06     1714.76     1829.14     1850.91     1789.83

                                master      master      master      master      master
                               1-nothp     2-nothp     3-nothp     4-nothp     5-nothp
Minor Faults                 248365069   244975796   247192462   243720231   248888409
Major Faults                       427         442         563         504         414
Swap Ins                             7           3           8           7           0
Swap Outs                          345         338         570         235         415
Direct pages scanned            239929      166220      276238      277310      202409
Kswapd pages scanned           1759082     1819998     1880477     1850421     1809928
Kswapd pages reclaimed         1756781     1813653     1877783     1847704     1806347
Direct pages reclaimed          239291      165988      276163      277048      202092
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity                984.344    1061.372    1028.066     999.736    1011.229
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                134.259      96.935     151.021     149.824     113.088
Percentage direct scans            12%          8%         12%         13%         10%
Zone normal velocity           362.126     440.499     374.597     354.049     360.196
Zone dma32 velocity            756.478     717.808     804.490     795.511     764.122
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         450.000     476.000     570.000     306.000     639.000
Page writes file                   105         138           0          71         224
Page writes anon                   345         338         570         235         415
Page reclaim immediate             660        4407         167         843        1553
Sector Reads                   2734844     2725576     2951744     2830472     2791216
Sector Writes                 11938520    11729108    11769760    11743120    11805320
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1596544     1520768     1767552     1774720     1555584
Direct inode steals               9764        6640       14010       15320        8315
Kswapd inode steals              47445       42888       49705       51043       43283
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     78          30          43          34          31
THP collapse alloc                 485         371         570         559         306
THP splits                           6           1           2           4           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                   13          16          11          12          16
Compaction stalls                 1067        1072        1629        1578        1140
Compaction success                 339         275         568         595         329
Compaction failures                728         797        1061         983         811
Page migrate success           1115929     1113188     3966997     4076178     4220010
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      2423867     2425024     8351264     8583856     8789144
Compaction migrate scanned    38956505    62526876   153906340   174085307   114170442
Compaction free scanned       83126040    51071610   396724121   358193857   389459415
Compaction cost                   1477        1639        5353        5612        5346
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

Observations:
- The "Success 3" line is allocation success rate with system idle (phases 1
  and 2 are with background interference). I used to get values around 85%
  with vanilla 3.11 and observed occasional drop to around 65% in 3.12, with
  about 50% chance. This was bisected to commit 81c0a2bb ("mm: page_alloc:
  fair zone allocator policy") using 10 repeats of the benchmark and marking
  as 'bad' a commit as long as the bad result appeared at least once (to fight
  the uncertainty). As explained in comment for patch 3, I don't think the
  commit is wrong, but that it makes the effect of bugs worse. From patch 3
  onwards, the results are OK. Here it might seem that patch 2 helps, but
  that's just the uncertainty. I plan to add support for more iterations and
  statistical summarizing of the results to fight this...
- It might seem that patch 5 regresses phases 1 and 2, but since that was not
  the case when testing against 3.12, I would say it's just another case of
  unstable results. Phases 1 and 2 are more prone to that in general. However,
  I have never seen unpatched 3.11 or 3.12 go above 40% as patch 3 does.
- Compaction cost and the number of scanned pages are higher, especially due
  to patch 3. However, keep in mind that patches 2 and 3 fix existing bugs in
  the current design of overhead mitigation; they do not change it. If the
  overhead is found unacceptable, then it should be decreased differently (and
  consistently, not depending on random conditions) than the current
  implementation does. In contrast, patches 4 and 5 (which are not strictly
  bug fixes) do not increase the overhead (but do not improve success rates
  either).

Another set of preliminary results comes from configuring stress-highalloc to
allocate with flags similar to those THP uses:
 (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

stress-highalloc
                         master                master                master                master                master
                          1-thp                 2-thp                 3-thp                 4-thp                 5-thp
Success 1       29.00 (  0.00%)        7.00 ( 75.86%)       25.00 ( 13.79%)       32.00 (-10.34%)       32.00 (-10.34%)
Success 2       30.00 (  0.00%)        7.00 ( 76.67%)       29.00 (  3.33%)       34.00 (-13.33%)       37.00 (-23.33%)
Success 3       70.00 (  0.00%)       70.00 (  0.00%)       85.00 (-21.43%)       85.00 (-21.43%)       85.00 (-21.43%)

              master      master      master      master      master
               1-thp       2-thp       3-thp       4-thp       5-thp
User         5915.36     6769.19     6350.04     6421.90     6571.80
System       1017.80     1053.70     1039.06     1051.84     1061.59
Elapsed      1757.87     1724.31     1744.66     1822.78     1841.42

                                master      master      master      master      master
                                 1-thp       2-thp       3-thp       4-thp       5-thp
Minor Faults                 246004967   248169249   244469991   248893104   245151725
Major Faults                       403         282         354         369         436
Swap Ins                             8           8          10           7           8
Swap Outs                          534         530         325         694         687
Direct pages scanned            106122       76339      168386      202576      170449
Kswapd pages scanned           1924013     1803706     1855293     1872408     1907170
Kswapd pages reclaimed         1920762     1800403     1852989     1869573     1904070
Direct pages reclaimed          105986       76291      168183      202440      170343
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity               1094.514    1046.045    1063.412    1027.227    1035.706
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                 60.370      44.272      96.515     111.136      92.564
Percentage direct scans             5%          4%          8%          9%          8%
Zone normal velocity           362.047     386.497     361.529     371.628     369.295
Zone dma32 velocity            792.836     703.820     798.398     766.734     758.975
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         741.000     751.000     325.000     694.000     924.000
Page writes file                   207         221           0           0         237
Page writes anon                   534         530         325         694         687
Page reclaim immediate             895         856         479         396         512
Sector Reads                   2769992     2627604     2735740     2828672     2836412
Sector Writes                 11748724    11660652    11598304    11800576    11753996
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1485952     1233024     1457280     1492096     1544320
Direct inode steals               2565         537        3384        6389        3205
Kswapd inode steals              50112       42207       46892       45371       49542
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     28           2          23          31          28
THP collapse alloc                 485         276         417         539         514
THP splits                           0           0           0           2           3
THP fault fallback                   0           0           0           0           0
THP collapse fail                   13          19          17          12          12
Compaction stalls                  813         474         964        1052        1050
Compaction success                 332          92         359         434         411
Compaction failures                481         382         605         617         639
Page migrate success            582816      359101      973579      950980     1085585
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      1327894      806679     2256066     2195431     2461078
Compaction migrate scanned    13244945     7977159    21513942    23189436    30051866
Compaction free scanned       35192520    19254827    76152850    71159488    77702117
Compaction cost                    722         443        1204        1191        1383
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

                      master      master      master      master      master
                       1-thp       2-thp       3-thp       4-thp       5-thp
Mean sda-avgqz         46.01       46.31       46.43       46.87       45.94
Mean sda-await        271.19      273.75      273.84      270.12      269.69
Mean sda-r_await       35.33       35.52       34.26       33.98       33.61
Mean sda-w_await      474.54      497.59      603.64      567.32      488.48
Max  sda-avgqz        158.33      168.62      166.68      165.51      165.82
Max  sda-await       1461.41     1374.49     1380.31     1427.35     1402.61
Max  sda-r_await      197.46      286.67      112.65      112.07      158.24
Max  sda-w_await     9986.97    11363.36    16119.59    12365.75    11706.65

There are some differences from the previous results for THP-like allocations:
 - Here, the bad result for the unpatched kernel in phase 3 is much more
   consistently between 65-70% and is not due to the "regression" in 3.12.
   Still, there is the improvement from patch 3 onwards, which brings it on par
   with the simple GFP_HIGHUSER_MOVABLE allocations.
 - Patch 2 is again not a real regression; the difference is due to result
   variability.
 - The compaction overhead with patches 2 and 3 is higher, and the arguments
   are the same as above.
 - Patch 5 increases the number of migrate-scanned pages significantly. This
   is most likely due to the __GFP_NO_KSWAPD flag, which means the cached pfn's
   are not reset by kswapd, so the patch helps the sync-after-async compaction.
   It doesn't, however, show that sync compaction helps with success rates. One
   of the further patches I'm considering for future versions is to ignore or
   clear the pageblock skip information for sync compaction. But in that case,
   THP clearly should be changed so that it does not fall back to sync
   compaction.




Vlastimil Babka (5):
  mm: compaction: encapsulate defer reset logic
  mm: compaction: reset cached scanner pfn's before reading them
  mm: compaction: detect when scanners meet in isolate_freepages
  mm: compaction: do not mark unmovable pageblocks as skipped in async
    compaction
  mm: compaction: reset scanner positions immediately when they meet

 include/linux/compaction.h | 12 +++++++++++
 mm/compaction.c            | 53 ++++++++++++++++++++++++++++++----------------
 mm/page_alloc.c            |  5 +----
 3 files changed, 48 insertions(+), 22 deletions(-)

-- 
1.8.1.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] mm: compaction: encapsulate defer reset logic
  2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
@ 2013-11-25 14:26 ` Vlastimil Babka
  2013-11-25 22:08   ` Rik van Riel
  2013-11-26 10:16   ` Mel Gorman
  2013-11-25 14:26 ` [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them Vlastimil Babka
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

Currently there are several functions to manipulate the deferred compaction
state variables. The remaining case where the variables are touched directly
is when a successful allocation occurs in direct compaction, or is expected
to be successful in the future by kswapd. Here, the lowest order that is
expected to fail is updated, and in the case of direct compaction, the deferred
status is reset completely.

Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code. No functional change.
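
For illustration, a standalone userspace model of the defer bookkeeping and the
new helper (the helper matches the patch below; the other two functions only
paraphrase the existing kernel logic):

#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

struct zone_model {
	unsigned int compact_considered;
	unsigned int compact_defer_shift;
	int compact_order_failed;
};

/* called when compaction failed to help an allocation of 'order' */
void defer_compaction(struct zone_model *zone, int order)
{
	zone->compact_considered = 0;
	if (zone->compact_defer_shift < COMPACT_MAX_DEFER_SHIFT)
		zone->compact_defer_shift++;
	if (order < zone->compact_order_failed)
		zone->compact_order_failed = order;
}

/* should a new compaction attempt of 'order' be skipped for now? */
bool compaction_deferred(struct zone_model *zone, int order)
{
	unsigned long defer_limit = 1UL << zone->compact_defer_shift;

	if (order < zone->compact_order_failed)
		return false;
	return ++zone->compact_considered < defer_limit;
}

/* the new helper: called on (expected) allocation success */
void compaction_defer_reset(struct zone_model *zone, int order,
			    bool reset_shift)
{
	if (reset_shift) {
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;
}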

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/compaction.h | 12 ++++++++++++
 mm/compaction.c            |  9 ++++-----
 mm/page_alloc.c            |  5 +----
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e..da39978 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -62,6 +62,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 	return zone->compact_considered < defer_limit;
 }
 
+/* Update defer tracking counters after successful allocation of given order */
+static inline void compaction_defer_reset(struct zone *zone, int order,
+		bool reset_shift)
+{
+	if (reset_shift) {
+		zone->compact_considered = 0;
+		zone->compact_defer_shift = 0;
+	}
+	if (order >= zone->compact_order_failed)
+		zone->compact_order_failed = order + 1;
+}
+
 /* Returns true if restarting compaction after many failures */
 static inline bool compaction_restarting(struct zone *zone, int order)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 805165b..7c0073e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1116,12 +1116,11 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 			compact_zone(zone, cc);
 
 		if (cc->order > 0) {
-			int ok = zone_watermark_ok(zone, cc->order,
-						low_wmark_pages(zone), 0, 0);
-			if (ok && cc->order >= zone->compact_order_failed)
-				zone->compact_order_failed = cc->order + 1;
+			if (zone_watermark_ok(zone, cc->order,
+						low_wmark_pages(zone), 0, 0))
+				compaction_defer_reset(zone, cc->order, false);
 			/* Currently async compaction is never deferred. */
-			else if (!ok && cc->sync)
+			else if (cc->sync)
 				defer_compaction(zone, cc->order);
 		}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580a5f0..50c7f67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2243,10 +2243,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				preferred_zone, migratetype);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
-			preferred_zone->compact_considered = 0;
-			preferred_zone->compact_defer_shift = 0;
-			if (order >= preferred_zone->compact_order_failed)
-				preferred_zone->compact_order_failed = order + 1;
+			compaction_defer_reset(preferred_zone, order, true);
 			count_vm_event(COMPACTSUCCESS);
 			return page;
 		}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them
  2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
  2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
@ 2013-11-25 14:26 ` Vlastimil Babka
  2013-11-26 10:23   ` Mel Gorman
  2013-11-26 13:16   ` Rik van Riel
  2013-11-25 14:26 ` [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages Vlastimil Babka
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

Compaction caches pfn's for its migrate and free scanners to avoid scanning
the whole zone each time. In compact_zone(), the cached values are read to
set up initial values for the scanners. There are several situations when
these cached pfn's are reset to the first and last pfn of the zone,
respectively. One of these situations is when a compaction has been deferred
for a zone and is now being restarted during a direct compaction, which is also
done in compact_zone().

However, compact_zone() currently reads the cached pfn's *before* resetting
them. This means the reset doesn't affect the compaction run that performs it,
and quite likely not subsequent runs either, as update_pageblock_skip() is
likely to be called and update the cached pfn's to those being processed.
Another chance for a successful reset is when a direct compaction detects that
the migration and free scanners meet (which has its own problems addressed by
another patch) and sets the compact_blockskip_flush flag, which kswapd acts
upon to do the reset when it goes to sleep.

This is clearly a bug that results in non-deterministic behavior, so this patch
moves the cached pfn reset to be performed *before* the values are read.
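
A tiny standalone illustration (userspace C, not the kernel code) of why the
ordering matters: with the reset done after the read, the compaction run that
triggered the reset still starts from the stale cached values:

#include <stdio.h>

struct zone_model {
	unsigned long start_pfn, end_pfn;
	unsigned long cached_migrate_pfn, cached_free_pfn;
};

static void reset_cached_pfns(struct zone_model *z)
{
	z->cached_migrate_pfn = z->start_pfn;
	z->cached_free_pfn = z->end_pfn;
}

int main(void)
{
	struct zone_model z = {
		.start_pfn = 0, .end_pfn = 1 << 20,
		/* stale values from a previous run, near the meeting point */
		.cached_migrate_pfn = 500000, .cached_free_pfn = 500512,
	};
	unsigned long migrate_pfn, free_pfn;

	/* buggy order: read the cached values first ... */
	migrate_pfn = z.cached_migrate_pfn;
	free_pfn = z.cached_free_pfn;
	/* ... and only then reset them; this run never sees the reset */
	reset_cached_pfns(&z);

	printf("this run scans %lu..%lu instead of %lu..%lu\n",
	       migrate_pfn, free_pfn, z.start_pfn, z.end_pfn);
	return 0;
}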

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7c0073e..6a2f0c2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -943,6 +943,14 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	}
 
 	/*
+	 * Clear pageblock skip if there were failures recently and compaction
+	 * is about to be retried after being deferred. kswapd does not do
+	 * this reset as it'll reset the cached information when going to sleep.
+	 */
+	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
+		__reset_isolation_suitable(zone);
+
+	/*
 	 * Setup to move all movable pages to the end of the zone. Used cached
 	 * information on where the scanners should start but check that it
 	 * is initialised by ensuring the values are within zone boundaries.
@@ -958,14 +966,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		zone->compact_cached_migrate_pfn = cc->migrate_pfn;
 	}
 
-	/*
-	 * Clear pageblock skip if there were failures recently and compaction
-	 * is about to be retried after being deferred. kswapd does not do
-	 * this reset as it'll reset the cached information when going to sleep.
-	 */
-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
-		__reset_isolation_suitable(zone);
-
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages
  2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
  2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
  2013-11-25 14:26 ` [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them Vlastimil Babka
@ 2013-11-25 14:26 ` Vlastimil Babka
  2013-11-26 10:45   ` Mel Gorman
  2013-11-25 14:26 ` [PATCH 4/5] mm: compaction: do not mark unmovable pageblocks as skipped in async compaction Vlastimil Babka
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

Compaction of a zone is finished when the migrate scanner (which begins at the
zone's lowest pfn) meets the free page scanner (which begins at the zone's
highest pfn). This is detected in compact_zone() and in the case of direct
compaction, the compact_blockskip_flush flag is set so that kswapd later resets
the cached scanner pfn's, and a new compaction may again start at the zone's
borders.

The meeting of the scanners can happen during either scanner's activity.
However, it may currently fail to be detected when it occurs in the free page
scanner, due to two problems. First, isolate_freepages() keeps free_pfn at the
highest block where it isolated pages from, so as not to miss the pages that
are returned back to the allocator when migration fails. Second, failing
to isolate enough free pages due to scanners meeting results in -ENOMEM being
returned by migrate_pages(), which makes compact_zone() bail out immediately
without calling compact_finished() that would detect scanners meeting.

This failure to detect scanners meeting might result in repeated attempts at
compaction of a zone that keep starting from the cached pfn's close to the
meeting point, and quickly failing through the -ENOMEM path, without the cached
pfns being reset, over and over. This has been observed (through additional
tracepoints) in the third phase of the mmtests stress-highalloc benchmark, where
the allocator runs on an otherwise idle system. The problem was observed in the
DMA32 zone, which was used as a fallback to the preferred Normal zone, but on
the 4GB system it was actually the largest zone. The problem is even amplified
for such a fallback zone - the deferred compaction logic, which could (after
being fixed by a previous patch) reset the cached scanner pfn's, is only
applied to the preferred zone and not to the fallbacks.

The problem in the third phase of the benchmark was further amplified by commit
81c0a2bb ("mm: page_alloc: fair zone allocator policy") which resulted in a
non-deterministic regression of the allocation success rate from ~85% to ~65%.
This occurs in about half of benchmark runs, making bisection problematic.
It is unlikely that the commit itself is buggy, but it should put more pressure
on the DMA32 zone during phases 1 and 2, which may leave it more fragmented in
phase 3 and expose the bugs that this patch fixes.

The fix is to make scanners meeting in isolate_freepages() stay that way, and
to check in compact_zone() for scanners meeting when migrate_pages() returns
-ENOMEM. The result is that compact_finished() also detects scanners meeting
and sets the compact_blockskip_flush flag to make kswapd reset the scanner
pfn's.
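
A standalone illustration (userspace C, not the kernel code) of the new lower
bound for the free scanner: ALIGN() rounds migrate_pfn + 1 up to a pageblock
boundary and the loop condition becomes >=, so the scanner can still process
the pageblock just above the migrate scanner and then end up below it, which
compact_finished() can detect as the scanners having met:

#include <stdio.h>

#define pageblock_nr_pages 512UL
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long migrate_pfn = 1000448;	/* somewhere mid-zone */
	unsigned long free_pfn = 1001472;	/* two pageblocks above */
	unsigned long low_pfn = ALIGN(migrate_pfn + 1, pageblock_nr_pages);
	unsigned long pfn;

	for (pfn = free_pfn; pfn >= low_pfn; pfn -= pageblock_nr_pages)
		printf("free scanner visits pageblock at pfn %lu\n", pfn);

	/* pfn is now below low_pfn: the scanners have crossed */
	printf("scanners met below pfn %lu\n", low_pfn);
	return 0;
}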

The results in stress-highalloc benchmark show that the "regression" by commit
81c0a2bb in phase 3 no longer occurs, and phase 1 and 2 allocation success rates
are significantly improved.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 6a2f0c2..0702bdf 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -656,7 +656,7 @@ static void isolate_freepages(struct zone *zone,
 	 * is the end of the pageblock the migration scanner is using.
 	 */
 	pfn = cc->free_pfn;
-	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+	low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
 
 	/*
 	 * Take care that if the migration scanner is at the end of the zone
@@ -672,7 +672,7 @@ static void isolate_freepages(struct zone *zone,
 	 * pages on cc->migratepages. We stop searching if the migrate
 	 * and free page scanners meet or enough free pages are isolated.
 	 */
-	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+	for (; pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
 					pfn -= pageblock_nr_pages) {
 		unsigned long isolated;
 
@@ -734,7 +734,14 @@ static void isolate_freepages(struct zone *zone,
 	/* split_free_page does not map the pages */
 	map_pages(freelist);
 
-	cc->free_pfn = high_pfn;
+        /*
+         * If we crossed the migrate scanner, we want to keep it that way
+	 * so that compact_finished() may detect this
+	 */
+	if (pfn < low_pfn)
+		cc->free_pfn = max(pfn, zone->zone_start_pfn);
+	else
+		cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
 }
 
@@ -999,7 +1006,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		if (err) {
 			putback_movable_pages(&cc->migratepages);
 			cc->nr_migratepages = 0;
-			if (err == -ENOMEM) {
+			/*
+			 * migrate_pages() may return -ENOMEM when scanners meet
+			 * and we want compact_finished() to detect it
+			 */
+			if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
 				ret = COMPACT_PARTIAL;
 				goto out;
 			}
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
  2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
                   ` (2 preceding siblings ...)
  2013-11-25 14:26 ` [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages Vlastimil Babka
@ 2013-11-25 14:26 ` Vlastimil Babka
  2013-11-26 10:58   ` Mel Gorman
  2013-11-25 14:26 ` [PATCH 5/5] mm: compaction: reset scanner positions immediately when they meet Vlastimil Babka
  2013-12-04 14:30 ` [PATCH] mm: compaction: Trace compaction begin and end Mel Gorman
  5 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

Compaction temporarily marks pageblocks where it fails to isolate pages as
to-be-skipped in further compactions, in order to improve efficiency. One of
the reasons to fail isolating pages is that isolation is not attempted in
pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

The problem is that blocks skipped due to not being MIGRATE_MOVABLE in async
compaction become skipped due to the temporary mark also in future sync
compaction. Moreover, this may follow quite soon during __alloc_pages_slowpath,
without much time for kswapd to clear the pageblock skip marks. This goes
against the idea that sync compaction should try to scan these blocks more
thoroughly than the async compaction.

The fix is to ensure in async compaction that these !MIGRATE_MOVABLE blocks are
not marked to be skipped. Note this should not affect performance or locking
impact of further async compactions, as skipping a block due to being
!MIGRATE_MOVABLE is done soon after skipping a block marked to be skipped, both
without locking.
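
A minimal sketch (userspace C, not the kernel function) of the skip-marking
decision with this change: the temporary skip mark is only set when the whole
pageblock was scanned and, in async compaction, it was not passed over merely
for being !MIGRATE_MOVABLE:

#include <stdbool.h>

/*
 * 'whole_block_scanned' corresponds to low_pfn == end_pfn in
 * isolate_migratepages_range(), 'skipped_unmovable' to the new flag.
 */
bool should_update_pageblock_skip(bool whole_block_scanned,
				  bool skipped_unmovable)
{
	/* before this patch, only the first condition was checked */
	return whole_block_scanned && !skipped_unmovable;
}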

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 0702bdf..f481193 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -455,6 +455,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 	unsigned long flags;
 	bool locked = false;
 	struct page *page = NULL, *valid_page = NULL;
+	bool skipped_unmovable = false;
+
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -530,6 +532,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
 		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
 			cc->finished_update_migrate = true;
+			skipped_unmovable = true;
 			goto next_pageblock;
 		}
 
@@ -624,7 +627,7 @@ next_pageblock:
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 
 	/* Update the pageblock-skip if the whole pageblock was scanned */
-	if (low_pfn == end_pfn)
+	if (low_pfn == end_pfn && !skipped_unmovable)
 		update_pageblock_skip(cc, valid_page, nr_isolated, true);
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] mm: compaction: reset scanner positions immediately when they meet
  2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
                   ` (3 preceding siblings ...)
  2013-11-25 14:26 ` [PATCH 4/5] mm: compaction: do not mark unmovable pageblocks as skipped in async compaction Vlastimil Babka
@ 2013-11-25 14:26 ` Vlastimil Babka
  2013-11-26 11:03   ` Mel Gorman
  2013-12-04 14:30 ` [PATCH] mm: compaction: Trace compaction begin and end Mel Gorman
  5 siblings, 1 reply; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

Compaction used to start its migrate and free page scanners at the zone's lowest
and highest pfn, respectively. Later, caching was introduced to remember the
scanners' progress across compaction attempts so that pageblocks are not
re-scanned uselessly. Additionally, pageblocks where isolation failed are
marked to be quickly skipped when encountered again in future compactions.

Currently, both the reset of cached pfn's and clearing of the pageblock skip
information for a zone are done in __reset_isolation_suitable(). This function
gets called when:
 - compaction is restarting after being deferred
 - compact_blockskip_flush flag is set in compact_finished() when the scanners
   meet (and not again cleared when direct compaction succeeds in allocation)
   and kswapd acts upon this flag before going to sleep

This behavior is suboptimal for several reasons:
 - when direct sync compaction is called after async compaction fails (in the
   allocation slowpath), it will effectively do nothing, unless kswapd
   happens to process the compact_blockskip_flush flag meanwhile. This is racy
   and goes against the purpose of sync compaction to more thoroughly retry
   the compaction of a zone where async compaction has failed.
   The restart-after-deferring path cannot help here as deferring happens only
   after the sync compaction fails. It is also done only for the preferred
   zone, while the compaction might be done for a fallback zone.
 - the mechanism of marking pageblocks to be skipped has little value since the
   cached pfn's are reset only together with the pageblock skip flags. This
   effectively limits pageblock skip usage to parallel compactions.

This patch changes compact_finished() so that cached pfn's are reset
immediately when the scanners meet. Clearing pageblock skip flags is unchanged,
as well as the other situations where cached pfn's are reset. This allows the
sync-after-async compaction to retry pageblocks not marked as skipped, such as
the !MIGRATE_MOVABLE blocks that async compaction now skips without marking
them.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index f481193..2c2cc4a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -843,6 +843,10 @@ static int compact_finished(struct zone *zone,
 
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn) {
+		/* Let the next compaction start anew. */
+		zone->compact_cached_migrate_pfn = zone->zone_start_pfn;
+	        zone->compact_cached_free_pfn = zone_end_pfn(zone);
+
 		/*
 		 * Mark that the PG_migrate_skip information should be cleared
 		 * by kswapd when it goes to sleep. kswapd does not set the
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] mm: compaction: encapsulate defer reset logic
  2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
@ 2013-11-25 22:08   ` Rik van Riel
  2013-11-26 10:16   ` Mel Gorman
  1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2013-11-25 22:08 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm; +Cc: linux-kernel, Mel Gorman

On 11/25/2013 09:26 AM, Vlastimil Babka wrote:
> Currently there are several functions to manipulate the deferred compaction
> state variables. The remaining case where the variables are touched directly
> is when a successful allocation occurs in direct compaction, or is expected
> to be successful in the future by kswapd. Here, the lowest order that is
> expected to fail is updated, and in the case of direct compaction, the deferred
> status is reset completely.
>
> Create a new function compaction_defer_reset() to encapsulate this
> functionality and make it easier to understand the code. No functional change.
>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] mm: compaction: encapsulate defer reset logic
  2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
  2013-11-25 22:08   ` Rik van Riel
@ 2013-11-26 10:16   ` Mel Gorman
  1 sibling, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2013-11-26 10:16 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-kernel, Rik van Riel

On Mon, Nov 25, 2013 at 03:26:06PM +0100, Vlastimil Babka wrote:
> Currently there are several functions to manipulate the deferred compaction
> state variables. The remaining case where the variables are touched directly
> is when a successful allocation occurs in direct compaction, or is expected
> to be successful in the future by kswapd. Here, the lowest order that is
> expected to fail is updated, and in the case of direct compaction, the deferred
> status is reset completely.
> 
> Create a new function compaction_defer_reset() to encapsulate this
> functionality and make it easier to understand the code. No functional change.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  include/linux/compaction.h | 12 ++++++++++++
>  mm/compaction.c            |  9 ++++-----
>  mm/page_alloc.c            |  5 +----
>  3 files changed, 17 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 091d72e..da39978 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -62,6 +62,18 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  	return zone->compact_considered < defer_limit;
>  }
>  
> +/* Update defer tracking counters after successful allocation of given order */
> +static inline void compaction_defer_reset(struct zone *zone, int order,
> +		bool reset_shift)
> +{
> +	if (reset_shift) {
> +		zone->compact_considered = 0;
> +		zone->compact_defer_shift = 0;
> +	}
> +	if (order >= zone->compact_order_failed)
> +		zone->compact_order_failed = order + 1;
> +}
> +

Nit pick

The comment says this is called after a successful allocation but that
is only true in one case. s/allocation/compaction/ ?

reset_shift says what it does but not why and exposes an unnecessary. If
this sees a second revision, maybe consider renaming it to something like
alloc_success?

With or without changes;

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them
  2013-11-25 14:26 ` [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them Vlastimil Babka
@ 2013-11-26 10:23   ` Mel Gorman
  2013-11-26 13:16   ` Rik van Riel
  1 sibling, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2013-11-26 10:23 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-kernel, Rik van Riel

On Mon, Nov 25, 2013 at 03:26:07PM +0100, Vlastimil Babka wrote:
> Compaction caches pfn's for its migrate and free scanners to avoid scanning
> the whole zone each time. In compact_zone(), the cached values are read to
> set up initial values for the scanners. There are several situations when
> these cached pfn's are reset to the first and last pfn of the zone,
> respectively. One of these situations is when a compaction has been deferred
> for a zone and is now being restarted during a direct compaction, which is also
> done in compact_zone().
> 
> However, compact_zone() currently reads the cached pfn's *before* resetting
> them. This means the reset doesn't affect the compaction that performs it, and
> with good chance also subsequent compactions, as update_pageblock_skip() is
> likely to be called and update the cached pfn's to those being processed.
> Another chance for a successful reset is when a direct compaction detects that
> migration and free scanners meet (which has its own problems addressed by
> another patch) and sets update_pageblock_skip flag which kswapd uses to do the
> reset because it goes to sleep.
> 
> This is clearly a bug that results in non-deterministic behavior, so this patch
> moves the cached pfn reset to be performed *before* the values are read.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages
  2013-11-25 14:26 ` [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages Vlastimil Babka
@ 2013-11-26 10:45   ` Mel Gorman
  2013-11-26 16:44     ` Vlastimil Babka
  0 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2013-11-26 10:45 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-kernel, Rik van Riel

On Mon, Nov 25, 2013 at 03:26:08PM +0100, Vlastimil Babka wrote:
> Compaction of a zone is finished when the migrate scanner (which begins at the
> zone's lowest pfn) meets the free page scanner (which begins at the zone's
> highest pfn). This is detected in compact_zone() and in the case of direct
> compaction, the compact_blockskip_flush flag is set so that kswapd later resets
> the cached scanner pfn's, and a new compaction may again start at the zone's
> borders.
> 
> The meeting of the scanners can happen during either scanner's activity.
> However, it may currently fail to be detected when it occurs in the free page
> scanner, due to two problems. First, isolate_freepages() keeps free_pfn at the
> highest block where it isolated pages from, for the purposes of not missing the
> pages that are returned back to allocator when migration fails. Second, failing
> to isolate enough free pages due to scanners meeting results in -ENOMEM being
> returned by migrate_pages(), which makes compact_zone() bail out immediately
> without calling compact_finished() that would detect scanners meeting.
> 
> This failure to detect scanners meeting might result in repeated attempts at
> compaction of a zone that keep starting from the cached pfn's close to the
> meeting point, and quickly failing through the -ENOMEM path, without the cached
> pfns being reset, over and over. This has been observed (through additional
> tracepoints) in the third phase of the mmtests stress-highalloc benchmark, where
> the allocator runs on an otherwise idle system. The problem was observed in the
> DMA32 zone, which was used as a fallback to the preferred Normal zone, but on
> the 4GB system it was actually the largest zone. The problem is even amplified
> for such fallback zone - the deferred compaction logic, which could (after
> being fixed by a previous patch) reset the cached scanner pfn's, is only
> applied to the preferred zone and not for the fallbacks.
> 
> The problem in the third phase of the benchmark was further amplified by commit
> 81c0a2bb ("mm: page_alloc: fair zone allocator policy") which resulted in a
> non-deterministic regression of the allocation success rate from ~85% to ~65%.
> This occurs in about half of benchmark runs, making bisection problematic.
> It is unlikely that the commit itself is buggy, but it should put more pressure
> on the DMA32 zone during phases 1 and 2, which may leave it more fragmented in
> phase 3 and expose the bugs that this patch fixes.
> 
> The fix is to make scanners meeting in isolate_freepage() stay that way, and
> to check in compact_zone() for scanners meeting when migrate_pages() returns
> -ENOMEM. The result is that compact_finished() also detects scanners meeting
> and sets the compact_blockskip_flush flag to make kswapd reset the scanner
> pfn's.
> 
> The results in stress-highalloc benchmark show that the "regression" by commit
> 81c0a2bb in phase 3 no longer occurs, and phase 1 and 2 allocation success rates
> are significantly improved.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/compaction.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 6a2f0c2..0702bdf 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -656,7 +656,7 @@ static void isolate_freepages(struct zone *zone,
>  	 * is the end of the pageblock the migration scanner is using.
>  	 */
>  	pfn = cc->free_pfn;
> -	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> +	low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
>  
>  	/*
>  	 * Take care that if the migration scanner is at the end of the zone
> @@ -672,7 +672,7 @@ static void isolate_freepages(struct zone *zone,
>  	 * pages on cc->migratepages. We stop searching if the migrate
>  	 * and free page scanners meet or enough free pages are isolated.
>  	 */
> -	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> +	for (; pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
>  					pfn -= pageblock_nr_pages) {
>  		unsigned long isolated;
>  
> @@ -734,7 +734,14 @@ static void isolate_freepages(struct zone *zone,
>  	/* split_free_page does not map the pages */
>  	map_pages(freelist);
>  
> -	cc->free_pfn = high_pfn;
> +        /*
> +         * If we crossed the migrate scanner, we want to keep it that way
> +	 * so that compact_finished() may detect this
> +	 */

Whitespace damage.

> +	if (pfn < low_pfn)
> +		cc->free_pfn = max(pfn, zone->zone_start_pfn);

Is it even possible for this condition to occur? low_pfn bound is
cc->migrate_pfn and the free scanner should only start after some pages
have already been isolated for migration.

> +	else
> +		cc->free_pfn = high_pfn;
>  	cc->nr_freepages = nr_freepages;
>  }
>  
> @@ -999,7 +1006,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  		if (err) {
>  			putback_movable_pages(&cc->migratepages);
>  			cc->nr_migratepages = 0;
> -			if (err == -ENOMEM) {
> +			/*
> +			 * migrate_pages() may return -ENOMEM when scanners meet
> +			 * and we want compact_finished() to detect it
> +			 */
> +			if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
>  				ret = COMPACT_PARTIAL;
>  				goto out;
>  			}
> -- 
> 1.8.1.4
> 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
  2013-11-25 14:26 ` [PATCH 4/5] mm: compaction: do not mark unmovable pageblocks as skipped in async compaction Vlastimil Babka
@ 2013-11-26 10:58   ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2013-11-26 10:58 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-kernel, Rik van Riel

On Mon, Nov 25, 2013 at 03:26:09PM +0100, Vlastimil Babka wrote:
> Compaction temporarily marks pageblocks where it fails to isolate pages as
> to-be-skipped in further compactions, in order to improve efficiency. One of
> the reasons to fail isolating pages is that isolation is not attempted in
> pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.
> 
> The problem is that blocks skipped due to not being MIGRATE_MOVABLE in async
> compaction become skipped due to the temporary mark also in future sync
> compaction. Moreover, this may follow quite soon during __alloc_page_slowpath,
> without much time for kswapd to clear the pageblock skip marks. This goes
> against the idea that sync compaction should try to scan these blocks more
> thoroughly than the async compaction.
> 
> The fix is to ensure in async compaction that these !MIGRATE_MOVABLE blocks are
> not marked to be skipped. Note this should not affect performance or locking
> impact of further async compactions, as skipping a block due to being
> !MIGRATE_MOVABLE is done soon after skipping a block marked to be skipped, both
> without locking.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/compaction.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 0702bdf..f481193 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -455,6 +455,8 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  	unsigned long flags;
>  	bool locked = false;
>  	struct page *page = NULL, *valid_page = NULL;
> +	bool skipped_unmovable = false;
> +
>  

whitespace damage.

>  	/*
>  	 * Ensure that there are not too many pages isolated from the LRU
> @@ -530,6 +532,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
>  		if (!cc->sync && last_pageblock_nr != pageblock_nr &&
>  		    !migrate_async_suitable(get_pageblock_migratetype(page))) {
>  			cc->finished_update_migrate = true;
> +			skipped_unmovable = true;
>  			goto next_pageblock;
>  		}
>  

Minor nitpick, but it's also unreclaimable and isolate blocks that are
being skipped here. If you do another revision, consider rephrasing
s/unmovable/unsuitable/ where appropriate. It's fairly obvious from
context so if you decide not to, that's fine too.

> @@ -624,7 +627,7 @@ next_pageblock:
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  
>  	/* Update the pageblock-skip if the whole pageblock was scanned */
> -	if (low_pfn == end_pfn)
> +	if (low_pfn == end_pfn && !skipped_unmovable)
>  		update_pageblock_skip(cc, valid_page, nr_isolated, true);
>  

This comment is now out of date. If the comment gets updated then feel
free to add

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] mm: compaction: reset scanner positions immediately when they meet
  2013-11-25 14:26 ` [PATCH 5/5] mm: compaction: reset scanner positions immediately when they meet Vlastimil Babka
@ 2013-11-26 11:03   ` Mel Gorman
  0 siblings, 0 replies; 20+ messages in thread
From: Mel Gorman @ 2013-11-26 11:03 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-kernel, Rik van Riel

On Mon, Nov 25, 2013 at 03:26:10PM +0100, Vlastimil Babka wrote:
> Compaction used to start its migrate and free page scaners at the zone's lowest
> and highest pfn, respectively. Later, caching was introduced to remember the
> scanners' progress across compaction attempts so that pageblocks are not
> re-scanned uselessly. Additionally, pageblocks where isolation failed are
> marked to be quickly skipped when encountered again in future compactions.
> 
> Currently, both the reset of cached pfn's and clearing of the pageblock skip
> information for a zone is done in __reset_isolation_suitable(). This function
> gets called when:
>  - compaction is restarting after being deferred
>  - compact_blockskip_flush flag is set in compact_finished() when the scanners
>    meet (and not again cleared when direct compaction succeeds in allocation)
>    and kswapd acts upon this flag before going to sleep
> 
> This behavior is suboptimal for several reasons:
>  - when direct sync compaction is called after async compaction fails (in the
>    allocation slowpath), it will effectively do nothing, unless kswapd
>    happens to process the compact_blockskip_flush flag meanwhile. This is racy
>    and goes against the purpose of sync compaction to more thoroughly retry
>    the compaction of a zone where async compaction has failed.
>    The restart-after-deferring path cannot help here as deferring happens only
>    after the sync compaction fails. It is also done only for the preferred
>    zone, while the compaction might be done for a fallback zone.
>  - the mechanism of marking pageblock to be skipped has little value since the
>    cached pfn's are reset only together with the pageblock skip flags. This
>    effectively limits pageblock skip usage to parallel compactions.
> 
> This patch changes compact_finished() so that cached pfn's are reset
> immediately when the scanners meet. Clearing pageblock skip flags is unchanged,
> as well as the other situations where cached pfn's are reset. This allows the
> sync-after-async compaction to retry pageblocks not marked as skipped, such as
> blocks !MIGRATE_MOVABLE blocks that async compactions now skips without
> marking them.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them
  2013-11-25 14:26 ` [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them Vlastimil Babka
  2013-11-26 10:23   ` Mel Gorman
@ 2013-11-26 13:16   ` Rik van Riel
  1 sibling, 0 replies; 20+ messages in thread
From: Rik van Riel @ 2013-11-26 13:16 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-kernel, Mel Gorman

On 11/25/2013 09:26 AM, Vlastimil Babka wrote:
> Compaction caches pfn's for its migrate and free scanners to avoid scanning
> the whole zone each time. In compact_zone(), the cached values are read to
> set up initial values for the scanners. There are several situations when
> these cached pfn's are reset to the first and last pfn of the zone,
> respectively. One of these situations is when a compaction has been deferred
> for a zone and is now being restarted during a direct compaction, which is also
> done in compact_zone().
> 
> However, compact_zone() currently reads the cached pfn's *before* resetting
> them. This means the reset doesn't affect the compaction that performs it, and
> with good chance also subsequent compactions, as update_pageblock_skip() is
> likely to be called and update the cached pfn's to those being processed.
> Another chance for a successful reset is when a direct compaction detects that
> migration and free scanners meet (which has its own problems addressed by
> another patch) and sets update_pageblock_skip flag which kswapd uses to do the
> reset because it goes to sleep.
> 
> This is clearly a bug that results in non-deterministic behavior, so this patch
> moves the cached pfn reset to be performed *before* the values are read.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages
  2013-11-26 10:45   ` Mel Gorman
@ 2013-11-26 16:44     ` Vlastimil Babka
  0 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-26 16:44 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel, Rik van Riel

On 11/26/2013 11:45 AM, Mel Gorman wrote:
> On Mon, Nov 25, 2013 at 03:26:08PM +0100, Vlastimil Babka wrote:
>> Compaction of a zone is finished when the migrate scanner (which begins at the
>> zone's lowest pfn) meets the free page scanner (which begins at the zone's
>> highest pfn). This is detected in compact_zone() and in the case of direct
>> compaction, the compact_blockskip_flush flag is set so that kswapd later resets
>> the cached scanner pfn's, and a new compaction may again start at the zone's
>> borders.
>>
>> The meeting of the scanners can happen during either scanner's activity.
>> However, it may currently fail to be detected when it occurs in the free page
>> scanner, due to two problems. First, isolate_freepages() keeps free_pfn at the
>> highest block where it isolated pages from, for the purpose of not missing the
>> pages that are returned back to the allocator when migration fails. Second, failing
>> to isolate enough free pages due to scanners meeting results in -ENOMEM being
>> returned by migrate_pages(), which makes compact_zone() bail out immediately
>> without calling compact_finished() that would detect scanners meeting.
>>
>> This failure to detect scanners meeting might result in repeated attempts at
>> compaction of a zone that keep starting from the cached pfn's close to the
>> meeting point, and quickly failing through the -ENOMEM path, without the cached
>> pfns being reset, over and over. This has been observed (through additional
>> tracepoints) in the third phase of the mmtests stress-highalloc benchmark, where
>> the allocator runs on an otherwise idle system. The problem was observed in the
>> DMA32 zone, which was used as a fallback to the preferred Normal zone, but on
>> the 4GB system it was actually the largest zone. The problem is even amplified
>> for such a fallback zone - the deferred compaction logic, which could (after
>> being fixed by a previous patch) reset the cached scanner pfn's, is only
>> applied to the preferred zone and not to the fallbacks.
>>
>> The problem in the third phase of the benchmark was further amplified by commit
>> 81c0a2bb ("mm: page_alloc: fair zone allocator policy") which resulted in a
>> non-deterministic regression of the allocation success rate from ~85% to ~65%.
>> This occurs in about half of benchmark runs, making bisection problematic.
>> It is unlikely that the commit itself is buggy, but it should put more pressure
>> on the DMA32 zone during phases 1 and 2, which may leave it more fragmented in
>> phase 3 and expose the bugs that this patch fixes.
>>
>> The fix is to make scanners meeting in isolate_freepages() stay that way, and
>> to check in compact_zone() for scanners meeting when migrate_pages() returns
>> -ENOMEM. The result is that compact_finished() also detects scanners meeting
>> and sets the compact_blockskip_flush flag to make kswapd reset the scanner
>> pfn's.
>>
>> The results in stress-highalloc benchmark show that the "regression" by commit
>> 81c0a2bb in phase 3 no longer occurs, and phase 1 and 2 allocation success rates
>> are significantly improved.
>>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/compaction.c | 19 +++++++++++++++----
>>  1 file changed, 15 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 6a2f0c2..0702bdf 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -656,7 +656,7 @@ static void isolate_freepages(struct zone *zone,
>>  	 * is the end of the pageblock the migration scanner is using.
>>  	 */
>>  	pfn = cc->free_pfn;
>> -	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
>> +	low_pfn = ALIGN(cc->migrate_pfn + 1, pageblock_nr_pages);
>>  
>>  	/*
>>  	 * Take care that if the migration scanner is at the end of the zone
>> @@ -672,7 +672,7 @@ static void isolate_freepages(struct zone *zone,
>>  	 * pages on cc->migratepages. We stop searching if the migrate
>>  	 * and free page scanners meet or enough free pages are isolated.
>>  	 */
>> -	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
>> +	for (; pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
>>  					pfn -= pageblock_nr_pages) {
>>  		unsigned long isolated;
>>  
>> @@ -734,7 +734,14 @@ static void isolate_freepages(struct zone *zone,
>>  	/* split_free_page does not map the pages */
>>  	map_pages(freelist);
>>  
>> -	cc->free_pfn = high_pfn;
>> +        /*
>> +         * If we crossed the migrate scanner, we want to keep it that way
>> +	 * so that compact_finished() may detect this
>> +	 */
> 
> Whitespace damage.

Thanks, will fix.

>> +	if (pfn < low_pfn)
>> +		cc->free_pfn = max(pfn, zone->zone_start_pfn);
> 
> Is it even possible for this condition to occur? low_pfn bound is
> cc->migrate_pfn and the free scanner should only start after some pages
> have already been isolated for migration.

If a zone starts in the middle of a pageblock and the migrate scanner isolates
enough pages early to stay within that pageblock, low_pfn will be at the
end of that pageblock, and after the for loop in this function ends, pfn
might be at the beginning of that pageblock. It might not be an actual
problem (this compaction will finish at this point, and anyone racing with
us will probably check the boundaries themselves), but I played it safe.
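
To make that concrete, a small userspace sketch of the pfn arithmetic (the
numbers and the 512-page pageblock are made up for illustration, ALIGN() is
reproduced here, and the loop only mimics how isolate_freepages() steps the
free scanner down one pageblock at a time):

#include <stdio.h>

#define PAGEBLOCK_NR_PAGES	512UL	/* example order-9 pageblock */
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long zone_start_pfn = 700;  /* zone begins mid-pageblock */
	unsigned long migrate_pfn = 800;     /* migrate scanner stayed in that block */
	unsigned long pfn = 2048;            /* free scanner's starting pfn */

	/* lowest pageblock the free scanner is allowed to visit */
	unsigned long low_pfn = ALIGN(migrate_pfn + 1, PAGEBLOCK_NR_PAGES);

	while (pfn >= low_pfn)		/* step down pageblock by pageblock */
		pfn -= PAGEBLOCK_NR_PAGES;

	/* pfn is now 512: the start of the pageblock that contains the zone start */
	printf("low_pfn=%lu pfn=%lu\n", low_pfn, pfn);

	/* clamping as in the patch: max(pfn, zone_start_pfn) == 700 */
	printf("free_pfn=%lu\n", pfn > zone_start_pfn ? pfn : zone_start_pfn);
	return 0;
}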

>> +	else
>> +		cc->free_pfn = high_pfn;
>>  	cc->nr_freepages = nr_freepages;
>>  }
>>  
>> @@ -999,7 +1006,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>>  		if (err) {
>>  			putback_movable_pages(&cc->migratepages);
>>  			cc->nr_migratepages = 0;
>> -			if (err == -ENOMEM) {
>> +			/*
>> +			 * migrate_pages() may return -ENOMEM when scanners meet
>> +			 * and we want compact_finished() to detect it
>> +			 */
>> +			if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
>>  				ret = COMPACT_PARTIAL;
>>  				goto out;
>>  			}
>> -- 
>> 1.8.1.4
>>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] mm: compaction: Trace compaction begin and end
  2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
                   ` (4 preceding siblings ...)
  2013-11-25 14:26 ` [PATCH 5/5] mm: compaction: reset scanner positions immediately when they meet Vlastimil Babka
@ 2013-12-04 14:30 ` Mel Gorman
  2013-12-04 14:51   ` Vlastimil Babka
  5 siblings, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2013-12-04 14:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel, Vlastimil Babka

This patch adds two tracepoints for compaction begin and end of a zone. Using
this it is possible to calculate how much time a workload is spending
within compaction and potentially debug problems related to cached pfns
for scanning. In combination with the direct reclaim and slab trace points
it should be possible to estimate most allocation-related overhead for
a workload.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++++++++++++++
 mm/compaction.c                   |  4 ++++
 2 files changed, 46 insertions(+)

diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index fde1b3e..f4e115a 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
 		__entry->nr_failed)
 );
 
+TRACE_EVENT(mm_compaction_begin,
+	TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
+		unsigned long zone_end, unsigned long free_start),
+
+	TP_ARGS(zone_start, migrate_start, zone_end, free_start),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, zone_start)
+		__field(unsigned long, migrate_start)
+		__field(unsigned long, zone_end)
+		__field(unsigned long, free_start)
+	),
+
+	TP_fast_assign(
+		__entry->zone_start = zone_start;
+		__entry->migrate_start = migrate_start;
+		__entry->zone_end = zone_end;
+		__entry->free_start = free_start;
+	),
+
+	TP_printk("zone_start=%lu migrate_start=%lu zone_end=%lu free_start=%lu",
+		__entry->zone_start,
+		__entry->migrate_start,
+		__entry->zone_end,
+		__entry->free_start)
+);
+
+TRACE_EVENT(mm_compaction_end,
+	TP_PROTO(int status),
+
+	TP_ARGS(status),
+
+	TP_STRUCT__entry(
+		__field(int, status)
+	),
+
+	TP_fast_assign(
+		__entry->status = status;
+	),
+
+	TP_printk("status=%d", __entry->status)
+);
 
 #endif /* _TRACE_COMPACTION_H */
 
diff --git a/mm/compaction.c b/mm/compaction.c
index c437893..78ff866 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -960,6 +960,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
 		__reset_isolation_suitable(zone);
 
+	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, end_pfn, cc->free_pfn);
+
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
@@ -1005,6 +1007,8 @@ out:
 	cc->nr_freepages -= release_freepages(&cc->freepages);
 	VM_BUG_ON(cc->nr_freepages != 0);
 
+	trace_mm_compaction_end(ret);
+
 	return ret;
 }
 
-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: compaction: Trace compaction begin and end
  2013-12-04 14:30 ` [PATCH] mm: compaction: Trace compaction begin and end Mel Gorman
@ 2013-12-04 14:51   ` Vlastimil Babka
  2013-12-05  9:05     ` Mel Gorman
  2013-12-05  9:07     ` [PATCH] mm: compaction: Trace compaction begin and end v2 Mel Gorman
  0 siblings, 2 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-12-04 14:51 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

On 12/04/2013 03:30 PM, Mel Gorman wrote:
> This patch adds two tracepoints for compaction begin and end of a zone. Using
> this it is possible to calculate how much time a workload is spending
> within compaction and potentially debug problems related to cached pfns
> for scanning.

I guess for debugging pfns it would also be useful to print their values
in mm_compaction_end.

> In combination with the direct reclaim and slab trace points
> it should be possible to estimate most allocation-related overhead for
> a workload.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>   include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++++++++++++++
>   mm/compaction.c                   |  4 ++++
>   2 files changed, 46 insertions(+)
>
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index fde1b3e..f4e115a 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
>   		__entry->nr_failed)
>   );
>
> +TRACE_EVENT(mm_compaction_begin,
> +	TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
> +		unsigned long zone_end, unsigned long free_start),
> +
> +	TP_ARGS(zone_start, migrate_start, zone_end, free_start),

IMHO a better order would be:
  zone_start, migrate_start, free_start, zone_end
(well especially in the TP_printk part anyway).

> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, zone_start)
> +		__field(unsigned long, migrate_start)
> +		__field(unsigned long, zone_end)
> +		__field(unsigned long, free_start)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->zone_start = zone_start;
> +		__entry->migrate_start = migrate_start;
> +		__entry->zone_end = zone_end;
> +		__entry->free_start = free_start;
> +	),
> +
> +	TP_printk("zone_start=%lu migrate_start=%lu zone_end=%lu free_start=%lu",
> +		__entry->zone_start,
> +		__entry->migrate_start,
> +		__entry->zone_end,
> +		__entry->free_start)
> +);
> +
> +TRACE_EVENT(mm_compaction_end,
> +	TP_PROTO(int status),
> +
> +	TP_ARGS(status),
> +
> +	TP_STRUCT__entry(
> +		__field(int, status)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->status = status;
> +	),
> +
> +	TP_printk("status=%d", __entry->status)
> +);
>
>   #endif /* _TRACE_COMPACTION_H */
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c437893..78ff866 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -960,6 +960,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>   	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
>   		__reset_isolation_suitable(zone);
>
> +	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, end_pfn, cc->free_pfn);
> +
>   	migrate_prep_local();
>
>   	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> @@ -1005,6 +1007,8 @@ out:
>   	cc->nr_freepages -= release_freepages(&cc->freepages);
>   	VM_BUG_ON(cc->nr_freepages != 0);
>
> +	trace_mm_compaction_end(ret);
> +
>   	return ret;
>   }
>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: compaction: Trace compaction begin and end
  2013-12-04 14:51   ` Vlastimil Babka
@ 2013-12-05  9:05     ` Mel Gorman
  2013-12-06  9:50       ` Vlastimil Babka
  2013-12-05  9:07     ` [PATCH] mm: compaction: Trace compaction begin and end v2 Mel Gorman
  1 sibling, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2013-12-05  9:05 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel

On Wed, Dec 04, 2013 at 03:51:57PM +0100, Vlastimil Babka wrote:
> On 12/04/2013 03:30 PM, Mel Gorman wrote:
> >This patch adds two tracepoints for compaction begin and end of a zone. Using
> >this it is possible to calculate how much time a workload is spending
> >within compaction and potentially debug problems related to cached pfns
> >for scanning.
> 
> I guess for debugging pfns it would be also useful to print their
> values also in mm_compaction_end.
> 

What additional information would we get from it and what new
conclusions could we draw? We could guess how much work the
scanners did but the trace_mm_compaction_isolate_freepages and
trace_mm_compaction_isolate_migratepages tracepoints already accurately
tell us that. The scanner PFNs alone do not tell us if the cached pfns
were updated, and even if they were, the information can be changed by
parallel resets so it would be hard to draw reasonable conclusions from
the information. We could guess where compaction hotspots might be but
without the skip information, we could not detect it accurately.  If we
wanted to detect that accurately, the mm_compaction_isolate* tracepoints
would be the ones to update.

I was primarily concerned about compaction time so I might be looking
at this the wrong way but it feels like having the PFNs at the end of a
compaction cycle would be of marginal benefit.

> >In combination with the direct reclaim and slab trace points
> >it should be possible to estimate most allocation-related overhead for
> >a workload.
> >
> >Signed-off-by: Mel Gorman <mgorman@suse.de>
> >---
> >  include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++++++++++++++
> >  mm/compaction.c                   |  4 ++++
> >  2 files changed, 46 insertions(+)
> >
> >diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> >index fde1b3e..f4e115a 100644
> >--- a/include/trace/events/compaction.h
> >+++ b/include/trace/events/compaction.h
> >@@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
> >  		__entry->nr_failed)
> >  );
> >
> >+TRACE_EVENT(mm_compaction_begin,
> >+	TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
> >+		unsigned long zone_end, unsigned long free_start),
> >+
> >+	TP_ARGS(zone_start, migrate_start, zone_end, free_start),
> 
> IMHO a better order would be:
>  zone_start, migrate_start, free_start, zone_end
> (well especially in the TP_printk part anyway).
> 

Ok, that would put them in PFN order which may be easier to visualise.
I'll post a V2 with that change at least.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] mm: compaction: Trace compaction begin and end v2
  2013-12-04 14:51   ` Vlastimil Babka
  2013-12-05  9:05     ` Mel Gorman
@ 2013-12-05  9:07     ` Mel Gorman
  2013-12-06  9:50       ` Vlastimil Babka
  1 sibling, 1 reply; 20+ messages in thread
From: Mel Gorman @ 2013-12-05  9:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vlastimil Babka, linux-mm, linux-kernel, Rik van Riel

Changelog since V1
o Print output parameters in pfn order			(vbabka)

This patch adds two tracepoints for compaction begin and end of a zone. Using
this it is possible to calculate how much time a workload is spending
within compaction and potentially debug problems related to cached pfns
for scanning. In combination with the direct reclaim and slab trace points
it should be possible to estimate most allocation-related overhead for
a workload.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++++++++++++++
 mm/compaction.c                   |  4 ++++
 2 files changed, 46 insertions(+)

diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index fde1b3e..06f544e 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
 		__entry->nr_failed)
 );
 
+TRACE_EVENT(mm_compaction_begin,
+	TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
+		unsigned long free_start, unsigned long zone_end),
+
+	TP_ARGS(zone_start, migrate_start, free_start, zone_end),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, zone_start)
+		__field(unsigned long, migrate_start)
+		__field(unsigned long, free_start)
+		__field(unsigned long, zone_end)
+	),
+
+	TP_fast_assign(
+		__entry->zone_start = zone_start;
+		__entry->migrate_start = migrate_start;
+		__entry->free_start = free_start;
+		__entry->zone_end = zone_end;
+	),
+
+	TP_printk("zone_start=%lu migrate_start=%lu free_start=%lu zone_end=%lu",
+		__entry->zone_start,
+		__entry->migrate_start,
+		__entry->free_start,
+		__entry->zone_end)
+);
+
+TRACE_EVENT(mm_compaction_end,
+	TP_PROTO(int status),
+
+	TP_ARGS(status),
+
+	TP_STRUCT__entry(
+		__field(int, status)
+	),
+
+	TP_fast_assign(
+		__entry->status = status;
+	),
+
+	TP_printk("status=%d", __entry->status)
+);
 
 #endif /* _TRACE_COMPACTION_H */
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 805165b..bb50fd3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -966,6 +966,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
 		__reset_isolation_suitable(zone);
 
+	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, cc->free_pfn, end_pfn);
+
 	migrate_prep_local();
 
 	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
@@ -1011,6 +1013,8 @@ out:
 	cc->nr_freepages -= release_freepages(&cc->freepages);
 	VM_BUG_ON(cc->nr_freepages != 0);
 
+	trace_mm_compaction_end(ret);
+
 	return ret;
 }
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: compaction: Trace compaction begin and end
  2013-12-05  9:05     ` Mel Gorman
@ 2013-12-06  9:50       ` Vlastimil Babka
  0 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-12-06  9:50 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel, Rik van Riel

On 12/05/2013 10:05 AM, Mel Gorman wrote:
> On Wed, Dec 04, 2013 at 03:51:57PM +0100, Vlastimil Babka wrote:
>> On 12/04/2013 03:30 PM, Mel Gorman wrote:
>>> This patch adds two tracepoints for compaction begin and end of a zone. Using
>>> this it is possible to calculate how much time a workload is spending
>>> within compaction and potentially debug problems related to cached pfns
>>> for scanning.
>>
>> I guess for debugging pfns it would be also useful to print their
>> values also in mm_compaction_end.
>>
>
> What additional information would we get from it and what new
> conclusions could we draw? We could guess how much work the
> scanners did but the trace_mm_compaction_isolate_freepages and
> trace_mm_compaction_isolate_migratepages tracepoints already accurately
> tell us that. The scanner PFNs alone do not tell us if the cached pfns
> were updated, and even if they were, the information can be changed by
> parallel resets so it would be hard to draw reasonable conclusions from
> the information. We could guess where compaction hotspots might be but
> without the skip information, we could not detect it accurately.  If we
> wanted to detect that accurately, the mm_compaction_isolate* tracepoints
> would be the ones to update.

OK, I agree. I guess multiple compaction_begin events would hint at 
scanners being stuck anyway.

> I was primarily concerned about compaction time so I might be looking
> at this the wrong way but it feels like having the PFNs at the end of a
> compaction cycle would be of marginal benefit.
>
>>> In combination with the direct reclaim and slab trace points
>>> it should be possible to estimate most allocation-related overhead for
>>> a workload.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>> ---
>>>   include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++++++++++++++
>>>   mm/compaction.c                   |  4 ++++
>>>   2 files changed, 46 insertions(+)
>>>
>>> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
>>> index fde1b3e..f4e115a 100644
>>> --- a/include/trace/events/compaction.h
>>> +++ b/include/trace/events/compaction.h
>>> @@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
>>>   		__entry->nr_failed)
>>>   );
>>>
>>> +TRACE_EVENT(mm_compaction_begin,
>>> +	TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
>>> +		unsigned long zone_end, unsigned long free_start),
>>> +
>>> +	TP_ARGS(zone_start, migrate_start, zone_end, free_start),
>>
>> IMHO a better order would be:
>>   zone_start, migrate_start, free_start, zone_end
>> (well especially in the TP_printk part anyway).
>>
>
> Ok, that would put them in PFN order which may be easier to visualise.
> I'll post a V2 with that change at least.
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] mm: compaction: Trace compaction begin and end v2
  2013-12-05  9:07     ` [PATCH] mm: compaction: Trace compaction begin and end v2 Mel Gorman
@ 2013-12-06  9:50       ` Vlastimil Babka
  0 siblings, 0 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-12-06  9:50 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton; +Cc: linux-mm, linux-kernel, Rik van Riel

On 12/05/2013 10:07 AM, Mel Gorman wrote:
> Changelog since V1
> o Print output parameters in pfn order			(vbabka)
>
> This patch adds two tracepoints for compaction begin and end of a zone. Using
> this it is possible to calculate how much time a workload is spending
> within compaction and potentially debug problems related to cached pfns
> for scanning. In combination with the direct reclaim and slab trace points
> it should be possible to estimate most allocation-related overhead for
> a workload.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   include/trace/events/compaction.h | 42 +++++++++++++++++++++++++++++++++++++++
>   mm/compaction.c                   |  4 ++++
>   2 files changed, 46 insertions(+)
>
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index fde1b3e..06f544e 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -67,6 +67,48 @@ TRACE_EVENT(mm_compaction_migratepages,
>   		__entry->nr_failed)
>   );
>
> +TRACE_EVENT(mm_compaction_begin,
> +	TP_PROTO(unsigned long zone_start, unsigned long migrate_start,
> +		unsigned long free_start, unsigned long zone_end),
> +
> +	TP_ARGS(zone_start, migrate_start, free_start, zone_end),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned long, zone_start)
> +		__field(unsigned long, migrate_start)
> +		__field(unsigned long, free_start)
> +		__field(unsigned long, zone_end)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->zone_start = zone_start;
> +		__entry->migrate_start = migrate_start;
> +		__entry->free_start = free_start;
> +		__entry->zone_end = zone_end;
> +	),
> +
> +	TP_printk("zone_start=%lu migrate_start=%lu free_start=%lu zone_end=%lu",
> +		__entry->zone_start,
> +		__entry->migrate_start,
> +		__entry->free_start,
> +		__entry->zone_end)
> +);
> +
> +TRACE_EVENT(mm_compaction_end,
> +	TP_PROTO(int status),
> +
> +	TP_ARGS(status),
> +
> +	TP_STRUCT__entry(
> +		__field(int, status)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->status = status;
> +	),
> +
> +	TP_printk("status=%d", __entry->status)
> +);
>
>   #endif /* _TRACE_COMPACTION_H */
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 805165b..bb50fd3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -966,6 +966,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>   	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
>   		__reset_isolation_suitable(zone);
>
> +	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn, cc->free_pfn, end_pfn);
> +
>   	migrate_prep_local();
>
>   	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
> @@ -1011,6 +1013,8 @@ out:
>   	cc->nr_freepages -= release_freepages(&cc->freepages);
>   	VM_BUG_ON(cc->nr_freepages != 0);
>
> +	trace_mm_compaction_end(ret);
> +
>   	return ret;
>   }
>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2013-12-06  9:51 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-11-25 14:26 [RFC PATCH 0/5] Memory compaction efficiency improvements Vlastimil Babka
2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
2013-11-25 22:08   ` Rik van Riel
2013-11-26 10:16   ` Mel Gorman
2013-11-25 14:26 ` [PATCH 2/5] mm: compaction: reset cached scanner pfn's before reading them Vlastimil Babka
2013-11-26 10:23   ` Mel Gorman
2013-11-26 13:16   ` Rik van Riel
2013-11-25 14:26 ` [PATCH 3/5] mm: compaction: detect when scanners meet in isolate_freepages Vlastimil Babka
2013-11-26 10:45   ` Mel Gorman
2013-11-26 16:44     ` Vlastimil Babka
2013-11-25 14:26 ` [PATCH 4/5] mm: compaction: do not mark unmovable pageblocks as skipped in async compaction Vlastimil Babka
2013-11-26 10:58   ` Mel Gorman
2013-11-25 14:26 ` [PATCH 5/5] mm: compaction: reset scanner positions immediately when they meet Vlastimil Babka
2013-11-26 11:03   ` Mel Gorman
2013-12-04 14:30 ` [PATCH] mm: compaction: Trace compaction begin and end Mel Gorman
2013-12-04 14:51   ` Vlastimil Babka
2013-12-05  9:05     ` Mel Gorman
2013-12-06  9:50       ` Vlastimil Babka
2013-12-05  9:07     ` [PATCH] mm: compaction: Trace compaction begin and end v2 Mel Gorman
2013-12-06  9:50       ` Vlastimil Babka
