[RFC PATCH 0/5] Memory compaction efficiency improvements

* [RFC PATCH 0/5] Memory compaction efficiency improvements
@ 2013-11-25 14:26 Vlastimil Babka
  2013-11-25 14:26 ` [PATCH 1/5] mm: compaction: encapsulate defer reset logic Vlastimil Babka
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Vlastimil Babka @ 2013-11-25 14:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Vlastimil Babka, linux-kernel, Mel Gorman, Rik van Riel

The broad goal of the series is to improve allocation success rates for huge
pages through memory compaction, while trying not to increase the compaction
overhead. The original objective was to reintroduce capturing of high-order
pages freed by the compaction, before they are split by concurrent activity.
However, several bugs and opportunities for simple improvements were found in
the current implementation, mostly through extra tracepoints (which are however
too ugly for now to be considered for sending).

The patches mostly deal with two mechanisms that reduce compaction overhead,
which is caching the progress of migrate and free scanners, and marking
pageblocks where isolation failed to be skipped during further scans.

Patch 1 encapsulates the some functionality for handling deferred compactions
        for better maintainability, without a functional change
        type is not determined without being actually needed.

Patch 2 fixes a bug where cached scanner pfn's are sometimes reset only after
        they have been read to initialize a compaction run.

Patch 3 fixes a bug where scanners meeting is sometimes not properly detected
        and can lead to multiple compaction attempts quitting early without
        doing any work.

Patch 4 improves the chances of sync compaction to process pageblocks that
        async compaction has skipped due to being !MIGRATE_MOVABLE.

Patch 5 improves the chances of sync direct compaction to actually do anything
        when called after async compaction fails during allocation slowpath.

Some preliminary results with mmtests's stress-highalloc benchmark on a x86_64
machine with 4GB memory. First, the default GFP_HIGHUSER_MOVABLE allocations,
with the patches stacked on top of mainline master as of Friday (commit
a5d6e633 merging fixes from Andrew). Patch 1 is OK to serve as baseline due to
no functional change. Comments below.

stress-highalloc
                         master                master                master                master                master
                        1-nothp               2-nothp               3-nothp               4-nothp               5-nothp
Success 1       34.00 (  0.00%)       20.00 ( 41.18%)       44.00 (-29.41%)       45.00 (-32.35%)       25.00 ( 26.47%)
Success 2       31.00 (  0.00%)       21.00 ( 32.26%)       47.00 (-51.61%)       47.00 (-51.61%)       28.00 (  9.68%)
Success 3       68.00 (  0.00%)       88.00 (-29.41%)       86.00 (-26.47%)       87.00 (-27.94%)       88.00 (-29.41%)

              master      master      master      master      master
             1-nothp     2-nothp     3-nothp     4-nothp     5-nothp
User         6334.04     6343.09     5938.15     5860.00     6674.38
System       1044.15     1035.84     1022.68     1021.11     1055.76
Elapsed      1787.06     1714.76     1829.14     1850.91     1789.83

                                master      master      master      master      master
                               1-nothp     2-nothp     3-nothp     4-nothp     5-nothp
Minor Faults                 248365069   244975796   247192462   243720231   248888409
Major Faults                       427         442         563         504         414
Swap Ins                             7           3           8           7           0
Swap Outs                          345         338         570         235         415
Direct pages scanned            239929      166220      276238      277310      202409
Kswapd pages scanned           1759082     1819998     1880477     1850421     1809928
Kswapd pages reclaimed         1756781     1813653     1877783     1847704     1806347
Direct pages reclaimed          239291      165988      276163      277048      202092
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity                984.344    1061.372    1028.066     999.736    1011.229
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                134.259      96.935     151.021     149.824     113.088
Percentage direct scans            12%          8%         12%         13%         10%
Zone normal velocity           362.126     440.499     374.597     354.049     360.196
Zone dma32 velocity            756.478     717.808     804.490     795.511     764.122
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         450.000     476.000     570.000     306.000     639.000
Page writes file                   105         138           0          71         224
Page writes anon                   345         338         570         235         415
Page reclaim immediate             660        4407         167         843        1553
Sector Reads                   2734844     2725576     2951744     2830472     2791216
Sector Writes                 11938520    11729108    11769760    11743120    11805320
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1596544     1520768     1767552     1774720     1555584
Direct inode steals               9764        6640       14010       15320        8315
Kswapd inode steals              47445       42888       49705       51043       43283
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     78          30          43          34          31
THP collapse alloc                 485         371         570         559         306
THP splits                           6           1           2           4           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                   13          16          11          12          16
Compaction stalls                 1067        1072        1629        1578        1140
Compaction success                 339         275         568         595         329
Compaction failures                728         797        1061         983         811
Page migrate success           1115929     1113188     3966997     4076178     4220010
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      2423867     2425024     8351264     8583856     8789144
Compaction migrate scanned    38956505    62526876   153906340   174085307   114170442
Compaction free scanned       83126040    51071610   396724121   358193857   389459415
Compaction cost                   1477        1639        5353        5612        5346
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

Observations:
- The "Success 3" line is allocation success rate with system idle (phases 1
  and 2 are with background interference). I used to get values around 85%
  with vanilla 3.11 and observed occasional drop to around 65% in 3.12, with
  about 50% chance. This was bisected to commit 81c0a2bb ("mm: page_alloc:
  fair zone allocator policy") using 10 repeats of the benchmark and marking
  as 'bad' a commit as long as the bad result appeared at least once (to fight
  the uncertainty). As explained in comment for patch 3, I don't think the
  commit is wrong, but that it makes the effect of bugs worse. From patch 3
  onwards, the results are OK. Here it might seem that patch 2 helps, but
  that's just the uncertainty. I plan to add support for more iterations and
  statistical summarizing of the results to fight this...
- It might seem that patch 5 is regressing phases 1 and 2, but since that was
  not the case when testing against 3.12, I would say it's just different
  case of unstable results. Phases 1 and 2 are more amenable to that in
  general. However, I never seen unpatched 3.11 or 3.12 go above 40% as
  the patch 3 does.
- Compaction cost and number of scanned pages is higher, especially due to
  patch 3. However, keep in mind that patches 2 and 3 fix existing bugs in the
  current design of overhead mitigation, they do not change it. If overhead is
  found unacceptable, then it should be decreased differently (and consistently,
  not due to random conditions) than the current implementation does. In
  contrast, patches 4 and 5 (which are not strictly bug fixes) do not
  increase the overhead (but also not success rates).

Another set of preliminary results is when configuring stress-highalloc to
allocate with similar flags as THP uses:
 (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

stress-highalloc
                         master                master                master                master                master
                          1-thp                 2-thp                 3-thp                 4-thp                 5-thp
Success 1       29.00 (  0.00%)        7.00 ( 75.86%)       25.00 ( 13.79%)       32.00 (-10.34%)       32.00 (-10.34%)
Success 2       30.00 (  0.00%)        7.00 ( 76.67%)       29.00 (  3.33%)       34.00 (-13.33%)       37.00 (-23.33%)
Success 3       70.00 (  0.00%)       70.00 (  0.00%)       85.00 (-21.43%)       85.00 (-21.43%)       85.00 (-21.43%)

              master      master      master      master      master
               1-thp       2-thp       3-thp       4-thp       5-thp
User         5915.36     6769.19     6350.04     6421.90     6571.80
System       1017.80     1053.70     1039.06     1051.84     1061.59
Elapsed      1757.87     1724.31     1744.66     1822.78     1841.42

                                master      master      master      master      master
                                 1-thp       2-thp       3-thp       4-thp       5-thp
Minor Faults                 246004967   248169249   244469991   248893104   245151725
Major Faults                       403         282         354         369         436
Swap Ins                             8           8          10           7           8
Swap Outs                          534         530         325         694         687
Direct pages scanned            106122       76339      168386      202576      170449
Kswapd pages scanned           1924013     1803706     1855293     1872408     1907170
Kswapd pages reclaimed         1920762     1800403     1852989     1869573     1904070
Direct pages reclaimed          105986       76291      168183      202440      170343
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity               1094.514    1046.045    1063.412    1027.227    1035.706
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                 60.370      44.272      96.515     111.136      92.564
Percentage direct scans             5%          4%          8%          9%          8%
Zone normal velocity           362.047     386.497     361.529     371.628     369.295
Zone dma32 velocity            792.836     703.820     798.398     766.734     758.975
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         741.000     751.000     325.000     694.000     924.000
Page writes file                   207         221           0           0         237
Page writes anon                   534         530         325         694         687
Page reclaim immediate             895         856         479         396         512
Sector Reads                   2769992     2627604     2735740     2828672     2836412
Sector Writes                 11748724    11660652    11598304    11800576    11753996
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1485952     1233024     1457280     1492096     1544320
Direct inode steals               2565         537        3384        6389        3205
Kswapd inode steals              50112       42207       46892       45371       49542
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     28           2          23          31          28
THP collapse alloc                 485         276         417         539         514
THP splits                           0           0           0           2           3
THP fault fallback                   0           0           0           0           0
THP collapse fail                   13          19          17          12          12
Compaction stalls                  813         474         964        1052        1050
Compaction success                 332          92         359         434         411
Compaction failures                481         382         605         617         639
Page migrate success            582816      359101      973579      950980     1085585
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      1327894      806679     2256066     2195431     2461078
Compaction migrate scanned    13244945     7977159    21513942    23189436    30051866
Compaction free scanned       35192520    19254827    76152850    71159488    77702117
Compaction cost                    722         443        1204        1191        1383
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

                      master      master      master      master      master
                       1-thp       2-thp       3-thp       4-thp       5-thp
Mean sda-avgqz         46.01       46.31       46.43       46.87       45.94
Mean sda-await        271.19      273.75      273.84      270.12      269.69
Mean sda-r_await       35.33       35.52       34.26       33.98       33.61
Mean sda-w_await      474.54      497.59      603.64      567.32      488.48
Max  sda-avgqz        158.33      168.62      166.68      165.51      165.82
Max  sda-await       1461.41     1374.49     1380.31     1427.35     1402.61
Max  sda-r_await      197.46      286.67      112.65      112.07      158.24
Max  sda-w_await     9986.97    11363.36    16119.59    12365.75    11706.65

There are some differences from the previous results for THP-like allocations:
 - Here, the bad result for unpatched kernel in phase 3 is much more consistent
   to be between 65-70% and not due to the "regression" in 3.12. Still there is
   the improvement from patch 3 onwards, which brings it on par with simple
   GFP_HIGHUSER_MOVABLE allocations.
 - Patch 2 is again not a regression but due to results variability.
 - The compaction overhead in patches 2 and 3 and arguments are similar as
   above.
 - Patch 5 increases the number of migrate-scanned pages significantly. This
   is most likely due to __GFP_NO_KSWAPD flag, which means the cached pfn's are
   not reset by kswapd, and the patch thus helps the sync-after-async
   compaction. It doesn't however show that the sync compaction would help with
   success rates. One of the further patches I'm considering for future
   versions is to ignore or clear pageblock skip information for sync
   compaction. But in that case, THP clearly should be changed so that it does
   not fallback to the sync compaction.

Vlastimil Babka (5):
  mm: compaction: encapsulate defer reset logic
  mm: compaction: reset cached scanner pfn's before reading them
  mm: compaction: detect when scanners meet in isolate_freepages
  mm: compaction: do not mark unmovable pageblocks as skipped in async
    compaction
  mm: compaction: reset scanner positions immediately when they meet

 include/linux/compaction.h | 12 +++++++++++
 mm/compaction.c            | 53 ++++++++++++++++++++++++++++++----------------
 mm/page_alloc.c            |  5 +----
 3 files changed, 48 insertions(+), 22 deletions(-)

-- 
1.8.1.4

^ permalink raw reply	[flat|nested] 20+ messages in thread