* [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
@ 2010-11-17 16:22 Mel Gorman
  2010-11-17 16:22 ` [PATCH 1/8] mm: compaction: Add trace events for memory compaction activity Mel Gorman
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

Huge page allocations are not expected to be cheap, but lumpy reclaim is
still very disruptive. While it is far better than reclaiming random
order-0 pages, it ignores the reference bits of the pages surrounding the
reference page selected from the LRU. Memory compaction was merged in 2.6.35
to reduce the reliance on lumpy reclaim by moving pages around instead of
reclaiming them when enough pages were already free. It has been tested fairly
heavily at this point. This is a prototype series that uses compaction more
aggressively.

When CONFIG_COMPACTION is set, lumpy reclaim is no longer used. Instead,
a number of order-0 pages are reclaimed and the zone is then compacted to
try to satisfy the allocation. This keeps a larger number of active
pages in memory at the cost of increased use of migration and compaction
scanning. With the full series applied, latencies when allocating huge pages
are significantly reduced. By the end of the series, hints are taken from
the LRU on where the best place to start migrating from might be.
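For readers who want the control flow at a glance, here is a rough C-style
sketch of the idea. It is illustrative only and uses made-up helper names
(reclaim_order0_pages, run_compaction, zone_watermark_met); the real wiring
is done in patch 3, where shrink_inactive_list() calls
reclaimcompact_zone_order().

	/*
	 * Sketch only, not kernel code. The helpers below are hypothetical
	 * stand-ins for the reclaim and compaction entry points.
	 */
	static void satisfy_high_order_allocation(struct zone *zone, int order)
	{
		do {
			/* Free a batch of order-0 pages, respecting page age */
			reclaim_order0_pages(zone, 1UL << order);

			/* Defragment by migrating pages instead of freeing more */
			run_compaction(zone, order);
		} while (!zone_watermark_met(zone, order));
	}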

Six kernels were tested:

lumpyreclaim-traceonly	This kernel is not using compaction but has the
			first patch related to tracepoints applied. It acts
			as a comparison point.

traceonly		This kernel is using compaction and has the
			tracepoints applied.

blindcompact		First three patches. A number of order-0 pages
			are reclaimed and then the zone is compacted. This
			replaces lumpy reclaim but lumpy reclaim is still
			available if CONFIG_COMPACTION is not set.

obeysync		First four patches. Migration will happen
			asynchronously if requested by the caller.
			This reduces the latency of compaction at times
			when the caller is not willing to call
			wait_on_page_writeback().

fastscan		First six patches applied. try_to_compact_pages()
			uses shortcuts in the faster compaction path to
			reduce latency.

compacthint		First seven patches applied. The migration scanner
			takes a hint from the LRU on where to start instead
			of always starting from the beginning of the zone.
			If the hint does not work, the full zone is still
			scanned.

The final patch is just a rename so it is not reported.  The target test was
a high-order allocation stress test. Testing was based on kernel 2.6.37-rc1
with commit d88c0922 applied which fixes an important bug related to page
reference counting. The test machine was x86-64 with 3G of RAM.

STRESS-HIGHALLOC
               lumpyreclaim
               traceonly-v2r21   traceonly	    blindcompact      obeysync          fastscan          compacthint
Pass 1          76.00 ( 0.00%)    91.00 (15.00%)    90.00 (14.00%)    86.00 (10.00%)    89.00 (13.00%)    88.00 (12.00%)
Pass 2          92.00 ( 0.00%)    92.00 ( 0.00%)    91.00 (-1.00%)    89.00 (-3.00%)    89.00 (-3.00%)    90.00 (-2.00%)
At Rest         95.00 ( 0.00%)    95.00 ( 0.00%)    96.00 ( 1.00%)    94.00 (-1.00%)    94.00 (-1.00%)    95.00 ( 0.00%)

As you'd expect, using compaction in any form improves the allocation
success rates. This is no surprise, although I know the results for ppc64
are a lot more dramatic. Otherwise, the series does not significantly
affect success rates - this is expected.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3339.94   3356.03   3301.15   3297.02   3277.88   3278.23
Total Elapsed Time (seconds)               2226.20   1962.12   2066.27   1573.86   1416.15   1474.68

Using compaction completes the test faster - no surprise there. Beyond that,
the series reduces the total time it takes to complete the test. The saving
from the vanilla kernel using compaction to the full series is over 8 minutes
(1962 seconds down to 1475 seconds), which is fairly significant. Typically
I'd expect the duration of the test to vary by up to 2 minutes, so 8 minutes
is well outside the noise.

FTrace Reclaim Statistics: vmscan
                                       lumpyreclaim
                                          traceonly traceonly blindcompact obeysync   fastscan compacthint
Direct reclaims                               1388        537        376        488        430        480 
Direct reclaim pages scanned                205098      74810     287899     364595     313537     419062 
Direct reclaim pages reclaimed              110395      47344     129716     153689     139506     164719 
Direct reclaim write file async I/O           5703       1463       3313       4425       5257       6658 
Direct reclaim write anon async I/O          42539       8631      17326      25676      12942      25786 
Direct reclaim write file sync I/O               0          0          0          0          0          0 
Direct reclaim write anon sync I/O             339         45          4          3          1          4 
Wake kswapd requests                           855        755        764        814        822        876 
Kswapd wakeups                                 523        573        381        308        328        280 
Kswapd pages scanned                       4231634    4268032    3804355    2907194    2593046    2430099 
Kswapd pages reclaimed                     2200266    2221518    2161870    1826345    1722521    1705105 
Kswapd reclaim write file async I/O          51070      52174      35718      32378      25862      25292 
Kswapd reclaim write anon async I/O         770924     667264     147534      73974      29785      25709 
Kswapd reclaim write file sync I/O               0          0          0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0          0          0 
Time stalled direct reclaim (seconds)      1035.70     113.12     190.79     292.82     111.68     165.71 
Time kswapd awake (seconds)                 885.31     772.61     786.08     484.38     339.97     405.29 

Total pages scanned                        4436732   4342842   4092254   3271789   2906583   2849161
Total pages reclaimed                      2310661   2268862   2291586   1980034   1862027   1869824
%age total pages scanned/reclaimed          52.08%    52.24%    56.00%    60.52%    64.06%    65.63%
%age total pages scanned/written            19.62%    16.80%     4.98%     4.17%     2.54%     2.93%
%age  file pages scanned/written             1.28%     1.24%     0.95%     1.12%     1.07%     1.12%
Percentage Time Spent Direct Reclaim        23.67%     3.26%     5.46%     8.16%     3.29%     4.81%
Percentage Time kswapd Awake                39.77%    39.38%    38.04%    30.78%    24.01%    27.48%

These are the reclaim statistics. Compaction reduces the time spent in
direct reclaim and the time kswapd is awake - no surprise there again. The
time spent in direct reclaim appears to increase once blindcompact and the
later patches are applied. This is because compaction now takes place within
reclaim, so there is more going on.

Overall though, the series reduces the time kswapd spends awake and, once
compaction is used within reclaim, the later patches in the series reduce
the time spent further. The series also significantly reduces the number of
pages scanned and reclaimed, reducing the level of disruption to the system.

FTrace Reclaim Statistics: compaction
                                      lumpyreclaim
                                         traceonly  traceonly blindcompact obeysync   fastscan compacthint
Migrate Pages Scanned                            0   71353874  238633502  264640773  261021041  206180024 
Migrate Pages Isolated                           0     269123     573527     675472     728335    1070987 
Free    Pages Scanned                            0   28821923   86306036  100851634  104049634  148208575 
Free    Pages Isolated                           0     344335     693444     908822     942124    1299588 
Migrated Pages                                   0     265478     565774     652310     707870    1048643 
Migration Failures                               0       3645       7753      23162      20465      22344 

These are some statistics on compaction activity. Obviously, with compaction
disabled, nothing happens. Using compaction from within reclaim drastically
increases the amount of compaction activity, which is expected - it is offset
by the reduced number of pages that get reclaimed, but there is room for
improvement in how compaction is implemented. The most interesting part of
this result is that "compacthint", by initialising the compaction migration
scanner from the LRU, drastically reduces the number of pages scanned for
migration even though the impact on latencies is not obvious.

Judging from the raw figures here, it's tricky to tell if things are really
better or not as they are aggregate figures for the duration of the test. This
brings me to the average latencies.

X86-64
http://www.csn.ul.ie/~mel/postings/memorycompact-20101117/highalloc-interlatency-hydra-mean.ps
http://www.csn.ul.ie/~mel/postings/memorycompact-20101117/highalloc-interlatency-hydra-stddev.ps

The mean latencies are pushed *way* down, implying that the amount of work
required to allocate each huge page is drastically reduced. As one would
expect, lumpy reclaim has terrible latencies but using compaction pushes
them down. Always using compaction (blindcompact) pushes them further down
and "obeysync" drops them close to the absolute minimum latency that can be
achieved. "fastscan" and "compacthint" slightly improve the allocation
success rates while reducing the amount of work performed by the kernel.

For completeness, here are the graphs for a similar test on PPC64. I won't
go into the raw figures because the conclusions are more or less the same.

PPC64
http://www.csn.ul.ie/~mel/postings/memorycompact-20101117/highalloc-interlatency-powyah-mean.ps
http://www.csn.ul.ie/~mel/postings/memorycompact-20101117/highalloc-interlatency-powyah-stddev.ps

PPC64 has to work a lot harder (16M huge pages instead of 2M). The
success rates without compaction are pretty dire due to the large delay
incurred by lumpy reclaim, but with compaction the success rates are all
comparable. As on X86-64, the latencies are pushed way down. They are
above the ideal performance but are still drastically improved.

I haven't pushed hard on the concept of lumpy compaction yet and right
now I don't intend to during this cycle. The initial prototypes did not
behave as well as expected and this series improves the current situation
a lot without introducing new algorithms. Hence, I'd like this series to
be considered for merging. I'm hoping that this series also removes the
necessity for the "delete lumpy reclaim" patch from the THP tree.

 include/linux/compaction.h        |    9 ++-
 include/linux/kernel.h            |    7 ++
 include/linux/migrate.h           |   12 ++-
 include/linux/mmzone.h            |    2 +
 include/trace/events/compaction.h |   74 ++++++++++++++++
 include/trace/events/vmscan.h     |    6 +-
 mm/compaction.c                   |  171 ++++++++++++++++++++++++++++---------
 mm/memory-failure.c               |    3 +-
 mm/memory_hotplug.c               |    3 +-
 mm/mempolicy.c                    |    6 +-
 mm/migrate.c                      |   24 +++--
 mm/vmscan.c                       |   90 ++++++++++++-------
 12 files changed, 313 insertions(+), 94 deletions(-)
 create mode 100644 include/trace/events/compaction.h



* [PATCH 1/8] mm: compaction: Add trace events for memory compaction activity
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-17 16:22 ` [PATCH 2/8] mm: vmscan: Convert lumpy_mode into a bitmask Mel Gorman
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

In preparation for patches promoting the use of memory compaction over lumpy
reclaim, this patch adds trace points for memory compaction activity. Using
them, we can monitor the scanning activity of the migration and free page
scanners as well as the number and success rates of pages passed to page
migration.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/compaction.h |   74 +++++++++++++++++++++++++++++++++++++
 mm/compaction.c                   |   14 ++++++-
 2 files changed, 87 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/compaction.h

diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
new file mode 100644
index 0000000..388bcdd
--- /dev/null
+++ b/include/trace/events/compaction.h
@@ -0,0 +1,74 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM compaction
+
+#if !defined(_TRACE_COMPACTION_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_COMPACTION_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+DECLARE_EVENT_CLASS(mm_compaction_isolate_template,
+
+	TP_PROTO(unsigned long nr_scanned,
+		unsigned long nr_taken),
+
+	TP_ARGS(nr_scanned, nr_taken),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+	),
+
+	TP_fast_assign(
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+	),
+
+	TP_printk("nr_scanned=%lu nr_taken=%lu",
+		__entry->nr_scanned,
+		__entry->nr_taken)
+);
+
+DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_isolate_migratepages,
+
+	TP_PROTO(unsigned long nr_scanned,
+		unsigned long nr_taken),
+
+	TP_ARGS(nr_scanned, nr_taken)
+);
+
+DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_isolate_freepages,
+	TP_PROTO(unsigned long nr_scanned,
+		unsigned long nr_taken),
+
+	TP_ARGS(nr_scanned, nr_taken)
+);
+
+TRACE_EVENT(mm_compaction_migratepages,
+
+	TP_PROTO(unsigned long nr_migrated,
+		unsigned long nr_failed),
+
+	TP_ARGS(nr_migrated, nr_failed),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, nr_migrated)
+		__field(unsigned long, nr_failed)
+	),
+
+	TP_fast_assign(
+		__entry->nr_migrated = nr_migrated;
+		__entry->nr_failed = nr_failed;
+	),
+
+	TP_printk("nr_migrated=%lu nr_failed=%lu",
+		__entry->nr_migrated,
+		__entry->nr_failed)
+);
+
+
+#endif /* _TRACE_COMPACTION_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d709ee..bc8eb8a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -16,6 +16,9 @@
 #include <linux/sysfs.h>
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/compaction.h>
+
 /*
  * compact_control is used to track pages being migrated and the free pages
  * they are being migrated to during memory compaction. The free_pfn starts
@@ -60,7 +63,7 @@ static unsigned long isolate_freepages_block(struct zone *zone,
 				struct list_head *freelist)
 {
 	unsigned long zone_end_pfn, end_pfn;
-	int total_isolated = 0;
+	int nr_scanned = 0, total_isolated = 0;
 	struct page *cursor;
 
 	/* Get the last PFN we should scan for free pages at */
@@ -81,6 +84,7 @@ static unsigned long isolate_freepages_block(struct zone *zone,
 
 		if (!pfn_valid_within(blockpfn))
 			continue;
+		nr_scanned++;
 
 		if (!PageBuddy(page))
 			continue;
@@ -100,6 +104,7 @@ static unsigned long isolate_freepages_block(struct zone *zone,
 		}
 	}
 
+	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
 	return total_isolated;
 }
 
@@ -234,6 +239,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
 					struct compact_control *cc)
 {
 	unsigned long low_pfn, end_pfn;
+	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 
 	/* Do not scan outside zone boundaries */
@@ -266,6 +272,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		struct page *page;
 		if (!pfn_valid_within(low_pfn))
 			continue;
+		nr_scanned++;
 
 		/* Get the page and skip if free */
 		page = pfn_to_page(low_pfn);
@@ -281,6 +288,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		list_add(&page->lru, migratelist);
 		mem_cgroup_del_lru(page);
 		cc->nr_migratepages++;
+		nr_isolated++;
 
 		/* Avoid isolating too much */
 		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
@@ -292,6 +300,8 @@ static unsigned long isolate_migratepages(struct zone *zone,
 	spin_unlock_irq(&zone->lru_lock);
 	cc->migrate_pfn = low_pfn;
 
+	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
+
 	return cc->nr_migratepages;
 }
 
@@ -402,6 +412,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
 		if (nr_remaining)
 			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+		trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
+						nr_remaining);
 
 		/* Release LRU pages not migrated */
 		if (!list_empty(&cc->migratepages)) {
-- 
1.7.1



* [PATCH 2/8] mm: vmscan: Convert lumpy_mode into a bitmask
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
  2010-11-17 16:22 ` [PATCH 1/8] mm: compaction: Add trace events for memory compaction activity Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-17 16:22 ` [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim Mel Gorman
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

Currently lumpy_mode is an enum and determines if lumpy reclaim is off,
synchronous or asynchronous. In preparation for using compaction instead of
lumpy reclaim, this patch converts the flags into a bitmask.
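As an aside, the point of the conversion is that the reclaim style and the
sync/async decision become independent bits that can be combined and tested
separately. The stand-alone snippet below is illustrative only - it uses a
plain unsigned int instead of the kernel's __bitwise type and is not part of
the patch:

	#include <stdio.h>

	typedef unsigned int lumpy_mode;
	#define LUMPY_MODE_SINGLE		0x01u
	#define LUMPY_MODE_ASYNC		0x02u
	#define LUMPY_MODE_SYNC			0x04u
	#define LUMPY_MODE_CONTIGRECLAIM	0x08u

	int main(void)
	{
		/* High-order allocation under pressure: lumpy + may block */
		lumpy_mode mode = LUMPY_MODE_CONTIGRECLAIM | LUMPY_MODE_SYNC;

		if (mode & LUMPY_MODE_CONTIGRECLAIM)
			printf("reclaim a naturally aligned range of pages\n");
		if (mode & LUMPY_MODE_SYNC)
			printf("may call wait_on_page_writeback()\n");
		return 0;
	}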

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |    6 ++--
 mm/vmscan.c                   |   46 +++++++++++++++++++++++++----------------
 2 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c255fcc..be76429 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -25,13 +25,13 @@
 
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
-	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
+	(sync & LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
 #define trace_shrink_flags(file, sync) ( \
-	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
+	(sync & LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
 			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
-	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	(sync & LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
 	)
 
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b8a6fdc..37d4f0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,11 +51,20 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
-enum lumpy_mode {
-	LUMPY_MODE_NONE,
-	LUMPY_MODE_ASYNC,
-	LUMPY_MODE_SYNC,
-};
+/*
+ * lumpy_mode determines how the inactive list is shrunk
+ * LUMPY_MODE_SINGLE: Reclaim only order-0 pages
+ * LUMPY_MODE_ASYNC:  Do not block
+ * LUMPY_MODE_SYNC:   Allow blocking e.g. call wait_on_page_writeback
+ * LUMPY_MODE_CONTIGRECLAIM: For high-order allocations, take a reference
+ *			page from the LRU and reclaim all pages within a
+ *			naturally aligned range
+ */
+typedef unsigned __bitwise__ lumpy_mode;
+#define LUMPY_MODE_SINGLE		((__force lumpy_mode)0x01u)
+#define LUMPY_MODE_ASYNC		((__force lumpy_mode)0x02u)
+#define LUMPY_MODE_SYNC			((__force lumpy_mode)0x04u)
+#define LUMPY_MODE_CONTIGRECLAIM	((__force lumpy_mode)0x08u)
 
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
@@ -88,7 +97,7 @@ struct scan_control {
 	 * Intend to reclaim enough continuous memory rather than reclaim
 	 * enough amount of memory. i.e, mode for high order allocation.
 	 */
-	enum lumpy_mode lumpy_reclaim_mode;
+	lumpy_mode lumpy_reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -274,13 +283,13 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
 				   bool sync)
 {
-	enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+	lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
 
 	/*
 	 * Some reclaim have alredy been failed. No worth to try synchronous
 	 * lumpy reclaim.
 	 */
-	if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+	if (sync && sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE)
 		return;
 
 	/*
@@ -288,17 +297,18 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
 	 * trouble getting a small set of contiguous pages, we
 	 * will reclaim both active and inactive pages.
 	 */
+	sc->lumpy_reclaim_mode = LUMPY_MODE_CONTIGRECLAIM;
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		sc->lumpy_reclaim_mode = mode;
+		sc->lumpy_reclaim_mode |= mode;
 	else if (sc->order && priority < DEF_PRIORITY - 2)
-		sc->lumpy_reclaim_mode = mode;
+		sc->lumpy_reclaim_mode |= mode;
 	else
-		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+		sc->lumpy_reclaim_mode = LUMPY_MODE_SINGLE | LUMPY_MODE_ASYNC;
 }
 
 static void disable_lumpy_reclaim_mode(struct scan_control *sc)
 {
-	sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+	sc->lumpy_reclaim_mode = LUMPY_MODE_SINGLE | LUMPY_MODE_ASYNC;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
@@ -429,7 +439,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		 * first attempt to free a range of pages fails.
 		 */
 		if (PageWriteback(page) &&
-		    sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
+		    (sc->lumpy_reclaim_mode & LUMPY_MODE_SYNC))
 			wait_on_page_writeback(page);
 
 		if (!PageWriteback(page)) {
@@ -615,7 +625,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
+	if (sc->lumpy_reclaim_mode & LUMPY_MODE_CONTIGRECLAIM)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -732,7 +742,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
+			if ((sc->lumpy_reclaim_mode & LUMPY_MODE_SYNC) &&
 			    may_enter_fs)
 				wait_on_page_writeback(page);
 			else {
@@ -1317,7 +1327,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 		return false;
 
 	/* Only stall on lumpy reclaim */
-	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+	if (sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE)
 		return false;
 
 	/* If we have relaimed everything on the isolated list, no stall */
@@ -1368,7 +1378,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+			sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE ?
 					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, 0, file);
 		zone->pages_scanned += nr_scanned;
@@ -1381,7 +1391,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+			sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE ?
 					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, sc->mem_cgroup,
 			0, file);
-- 
1.7.1



* [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
  2010-11-17 16:22 ` [PATCH 1/8] mm: compaction: Add trace events for memory compaction activity Mel Gorman
  2010-11-17 16:22 ` [PATCH 2/8] mm: vmscan: Convert lumpy_mode into a bitmask Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-18 18:09   ` Andrea Arcangeli
  2010-11-17 16:22 ` [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path Mel Gorman
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

Lumpy reclaim is disruptive. It reclaims a large number of pages and ignores
the age of the pages it reclaims. This can incur significant stalls and
potentially increase the number of major faults.

Compaction has reached the point where it is considered reasonably stable
(meaning it has passed a lot of testing) and is a potential candidate for
displacing lumpy reclaim. This patch uses memory compaction where available
and lumpy reclaim otherwise. The basic operation is very simple - instead
of selecting a contiguous range of pages to reclaim, a number of order-0
pages are reclaimed and then compaction is called for the zone.  If the
watermarks are not met, another reclaim+compaction cycle occurs.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/compaction.h |    9 ++++-
 include/linux/kernel.h     |    7 +++
 mm/compaction.c            |   96 +++++++++++++++++++++++++++++---------------
 mm/vmscan.c                |   40 +++++++++++++-----
 4 files changed, 106 insertions(+), 46 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 5ac5155..9ebbc12 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
-
+extern unsigned long reclaimcompact_zone_order(struct zone *zone,
+			int order, gfp_t gfp_mask);
 /* Do not skip compaction more than 64 times */
 #define COMPACT_MAX_DEFER_SHIFT 6
 
@@ -59,6 +60,12 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_CONTINUE;
 }
 
+static inline unsigned long reclaimcompact_zone_order(struct zone *zone,
+			int order, gfp_t gfp_mask)
+{
+	return 0;
+}
+
 static inline void defer_compaction(struct zone *zone)
 {
 }
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 450092c..c00c5d1 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -826,6 +826,13 @@ struct sysinfo {
 #define NUMA_BUILD 0
 #endif
 
+/* This helps us avoid #ifdef CONFIG_COMPACTION */
+#ifdef CONFIG_COMPACTION
+#define COMPACTION_BUILD 1
+#else
+#define COMPACTION_BUILD 0
+#endif
+
 /* Rebuild everything on CONFIG_FTRACE_MCOUNT_RECORD */
 #ifdef CONFIG_FTRACE_MCOUNT_RECORD
 # define REBUILD_DUE_TO_FTRACE_MCOUNT_RECORD
diff --git a/mm/compaction.c b/mm/compaction.c
index bc8eb8a..3c37c52 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -385,10 +385,55 @@ static int compact_finished(struct zone *zone,
 	return COMPACT_CONTINUE;
 }
 
+static unsigned long compaction_suitable(struct zone *zone, int order)
+{
+	int fragindex;
+	unsigned long watermark;
+
+	/*
+	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
+	 * This is because during migration, copies of pages need to be
+	 * allocated and for a short time, the footprint is higher
+	 */
+	watermark = low_wmark_pages(zone) + (2UL << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return COMPACT_SKIPPED;
+
+	/*
+	 * fragmentation index determines if allocation failures are due to
+	 * low memory or external fragmentation
+	 *
+	 * index of -1 implies allocations might succeed depending on watermarks
+	 * index towards 0 implies failure is due to lack of memory
+	 * index towards 1000 implies failure is due to fragmentation
+	 *
+	 * Only compact if a failure would be due to fragmentation.
+	 */
+	fragindex = fragmentation_index(zone, order);
+	if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
+		return COMPACT_SKIPPED;
+
+	if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0))
+		return COMPACT_PARTIAL;
+
+	return COMPACT_CONTINUE;
+}
+
 static int compact_zone(struct zone *zone, struct compact_control *cc)
 {
 	int ret;
 
+	ret = compaction_suitable(zone, cc->order);
+	switch (ret) {
+	case COMPACT_PARTIAL:
+	case COMPACT_SKIPPED:
+		/* Compaction is likely to fail */
+		return ret;
+	case COMPACT_CONTINUE:
+		/* Fall through to compaction */
+		;
+	}
+
 	/* Setup to move all movable pages to the end of the zone */
 	cc->migrate_pfn = zone->zone_start_pfn;
 	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
@@ -446,6 +491,22 @@ static unsigned long compact_zone_order(struct zone *zone,
 	return compact_zone(zone, &cc);
 }
 
+unsigned long reclaimcompact_zone_order(struct zone *zone,
+						int order, gfp_t gfp_mask)
+{
+	struct compact_control cc = {
+		.nr_freepages = 0,
+		.nr_migratepages = 0,
+		.order = order,
+		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.zone = zone,
+	};
+	INIT_LIST_HEAD(&cc.freepages);
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	return compact_zone(zone, &cc);
+}
+
 int sysctl_extfrag_threshold = 500;
 
 /**
@@ -463,7 +524,6 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
-	unsigned long watermark;
 	struct zoneref *z;
 	struct zone *zone;
 	int rc = COMPACT_SKIPPED;
@@ -481,43 +541,13 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	/* Compact each zone in the list */
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
 								nodemask) {
-		int fragindex;
 		int status;
 
-		/*
-		 * Watermarks for order-0 must be met for compaction. Note
-		 * the 2UL. This is because during migration, copies of
-		 * pages need to be allocated and for a short time, the
-		 * footprint is higher
-		 */
-		watermark = low_wmark_pages(zone) + (2UL << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
-			continue;
-
-		/*
-		 * fragmentation index determines if allocation failures are
-		 * due to low memory or external fragmentation
-		 *
-		 * index of -1 implies allocations might succeed depending
-		 * 	on watermarks
-		 * index towards 0 implies failure is due to lack of memory
-		 * index towards 1000 implies failure is due to fragmentation
-		 *
-		 * Only compact if a failure would be due to fragmentation.
-		 */
-		fragindex = fragmentation_index(zone, order);
-		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
-			continue;
-
-		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
-			rc = COMPACT_PARTIAL;
-			break;
-		}
-
 		status = compact_zone_order(zone, order, gfp_mask);
 		rc = max(status, rc);
 
-		if (zone_watermark_ok(zone, order, watermark, 0, 0))
+		/* If a normal allocation would succeed, stop compacting */
+		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
 			break;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37d4f0e..ca108ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -32,6 +32,7 @@
 #include <linux/topology.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
+#include <linux/compaction.h>
 #include <linux/notifier.h>
 #include <linux/rwsem.h>
 #include <linux/delay.h>
@@ -59,12 +60,15 @@
  * LUMPY_MODE_CONTIGRECLAIM: For high-order allocations, take a reference
  *			page from the LRU and reclaim all pages within a
  *			naturally aligned range
+ * LUMPY_MODE_COMPACTION: For high-order allocations, reclaim a number of
+ *			order-0 pages and then compact the zone
  */
 typedef unsigned __bitwise__ lumpy_mode;
 #define LUMPY_MODE_SINGLE		((__force lumpy_mode)0x01u)
 #define LUMPY_MODE_ASYNC		((__force lumpy_mode)0x02u)
 #define LUMPY_MODE_SYNC			((__force lumpy_mode)0x04u)
 #define LUMPY_MODE_CONTIGRECLAIM	((__force lumpy_mode)0x08u)
+#define LUMPY_MODE_COMPACTION		((__force lumpy_mode)0x10u)
 
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
@@ -283,25 +287,27 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
 				   bool sync)
 {
-	lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+	lumpy_mode syncmode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
 
 	/*
-	 * Some reclaim have alredy been failed. No worth to try synchronous
-	 * lumpy reclaim.
+	 * Initially assume we are entering either lumpy reclaim or lumpy
+	 * compaction. Depending on the order, we will either set the sync
+	 * mode or just reclaim order-0 pages later.
 	 */
-	if (sync && sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE)
-		return;
+	if (COMPACTION_BUILD)
+		sc->lumpy_reclaim_mode = LUMPY_MODE_COMPACTION;
+	else
+		sc->lumpy_reclaim_mode = LUMPY_MODE_CONTIGRECLAIM;
 
 	/*
 	 * If we need a large contiguous chunk of memory, or have
 	 * trouble getting a small set of contiguous pages, we
 	 * will reclaim both active and inactive pages.
 	 */
-	sc->lumpy_reclaim_mode = LUMPY_MODE_CONTIGRECLAIM;
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		sc->lumpy_reclaim_mode |= mode;
+		sc->lumpy_reclaim_mode |= syncmode;
 	else if (sc->order && priority < DEF_PRIORITY - 2)
-		sc->lumpy_reclaim_mode |= mode;
+		sc->lumpy_reclaim_mode |= syncmode;
 	else
 		sc->lumpy_reclaim_mode = LUMPY_MODE_SINGLE | LUMPY_MODE_ASYNC;
 }
@@ -1375,11 +1381,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
+	/*
+	 * If we are lumpy compacting, we bump nr_to_scan to at least
+	 * the size of the page we are trying to allocate
+	 */
+	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
+		nr_to_scan = max(nr_to_scan, (1UL << sc->order));
+
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE ?
-					ISOLATE_INACTIVE : ISOLATE_BOTH,
+			sc->lumpy_reclaim_mode & LUMPY_MODE_CONTIGRECLAIM ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, 0, file);
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
@@ -1391,8 +1404,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE ?
-					ISOLATE_INACTIVE : ISOLATE_BOTH,
+			sc->lumpy_reclaim_mode & LUMPY_MODE_CONTIGRECLAIM ?
+					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
 		/*
@@ -1425,6 +1438,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
+	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
+		reclaimcompact_zone_order(zone, sc->order, sc->gfp_mask);
+
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
-- 
1.7.1



* [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
                   ` (2 preceding siblings ...)
  2010-11-17 16:22 ` [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-18 18:21   ` Andrea Arcangeli
  2010-11-17 16:22 ` [PATCH 5/8] mm: migration: Cleanup migrate_pages API by matching types for offlining and sync Mel Gorman
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

Migration synchronously waits for writeback if the initial passes fail.
try_to_compact_pages() does not want this behaviour. It's in a faster
allocation path where no pages have been freed yet. If compaction does not
succeed quickly, synchronous migration is not going to help and unnecessarily
delays a process.

This patch adds a sync parameter to migrate_pages() allowing the caller
to indicate whether wait_on_page_writeback() is allowed within migration
or not. Only try_to_compact_pages() uses asynchronous migration; direct
compaction within the direct reclaim path uses the synchronous version.
All other callers use synchronous migration to preserve existing
behaviour.

In tests, this reduces latency when allocating huge pages as the faster
path avoids stalls and postpones synchronous migration until pages have
been freed.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/migrate.h |   12 ++++++++----
 mm/compaction.c         |    6 +++++-
 mm/memory-failure.c     |    3 ++-
 mm/memory_hotplug.c     |    3 ++-
 mm/mempolicy.c          |    4 ++--
 mm/migrate.c            |   24 ++++++++++++++----------
 6 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 085527f..fa31902 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,9 +13,11 @@ extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x,
-			unsigned long private, int offlining);
+			unsigned long private, int offlining,
+			bool sync);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
-			unsigned long private, int offlining);
+			unsigned long private, int offlining,
+			bool sync);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -33,9 +35,11 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
-		unsigned long private, int offlining) { return -ENOSYS; }
+		unsigned long private, int offlining,
+		bool sync) { return -ENOSYS; }
 static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
-		unsigned long private, int offlining) { return -ENOSYS; }
+		unsigned long private, int offlining,
+		bool sync) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 3c37c52..b8e27cc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -33,6 +33,7 @@ struct compact_control {
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+	bool sync;			/* Synchronous migration */
 
 	/* Account for isolated anon and file pages */
 	unsigned long nr_anon;
@@ -449,7 +450,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 		nr_migrate = cc->nr_migratepages;
 		migrate_pages(&cc->migratepages, compaction_alloc,
-						(unsigned long)cc, 0);
+				(unsigned long)cc, 0,
+				cc->sync);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
@@ -484,6 +486,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.order = order,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
+		.sync = false,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -500,6 +503,7 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
 		.order = order,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
+		.sync = true,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1243241..ebc2a1b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1413,7 +1413,8 @@ int soft_offline_page(struct page *page, int flags)
 		LIST_HEAD(pagelist);
 
 		list_add(&page->lru, &pagelist);
-		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL, 0);
+		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
+									0, true);
 		if (ret) {
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
 				pfn, ret, page->flags);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9260314..221178b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -716,7 +716,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			goto out;
 		}
 		/* this function returns # of failed pages */
-		ret = migrate_pages(&source, hotremove_migrate_alloc, 0, 1);
+		ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
+								1, true);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4a57f13..8b1a490 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -935,7 +935,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 		return PTR_ERR(vma);
 
 	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_node_page, dest, 0);
+		err = migrate_pages(&pagelist, new_node_page, dest, 0, true);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1155,7 +1155,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
-						(unsigned long)vma, 0);
+						(unsigned long)vma, 0, true);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index fe5a3c6..ea684ab 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -612,7 +612,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, int offlining)
+			struct page *page, int force, int offlining, bool sync)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -635,7 +635,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	rc = -EAGAIN;
 
 	if (!trylock_page(page)) {
-		if (!force)
+		if (!force || !sync)
 			goto move_newpage;
 		lock_page(page);
 	}
@@ -663,7 +663,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		if (!force)
+		if (!force || !sync)
 			goto uncharge;
 		wait_on_page_writeback(page);
 	}
@@ -808,7 +808,7 @@ move_newpage:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, int offlining)
+				int force, int offlining, bool sync)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -822,7 +822,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	rc = -EAGAIN;
 
 	if (!trylock_page(hpage)) {
-		if (!force)
+		if (!force || !sync)
 			goto out;
 		lock_page(hpage);
 	}
@@ -890,7 +890,8 @@ out:
  * Return: Number of pages not migrated or error code.
  */
 int migrate_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, int offlining)
+		new_page_t get_new_page, unsigned long private, int offlining,
+		bool sync)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -910,7 +911,8 @@ int migrate_pages(struct list_head *from,
 			cond_resched();
 
 			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2, offlining);
+						page, pass > 2, offlining,
+						sync);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -939,7 +941,8 @@ out:
 }
 
 int migrate_huge_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, int offlining)
+		new_page_t get_new_page, unsigned long private, int offlining,
+		bool sync)
 {
 	int retry = 1;
 	int nr_failed = 0;
@@ -955,7 +958,8 @@ int migrate_huge_pages(struct list_head *from,
 			cond_resched();
 
 			rc = unmap_and_move_huge_page(get_new_page,
-					private, page, pass > 2, offlining);
+					private, page, pass > 2, offlining,
+					sync);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1088,7 +1092,7 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0);
+				(unsigned long)pm, 0, true);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
-- 
1.7.1



* [PATCH 5/8] mm: migration: Cleanup migrate_pages API by matching types for offlining and sync
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
                   ` (3 preceding siblings ...)
  2010-11-17 16:22 ` [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-17 16:22 ` [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages() Mel Gorman
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

With the introduction of the boolean sync parameter, the API looks a
little inconsistent as offlining is still an int. Convert offlining to a
bool for the sake of being tidy.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/migrate.h |    8 ++++----
 mm/compaction.c         |    2 +-
 mm/mempolicy.c          |    6 ++++--
 mm/migrate.c            |    8 ++++----
 4 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index fa31902..e39aeec 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,10 +13,10 @@ extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x,
-			unsigned long private, int offlining,
+			unsigned long private, bool offlining,
 			bool sync);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
-			unsigned long private, int offlining,
+			unsigned long private, bool offlining,
 			bool sync);
 
 extern int fail_migrate_page(struct address_space *,
@@ -35,10 +35,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
-		unsigned long private, int offlining,
+		unsigned long private, bool offlining,
 		bool sync) { return -ENOSYS; }
 static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
-		unsigned long private, int offlining,
+		unsigned long private, bool offlining,
 		bool sync) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index b8e27cc..75d46d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -450,7 +450,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 		nr_migrate = cc->nr_migratepages;
 		migrate_pages(&cc->migratepages, compaction_alloc,
-				(unsigned long)cc, 0,
+				(unsigned long)cc, false,
 				cc->sync);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8b1a490..9beb008 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -935,7 +935,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 		return PTR_ERR(vma);
 
 	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_node_page, dest, 0, true);
+		err = migrate_pages(&pagelist, new_node_page, dest,
+								false, true);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1155,7 +1156,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
-						(unsigned long)vma, 0, true);
+						(unsigned long)vma,
+						false, true);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index ea684ab..c30c847 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -612,7 +612,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, int offlining, bool sync)
+			struct page *page, int force, bool offlining, bool sync)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -808,7 +808,7 @@ move_newpage:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, int offlining, bool sync)
+				int force, bool offlining, bool sync)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -890,7 +890,7 @@ out:
  * Return: Number of pages not migrated or error code.
  */
 int migrate_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, int offlining,
+		new_page_t get_new_page, unsigned long private, bool offlining,
 		bool sync)
 {
 	int retry = 1;
@@ -941,7 +941,7 @@ out:
 }
 
 int migrate_huge_pages(struct list_head *from,
-		new_page_t get_new_page, unsigned long private, int offlining,
+		new_page_t get_new_page, unsigned long private, bool offlining,
 		bool sync)
 {
 	int retry = 1;
-- 
1.7.1



* [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages()
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
                   ` (4 preceding siblings ...)
  2010-11-17 16:22 ` [PATCH 5/8] mm: migration: Cleanup migrate_pages API by matching types for offlining and sync Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-18 18:34   ` Andrea Arcangeli
  2010-11-17 16:22 ` [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start Mel Gorman
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

try_to_compact_pages() is the faster compaction option available to the
allocator. It is optimistically called before direct reclaim is entered.
As there is a higher chance try_to_compact_pages() will fail than direct
reclaim, it's important to complete the work as quickly as possible to
minimise stalls.

This patch introduces a migrate_fast_scan flag to memory compaction. When set
by try_to_compact_pages(), only MIGRATE_MOVABLE pageblocks are considered
as migration candidates and migration is asynchronous. This reduces stalls
when allocating huge pages while not impairing allocation success rates as
the direct reclaim path will perform the full compaction.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/compaction.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 75d46d8..686db84 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -33,7 +33,10 @@ struct compact_control {
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
-	bool sync;			/* Synchronous migration */
+	bool migrate_fast_scan;		/* If true, only MIGRATE_MOVABLE blocks
+					 * are scanned for pages to migrate and
+					 * migration is asynchronous
+					 */
 
 	/* Account for isolated anon and file pages */
 	unsigned long nr_anon;
@@ -240,6 +243,7 @@ static unsigned long isolate_migratepages(struct zone *zone,
 					struct compact_control *cc)
 {
 	unsigned long low_pfn, end_pfn;
+	unsigned long last_pageblock_nr = 0, pageblock_nr;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 
@@ -280,6 +284,17 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		if (PageBuddy(page))
 			continue;
 
+		/* When fast scanning, only scan in MOVABLE blocks */
+		pageblock_nr = low_pfn >> pageblock_order;
+		if (cc->migrate_fast_scan &&
+				last_pageblock_nr != pageblock_nr &&
+				get_pageblock_migratetype(page) != MIGRATE_MOVABLE) {
+			low_pfn += pageblock_nr_pages;
+			low_pfn = ALIGN(low_pfn, pageblock_nr_pages) - 1;
+			last_pageblock_nr = pageblock_nr;
+			continue;
+		}
+
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
 			continue;
@@ -451,7 +466,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync);
+				cc->migrate_fast_scan ? false : true);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
@@ -485,8 +500,8 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.nr_migratepages = 0,
 		.order = order,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.migrate_fast_scan = true,
 		.zone = zone,
-		.sync = false,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -502,8 +517,8 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
 		.nr_migratepages = 0,
 		.order = order,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.migrate_fast_scan = false,
 		.zone = zone,
-		.sync = true,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
-- 
1.7.1



* [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
                   ` (5 preceding siblings ...)
  2010-11-17 16:22 ` [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages() Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-18  9:10   ` KAMEZAWA Hiroyuki
  2010-11-18 18:46   ` Andrea Arcangeli
  2010-11-17 16:22 ` [PATCH 8/8] mm: vmscan: Rename lumpy_mode to reclaim_mode Mel Gorman
  2010-11-17 23:46 ` [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Andrew Morton
  8 siblings, 2 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

The end of the LRU stores the oldest known page. Compaction, on the other
hand, always starts scanning from the beginning of the zone. This patch uses
the LRU to give compaction a hint on where it should start scanning. This
means that compaction will at least start with some old pages, reducing
the impact on running processes and reducing the amount of scanning. The
check it makes is racy as the LRU lock is not taken, but it should be
harmless as the lists are not manipulated without the lock.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |   42 +++++++++++++++++++++++++++++++++++++-----
 mm/vmscan.c            |    2 --
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..2b7e237 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,8 @@ enum lru_list {
 
 #define for_each_evictable_lru(l) for (l = 0; l <= LRU_ACTIVE_FILE; l++)
 
+#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
+
 static inline int is_file_lru(enum lru_list l)
 {
 	return (l == LRU_INACTIVE_FILE || l == LRU_ACTIVE_FILE);
diff --git a/mm/compaction.c b/mm/compaction.c
index 686db84..03bd878 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -37,6 +37,9 @@ struct compact_control {
 					 * are scanned for pages to migrate and
 					 * migration is asynchronous
 					 */
+	unsigned long abort_migrate_pfn;/* Finish compaction when the migration
+					 * scanner reaches this PFN
+					 */
 
 	/* Account for isolated anon and file pages */
 	unsigned long nr_anon;
@@ -380,6 +383,10 @@ static int compact_finished(struct zone *zone,
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
+	/* Compaction run completes if migration reaches abort_migrate_pfn */
+	if (cc->abort_migrate_pfn && cc->migrate_pfn >= cc->abort_migrate_pfn)
+		return COMPACT_COMPLETE;
+
 	/* Compaction run is not finished if the watermark is not met */
 	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
 		return COMPACT_CONTINUE;
@@ -450,10 +457,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		;
 	}
 
-	/* Setup to move all movable pages to the end of the zone */
-	cc->migrate_pfn = zone->zone_start_pfn;
-	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
-	cc->free_pfn &= ~(pageblock_nr_pages-1);
+	/*
+	 * Setup to move all movable pages to the end of the zone. If the
+	 * caller does not specify starting points for the scanners,
+	 * initialise them
+	 */
+	if (!cc->migrate_pfn)
+		cc->migrate_pfn = zone->zone_start_pfn;
+	if (!cc->free_pfn) {
+		cc->free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+		cc->free_pfn &= ~(pageblock_nr_pages-1);
+	}
 
 	migrate_prep_local();
 
@@ -512,6 +526,8 @@ static unsigned long compact_zone_order(struct zone *zone,
 unsigned long reclaimcompact_zone_order(struct zone *zone,
 						int order, gfp_t gfp_mask)
 {
+	unsigned long start_migrate_pfn, ret;
+	struct page *anon_page, *file_page;
 	struct compact_control cc = {
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
@@ -523,7 +539,23 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
-	return compact_zone(zone, &cc);
+	/* Get a hint on where to start compacting from the LRU */
+	anon_page = lru_to_page(&zone->lru[LRU_BASE + LRU_INACTIVE_ANON].list);
+	file_page = lru_to_page(&zone->lru[LRU_BASE + LRU_INACTIVE_FILE].list);
+	cc.migrate_pfn = min(page_to_pfn(anon_page), page_to_pfn(file_page));
+	cc.migrate_pfn = ALIGN(cc.migrate_pfn, pageblock_nr_pages);
+	start_migrate_pfn = cc.migrate_pfn;
+
+	ret = compact_zone(zone, &cc);
+
+	/* Restart migration from the start of zone if the hint did not work */
+	if (!zone_watermark_ok(zone, cc.order, low_wmark_pages(zone), 0, 0)) {
+		cc.migrate_pfn = 0;
+		cc.abort_migrate_pfn = start_migrate_pfn;
+		ret = compact_zone(zone, &cc);
+	}
+
+	return ret;
 }
 
 int sysctl_extfrag_threshold = 500;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca108ce..9a0fa57 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -113,8 +113,6 @@ struct scan_control {
 	nodemask_t	*nodemask;
 };
 
-#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
-
 #ifdef ARCH_HAS_PREFETCH
 #define prefetch_prev_lru_page(_page, _base, _field)			\
 	do {								\
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/8] mm: vmscan: Rename lumpy_mode to reclaim_mode
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
                   ` (6 preceding siblings ...)
  2010-11-17 16:22 ` [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start Mel Gorman
@ 2010-11-17 16:22 ` Mel Gorman
  2010-11-17 23:46 ` [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Andrew Morton
  8 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-17 16:22 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	Mel Gorman, linux-mm, linux-kernel

With compaction being used instead of lumpy reclaim, the name lumpy_mode
and its associated variables are a bit misleading. Rename lumpy_mode to
reclaim_mode, which is a better fit. There is no functional change.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |    6 ++--
 mm/vmscan.c                   |   72 ++++++++++++++++++++--------------------
 2 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index be76429..ea422aa 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -25,13 +25,13 @@
 
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
-	(sync & LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
+	(sync & RECLAIM_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
 #define trace_shrink_flags(file, sync) ( \
-	(sync & LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
+	(sync & RECLAIM_MODE_SYNC ? RECLAIM_WB_MIXED : \
 			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
-	(sync & LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	(sync & RECLAIM_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
 	)
 
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a0fa57..52a0f0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -53,22 +53,22 @@
 #include <trace/events/vmscan.h>
 
 /*
- * lumpy_mode determines how the inactive list is shrunk
- * LUMPY_MODE_SINGLE: Reclaim only order-0 pages
- * LUMPY_MODE_ASYNC:  Do not block
- * LUMPY_MODE_SYNC:   Allow blocking e.g. call wait_on_page_writeback
- * LUMPY_MODE_CONTIGRECLAIM: For high-order allocations, take a reference
+ * reclaim_mode determines how the inactive list is shrunk
+ * RECLAIM_MODE_SINGLE: Reclaim only order-0 pages
+ * RECLAIM_MODE_ASYNC:  Do not block
+ * RECLAIM_MODE_SYNC:   Allow blocking e.g. call wait_on_page_writeback
+ * RECLAIM_MODE_LUMPYRECLAIM: For high-order allocations, take a reference
  *			page from the LRU and reclaim all pages within a
  *			naturally aligned range
- * LUMPY_MODE_COMPACTION: For high-order allocations, reclaim a number of
+ * RECLAIM_MODE_COMPACTION: For high-order allocations, reclaim a number of
  *			order-0 pages and then compact the zone
  */
-typedef unsigned __bitwise__ lumpy_mode;
-#define LUMPY_MODE_SINGLE		((__force lumpy_mode)0x01u)
-#define LUMPY_MODE_ASYNC		((__force lumpy_mode)0x02u)
-#define LUMPY_MODE_SYNC			((__force lumpy_mode)0x04u)
-#define LUMPY_MODE_CONTIGRECLAIM	((__force lumpy_mode)0x08u)
-#define LUMPY_MODE_COMPACTION		((__force lumpy_mode)0x10u)
+typedef unsigned __bitwise__ reclaim_mode;
+#define RECLAIM_MODE_SINGLE		((__force reclaim_mode)0x01u)
+#define RECLAIM_MODE_ASYNC		((__force reclaim_mode)0x02u)
+#define RECLAIM_MODE_SYNC		((__force reclaim_mode)0x04u)
+#define RECLAIM_MODE_LUMPYRECLAIM	((__force reclaim_mode)0x08u)
+#define RECLAIM_MODE_COMPACTION		((__force reclaim_mode)0x10u)
 
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
@@ -101,7 +101,7 @@ struct scan_control {
 	 * Intend to reclaim enough continuous memory rather than reclaim
 	 * enough amount of memory. i.e, mode for high order allocation.
 	 */
-	lumpy_mode lumpy_reclaim_mode;
+	reclaim_mode reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -282,10 +282,10 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	return ret;
 }
 
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+static void set_reclaim_mode(int priority, struct scan_control *sc,
 				   bool sync)
 {
-	lumpy_mode syncmode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+	reclaim_mode syncmode = sync ? RECLAIM_MODE_SYNC : RECLAIM_MODE_ASYNC;
 
 	/*
 	 * Initially assume we are entering either lumpy reclaim or lumpy
@@ -293,9 +293,9 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
 	 * mode or just reclaim order-0 pages later.
 	 */
 	if (COMPACTION_BUILD)
-		sc->lumpy_reclaim_mode = LUMPY_MODE_COMPACTION;
+		sc->reclaim_mode = RECLAIM_MODE_COMPACTION;
 	else
-		sc->lumpy_reclaim_mode = LUMPY_MODE_CONTIGRECLAIM;
+		sc->reclaim_mode = RECLAIM_MODE_LUMPYRECLAIM;
 
 	/*
 	 * If we need a large contiguous chunk of memory, or have
@@ -303,16 +303,16 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
 	 * will reclaim both active and inactive pages.
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		sc->lumpy_reclaim_mode |= syncmode;
+		sc->reclaim_mode |= syncmode;
 	else if (sc->order && priority < DEF_PRIORITY - 2)
-		sc->lumpy_reclaim_mode |= syncmode;
+		sc->reclaim_mode |= syncmode;
 	else
-		sc->lumpy_reclaim_mode = LUMPY_MODE_SINGLE | LUMPY_MODE_ASYNC;
+		sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
 }
 
-static void disable_lumpy_reclaim_mode(struct scan_control *sc)
+static void reset_reclaim_mode(struct scan_control *sc)
 {
-	sc->lumpy_reclaim_mode = LUMPY_MODE_SINGLE | LUMPY_MODE_ASYNC;
+	sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
 }
 
 static inline int is_page_cache_freeable(struct page *page)
@@ -443,7 +443,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		 * first attempt to free a range of pages fails.
 		 */
 		if (PageWriteback(page) &&
-		    (sc->lumpy_reclaim_mode & LUMPY_MODE_SYNC))
+		    (sc->reclaim_mode & RECLAIM_MODE_SYNC))
 			wait_on_page_writeback(page);
 
 		if (!PageWriteback(page)) {
@@ -451,7 +451,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
-			trace_reclaim_flags(page, sc->lumpy_reclaim_mode));
+			trace_reclaim_flags(page, sc->reclaim_mode));
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -629,7 +629,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim_mode & LUMPY_MODE_CONTIGRECLAIM)
+	if (sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -746,7 +746,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if ((sc->lumpy_reclaim_mode & LUMPY_MODE_SYNC) &&
+			if ((sc->reclaim_mode & RECLAIM_MODE_SYNC) &&
 			    may_enter_fs)
 				wait_on_page_writeback(page);
 			else {
@@ -902,7 +902,7 @@ cull_mlocked:
 			try_to_free_swap(page);
 		unlock_page(page);
 		putback_lru_page(page);
-		disable_lumpy_reclaim_mode(sc);
+		reset_reclaim_mode(sc);
 		continue;
 
 activate_locked:
@@ -915,7 +915,7 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
-		disable_lumpy_reclaim_mode(sc);
+		reset_reclaim_mode(sc);
 keep_lumpy:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
@@ -1331,7 +1331,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 		return false;
 
 	/* Only stall on lumpy reclaim */
-	if (sc->lumpy_reclaim_mode & LUMPY_MODE_SINGLE)
+	if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
 		return false;
 
 	/* If we have relaimed everything on the isolated list, no stall */
@@ -1375,7 +1375,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			return SWAP_CLUSTER_MAX;
 	}
 
-	set_lumpy_reclaim_mode(priority, sc, false);
+	set_reclaim_mode(priority, sc, false);
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1383,13 +1383,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	 * If we are lumpy compacting, we bump nr_to_scan to at least
 	 * the size of the page we are trying to allocate
 	 */
-	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
+	if (sc->reclaim_mode & RECLAIM_MODE_COMPACTION)
 		nr_to_scan = max(nr_to_scan, (1UL << sc->order));
 
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode & LUMPY_MODE_CONTIGRECLAIM ?
+			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, 0, file);
 		zone->pages_scanned += nr_scanned;
@@ -1402,7 +1402,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode & LUMPY_MODE_CONTIGRECLAIM ?
+			sc->reclaim_mode & RECLAIM_MODE_LUMPYRECLAIM ?
 					ISOLATE_BOTH : ISOLATE_INACTIVE,
 			zone, sc->mem_cgroup,
 			0, file);
@@ -1425,7 +1425,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
-		set_lumpy_reclaim_mode(priority, sc, true);
+		set_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
 	}
 
@@ -1436,14 +1436,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
-	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
+	if (sc->reclaim_mode & RECLAIM_MODE_COMPACTION)
 		reclaimcompact_zone_order(zone, sc->order, sc->gfp_mask);
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
 		priority,
-		trace_shrink_flags(file, sc->lumpy_reclaim_mode));
+		trace_shrink_flags(file, sc->reclaim_mode));
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
                   ` (7 preceding siblings ...)
  2010-11-17 16:22 ` [PATCH 8/8] mm: vmscan: Rename lumpy_mode to reclaim_mode Mel Gorman
@ 2010-11-17 23:46 ` Andrew Morton
  2010-11-18  2:03   ` Rik van Riel
  2010-11-18  8:12   ` Mel Gorman
  8 siblings, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2010-11-17 23:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, 17 Nov 2010 16:22:41 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> Huge page allocations are not expected to be cheap but lumpy reclaim
> is still very disruptive.

Huge pages are boring.  Can we expect any benefit for the
stupid-nic-driver-which-does-order-4-GFP_ATOMIC-allocations problem?

>
> ...
>
> I haven't pushed hard on the concept of lumpy compaction yet and right
> now I don't intend to during this cycle. The initial prototypes did not
> behave as well as expected and this series improves the current situation
> a lot without introducing new algorithms. Hence, I'd like this series to
> be considered for merging.

Translation: "Andrew, wait for the next version"? :)

> I'm hoping that this series also removes the
> necessity for the "delete lumpy reclaim" patch from the THP tree.

Now I'm sad.  I read all that and was thinking "oh goody, we get to
delete something for once".  But no :(

If you can get this stuff to work nicely, why can't we remove lumpy
reclaim?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-17 23:46 ` [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Andrew Morton
@ 2010-11-18  2:03   ` Rik van Riel
  2010-11-18  8:12   ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Rik van Riel @ 2010-11-18  2:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Andrea Arcangeli, KOSAKI Motohiro, Johannes Weiner,
	linux-mm, linux-kernel

On 11/17/2010 06:46 PM, Andrew Morton wrote:
> On Wed, 17 Nov 2010 16:22:41 +0000
> Mel Gorman<mel@csn.ul.ie>  wrote:

>> I'm hoping that this series also removes the
>> necessity for the "delete lumpy reclaim" patch from the THP tree.
>
> Now I'm sad.  I read all that and was thinking "oh goody, we get to
> delete something for once".  But no :(
>
> If you can get this stuff to work nicely, why can't we remove lumpy
> reclaim?

I seem to remember there being some resistance against
removing lumpy reclaim, but I do not remember from
where or why.

IMHO some code deletion would be nice :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-17 23:46 ` [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Andrew Morton
  2010-11-18  2:03   ` Rik van Riel
@ 2010-11-18  8:12   ` Mel Gorman
  2010-11-18  8:26     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-18  8:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, Nov 17, 2010 at 03:46:41PM -0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 16:22:41 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Huge page allocations are not expected to be cheap but lumpy reclaim
> > is still very disruptive.
> 
> Huge pages are boring.  Can we expect any benefit for the
> stupid-nic-driver-which-does-order-4-GFP_ATOMIC-allocations problem?
> 

Yes. Specifically, while GFP_ATOMIC allocations still cannot enter compaction
(although with asynchronous migration, it's closer), kswapd will react
faster. As a result, it should be harder to trigger allocation failures.

Huge pages are simply the worst case in terms of allocation latency which
is why I tend to focus testing on them. That, and I don't have a suitable
pair of machines with one of these order-4-atomic-stupid-nics.

> > I haven't pushed hard on the concept of lumpy compaction yet and right
> > now I don't intend to during this cycle. The initial prototypes did not
> > behave as well as expected and this series improves the current situation
> > a lot without introducing new algorithms. Hence, I'd like this series to
> > be considered for merging.
> 
> Translation: "Andrew, wait for the next version"? :)
> 

Preferably do not wait unless review reveals a major flaw. Lumpy compaction
in its initial prototype versions simply did not work out as a good policy
modification and the idea requires much deeper thought. This series was
effective at getting latencies down to the level I expected lumpy compaction
to reach. If I do make lumpy compaction work properly, its effect will be to
reduce scanning rates but the latencies are likely to be similar.

> > I'm hoping that this series also removes the
> > necessity for the "delete lumpy reclaim" patch from the THP tree.
> 
> Now I'm sad.  I read all that and was thinking "oh goody, we get to
> delete something for once".  But no :(
> 
> If you can get this stuff to work nicely, why can't we remove lumpy
> reclaim?

Ultimately we should be able to. Lumpy reclaim is still there for the
!CONFIG_COMPACTION case and to have an option if we find that compaction
behaves badly for some reason.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-18  8:12   ` Mel Gorman
@ 2010-11-18  8:26     ` KAMEZAWA Hiroyuki
  2010-11-18  8:38       ` Johannes Weiner
  2010-11-18  8:44       ` Mel Gorman
  0 siblings, 2 replies; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-18  8:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, linux-mm, linux-kernel

On Thu, 18 Nov 2010 08:12:54 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> > > I'm hoping that this series also removes the
> > > necessity for the "delete lumpy reclaim" patch from the THP tree.
> > 
> > Now I'm sad.  I read all that and was thinking "oh goody, we get to
> > delete something for once".  But no :(
> > 
> > If you can get this stuff to work nicely, why can't we remove lumpy
> > reclaim?
> 
> Ultimately we should be able to. Lumpy reclaim is still there for the
> !CONFIG_COMPACTION case and to have an option if we find that compaction
> behaves badly for some reason.
> 

Hmm. CONFIG_COMPACTION depends on CONFIG_MMU. lumpy reclaim will be for NOMMU,
finally ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-18  8:26     ` KAMEZAWA Hiroyuki
@ 2010-11-18  8:38       ` Johannes Weiner
  2010-11-18  9:20         ` Mel Gorman
  2010-11-18  8:44       ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Johannes Weiner @ 2010-11-18  8:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrew Morton, Andrea Arcangeli, KOSAKI Motohiro,
	Rik van Riel, linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 05:26:27PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Nov 2010 08:12:54 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > > I'm hoping that this series also removes the
> > > > necessity for the "delete lumpy reclaim" patch from the THP tree.
> > > 
> > > Now I'm sad.  I read all that and was thinking "oh goody, we get to
> > > delete something for once".  But no :(
> > > 
> > > If you can get this stuff to work nicely, why can't we remove lumpy
> > > reclaim?
> > 
> > Ultimately we should be able to. Lumpy reclaim is still there for the
> > !CONFIG_COMPACTION case and to have an option if we find that compaction
> > behaves badly for some reason.
> > 
> 
> Hmm. CONFIG_COMPACTION depends on CONFIG_MMU. lumpy reclaim will be for NOMMU,
> finally ?

It's because migration depends on MMU.  But we should be able to make
a NOMMU version of migration that just does page cache, which is all
that is reclaimable on NOMMU anyway.

At this point, the MMU dependency can go away, and so can lumpy
reclaim.

	Hannes

PS: I'm recovering from a cold, will catch up with the backlog later

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-18  8:26     ` KAMEZAWA Hiroyuki
  2010-11-18  8:38       ` Johannes Weiner
@ 2010-11-18  8:44       ` Mel Gorman
  1 sibling, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-18  8:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel,
	Johannes Weiner, linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 05:26:27PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Nov 2010 08:12:54 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > > I'm hoping that this series also removes the
> > > > necessity for the "delete lumpy reclaim" patch from the THP tree.
> > > 
> > > Now I'm sad.  I read all that and was thinking "oh goody, we get to
> > > delete something for once".  But no :(
> > > 
> > > If you can get this stuff to work nicely, why can't we remove lumpy
> > > reclaim?
> > 
> > Ultimately we should be able to. Lumpy reclaim is still there for the
> > !CONFIG_COMPACTION case and to have an option if we find that compaction
> > behaves badly for some reason.
> > 
> 
> Hmm. CONFIG_COMPACTION depends on CONFIG_MMU. lumpy reclaim will be for NOMMU,
> finally ?
> 

Also true. As it is, lumpy reclaim is still there but it's never called
if CONFIG_COMPACTION is set so it's already side-lined.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start
  2010-11-17 16:22 ` [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start Mel Gorman
@ 2010-11-18  9:10   ` KAMEZAWA Hiroyuki
  2010-11-18  9:28     ` Mel Gorman
  2010-11-18 18:46   ` Andrea Arcangeli
  1 sibling, 1 reply; 34+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-11-18  9:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, KOSAKI Motohiro, Andrew Morton, Rik van Riel,
	Johannes Weiner, linux-mm, linux-kernel

On Wed, 17 Nov 2010 16:22:48 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> The end of the LRU stores the oldest known page. Compaction on the other
> hand always starts scanning from the start of the zone. This patch uses
> the LRU to hint to compaction where it should start scanning from. This
> means that compaction will at least start with some old pages reducing
> the impact on running processes and reducing the amount of scanning. The
> check it makes is racy as the LRU lock is not taken but it should be
> harmless as we are not manipulating the lists without the lock.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Hmm, does this patch make a noticeable difference ?
Isn't it better to start scanning from the biggest free chunk in a zone ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-18  8:38       ` Johannes Weiner
@ 2010-11-18  9:20         ` Mel Gorman
  2010-11-18 19:49           ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-18  9:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli,
	KOSAKI Motohiro, Rik van Riel, linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 09:38:28AM +0100, Johannes Weiner wrote:
> On Thu, Nov 18, 2010 at 05:26:27PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 18 Nov 2010 08:12:54 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > > > I'm hoping that this series also removes the
> > > > > necessity for the "delete lumpy reclaim" patch from the THP tree.
> > > > 
> > > > Now I'm sad.  I read all that and was thinking "oh goody, we get to
> > > > delete something for once".  But no :(
> > > > 
> > > > If you can get this stuff to work nicely, why can't we remove lumpy
> > > > reclaim?
> > > 
> > > Ultimately we should be able to. Lumpy reclaim is still there for the
> > > !CONFIG_COMPACTION case and to have an option if we find that compaction
> > > behaves badly for some reason.
> > > 
> > 
> > Hmm. CONFIG_COMPACTION depends on CONFIG_MMU. lumpy reclaim will be for NOMMU,
> > finally ?
> 
> It's because migration depends on MMU.  But we should be able to make
> a NOMMU version of migration that just does page cache, which is all
> that is reclaimable on NOMMU anyway.
> 

Conceivably, but I see little problem leaving them with lumpy reclaim. As
page cache and anon pages are mixed together in MIGRATE_MOVABLE but only one
set of pages can be discarded, the success rates of either lumpy reclaim or
compaction are doubtful. It'd require a specific investigation.

> At this point, the MMU dependency can go away, and so can lumpy
> reclaim.
> 

The series never calls lumpy reclaim once CONFIG_COMPACTION is set. The code
could be shrunk with the below patch but the saving to vmlinux is minimal
(288 bytes for me on x86-64). My preference is still to have lumpy reclaim
available as a comparison point with compaction for a development cycle or two.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 52a0f0c..7488983 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,7 +1048,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			BUG();
 		}
 
-		if (!order)
+		if (!order || COMPACTION_BUILD)
 			continue;
 
 		/*

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start
  2010-11-18  9:10   ` KAMEZAWA Hiroyuki
@ 2010-11-18  9:28     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-18  9:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Arcangeli, KOSAKI Motohiro, Andrew Morton, Rik van Riel,
	Johannes Weiner, linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 06:10:48PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 17 Nov 2010 16:22:48 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The end of the LRU stores the oldest known page. Compaction on the other
> > hand always starts scanning from the start of the zone. This patch uses
> > the LRU to hint to compaction where it should start scanning from. This
> > means that compaction will at least start with some old pages reducing
> > the impact on running processes and reducing the amount of scanning. The
> > check it makes is racy as the LRU lock is not taken but it should be
> > harmless as we are not manipulating the lists without the lock.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Hmm, does this patch make a noticeable difference ?

To scanning rates - yes.

> Isn't it better to start scanning from the biggest free chunk in a zone ?
> 

Not necessarily. The biggest free chunk does not necessarily contain old
pages so one could stall a process by migrating a very active page. The same
applies for selecting the pageblock with the oldest LRU page of course but
it is less likely.

I prototyped a patch that constantly used the buddy lists to select the
next pageblock to migrate from. The problem was that it was possible for it
to loop forever because it could migrate from the same block more than once
in a migration cycle. To resolve that, I'd have to keep track of visited
pageblocks but I didn't want to require additional memory unless it was
absolutely necessary. I think the concept can be perfected and its impact
would be a reduction in scanning rates but it's not something that is
anywhere near merging yet.
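
To make the bookkeeping concrete, here is a toy userspace model of the
visited tracking that would be needed to guarantee termination (not kernel
code; suggest_pageblock() is a made-up stand-in for the buddy-list walk):

#include <stdbool.h>
#include <stdio.h>

#define NR_PAGEBLOCKS 64

/* Made-up stand-in for "ask the buddy lists which pageblock looks best";
 * it may legitimately suggest the same block repeatedly, which is the
 * problem being modelled. */
static int suggest_pageblock(unsigned int i)
{
	return (i * 13 + 7) % NR_PAGEBLOCKS;
}

int main(void)
{
	bool visited[NR_PAGEBLOCKS] = { false };	/* the extra memory cost */
	int distinct = 0;
	unsigned int i;

	/* One compaction cycle: never migrate from the same block twice */
	for (i = 0; distinct < NR_PAGEBLOCKS && i < 16 * NR_PAGEBLOCKS; i++) {
		int pb = suggest_pageblock(i);

		if (visited[pb])
			continue;	/* this skip is what prevents the infinite loop */
		visited[pb] = true;
		distinct++;
		printf("migrate pages out of pageblock %d\n", pb);
	}
	return 0;
}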

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim
  2010-11-17 16:22 ` [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim Mel Gorman
@ 2010-11-18 18:09   ` Andrea Arcangeli
  2010-11-18 18:30     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 18:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, Nov 17, 2010 at 04:22:44PM +0000, Mel Gorman wrote:
> +	 */
> +	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
> +		nr_to_scan = max(nr_to_scan, (1UL << sc->order));

Just one nitpick: I'm not sure this is a good idea. We can scan quite
some pages and we may do nothing on them. First I guess for symmetry
this should be 2UL << sc->order to match the 2UL << order in the
watermark checks in compaction.c (maybe it should be 3UL if something
considering the probability at least one page is mapped and won't be
freed is quite high). But SWAP_CLUSTER_MAX is only 32 pages.. not even
close to 1UL << 9 (hugepage order 9). So I think this can safely be
removed... it only makes a difference for the stack with order 2. And
for order 2 when we take the spinlocks we can take all 32 pages
without screwing the "free" levels in any significant way, considering
maybe only 4 pages are really freed in the end, and if all 32 pages
are really freed (i.e. all plain clean cache), all that matters to
avoid freeing more cache is to stick to compaction next time around
(i.e. at the next allocation). And if compaction fails again next time
around, then it's all right to shrink 32 more pages even for order
2...

In short I'd delete the above chunk and run the shrinker unmodified
as this is too low-level an idea, and the only real-life effect is to
decrease VM scalability for kernel stack allocation a tiny bit, with
no benefit whatsoever.

It's subtle because the difference it'd make is so infinitesimal and
I can only imagine it's a worsening overall difference.
> @@ -1425,6 +1438,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
>  
> +	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
> +		reclaimcompact_zone_order(zone, sc->order, sc->gfp_mask);
> +
>  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
>  		zone_idx(zone),
>  		nr_scanned, nr_reclaimed,

I'm worried about this one as the objective here is to increase the
amount of free pages, and the loop won't stop until we reach
nr_reclaimed >= nr_to_reclaim. I'm afraid it'd sometimes lead to doing
extra compaction work here for no good reason. In short there is no
feedback check in the loop to verify whether this newly introduced
compaction work in the shrinker got us the hugepage so the loop can
stop. It sounds like some pretty random compaction invocation here just
to run it more frequently.

nr_to_reclaim is only 32 anyway. So my suggestion is to remove it and
let the shrinker do its thing without interleaving compaction inside
the shrinker, without feedback check if the compaction actually
succeeded (maybe 100% of free ram is contiguous already), and then try
compaction again outside of the shrinker interleaving it with the
shrinker as usual if the watermarks aren't satisfied yet after
shrinker freed nr_to_reclaim pages.

I prefer we keep separated the job of freeing more pages from the job
of compacting the single free pages into higher order pages. It's only
32 pages being freed we're talking about here so there is no need to call
compaction more frequently (if something we should increase
nr_to_reclaim to 512 and to call compaction less frequently). If the
interleaving of the caller isn't ok then fix it in the caller and also
update the nr_to_reclaim, but I think keeping those separated is way
cleaner and the mixture is unnecessary.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path
  2010-11-17 16:22 ` [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path Mel Gorman
@ 2010-11-18 18:21   ` Andrea Arcangeli
  2010-11-18 18:34     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 18:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, Nov 17, 2010 at 04:22:45PM +0000, Mel Gorman wrote:
> @@ -484,6 +486,7 @@ static unsigned long compact_zone_order(struct zone *zone,
>  		.order = order,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
>  		.zone = zone,
> +		.sync = false,
>  	};
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);

I like this because I'm very anxious to avoid wait-I/O latencies
introduced into hugepage allocations that I prefer to fail quickly and
be handled later by khugepaged ;).

But I could have khugepaged call this with sync=true... so I'd need a
__GFP_ flag that only khugepaged would use to notify compaction should
be synchronous for khugepaged (not for the regular allocations in page
faults). Can we do this through gfp_mask only?

> @@ -500,6 +503,7 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
>  		.order = order,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
>  		.zone = zone,
> +		.sync = true,
>  	};
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);

Is this intentional? That inner compaction invocation is
equivalent to the one interleaved with the shrinker tried before
invoking the shrinker. So I don't see why they should differ (one sync
and one async).

Anyway I'd prefer the inner invocation to be removed as a whole and to
keep only going with the interleaving and to keep the two jobs of
compaction and shrinking memory fully separated and to stick to the
interleaving. If this reclaimcompact_zone_order helps maybe it means
compact_zone_order isn't doing the right thing and we're hiding it by
randomly calling it more frequently...

I can see a point however in doing:

compaction async
shrink (may wait) (scan 500 pages, freed 32 pages)
compaction sync (may wait)

to:

compaction async
shrink (scan 32 pages, freed 0 pages)
compaction sync (hugepage generated nobody noticed)
shrink (scan 32 pages, freed 0 pages)
compaction sync
shrink (scan 32 pages, freed 0 pages)
[..]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim
  2010-11-18 18:09   ` Andrea Arcangeli
@ 2010-11-18 18:30     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-18 18:30 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 07:09:56PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 17, 2010 at 04:22:44PM +0000, Mel Gorman wrote:
> > +	 */
> > +	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
> > +		nr_to_scan = max(nr_to_scan, (1UL << sc->order));
> 
> Just one nitpick: I'm not sure this is a good idea. We can scan quite
> some pages and we may do nothing on them.

True, I could loop based on nr_taken, taking care not to loop forever in
there.

> First I guess for symmetry
> this should be 2UL << sc->order to match the 2UL << order in the
> watermark checks in compaction.c (maybe it should be 3UL if something
> considering the probability at least one page is mapped and won't be
> freed is quite high). But SWAP_CLUSTER_MAX is only 32 pages.. not even
> close to 1UL << 9 (hugepage order 9).

True again, the scan rate gets bumped up for compaction recognising that
more pages are required.

> So I think this can safely be
> removed... it only makes a difference for the stack with order 2. And
> for order 2 when we take the spinlocks we can take all 32 pages
> without screwing the "free" levels in any significant way, considering
> maybe only 4 pages are really freed in the end, and if all 32 pages
> are really freed (i.e. all plain clean cache), all that matters to
> avoid freeing more cache is to stick to compaction next time around
> (i.e. at the next allocation). And if compaction fails again next time
> around, then it's all right to shrink 32 more pages even for order
> 2...
> 

Well, I'm expecting the exit of direct reclaim and another full
allocation loop so this is taken into account.

> In short I'd delete the above chunk and run the shrinker unmodified
> as this is too low-level an idea, and the only real-life effect is to
> decrease VM scalability for kernel stack allocation a tiny bit, with
> no benefit whatsoever.
> 

I'm not sure I get this. If it reclaims too few pages then compaction
will just fail the next time so we'll take the larger loop more
frequently. This in itself is not too bad although it interferes with
the patches later in the series that have try_to_compact_pages() do a
faster scan than this inner compaction loop.

> It's subtle because the difference it'd make is so infinitesimal and
> I can only imagine it's a worsening overall difference.

I can try it out to be sure but right now I'm not convinced. Then again,
I'm burned out after reviewing THP so I'm not at my best either :)

> > @@ -1425,6 +1438,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  
> >  	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
> >  
> > +	if (sc->lumpy_reclaim_mode & LUMPY_MODE_COMPACTION)
> > +		reclaimcompact_zone_order(zone, sc->order, sc->gfp_mask);
> > +
> >  	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> >  		zone_idx(zone),
> >  		nr_scanned, nr_reclaimed,
> 
> I'm worried about this one as the objective here is to increase the
> amount of free pages, and the loop won't stop until we reach
> nr_reclaimed >= nr_to_reclaim.

Which remains at SWAP_CLUSTER_MAX. shrink_inactive_list is doing more
work than requested to satisfy the compaction requirements. It could be
fed directly into nr_to_reclaim though if necessary.

> I'm afraid it'd sometimes lead to doing
> extra compaction work here for no good reason. In short there is no
> feedback check in the loop to verify whether this newly introduced
> compaction work in the shrinker got us the hugepage so the loop can
> stop. It sounds like some pretty random compaction invocation here just
> to run it more frequently.
> 

While we enter compaction, we also use compaction_suitable() to predict
if compaction would be a waste of time. If it would be, we don't compact
and instead go all the way out to the allocator again. From this
perspective, it makes more sense to have altered nr_to_reclaim than
nr_to_scan.
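
To illustrate the sort of check compaction_suitable() stands for (the names,
numbers and exact threshold below are illustrative only, not the actual
implementation): compaction needs roughly 2^order free base pages for the
allocation plus about the same again as working space for migration targets.

#include <stdbool.h>
#include <stdio.h>

/* Sketch only: is there enough order-0 free memory for compaction to
 * have a realistic chance of forming the requested high-order page? */
static bool compaction_worth_trying(unsigned long free_pages,
				    unsigned long low_wmark,
				    unsigned int order)
{
	return free_pages >= low_wmark + (2UL << order);
}

int main(void)
{
	/* order-9 (huge page): needs the watermark plus 1024 free base pages */
	printf("%d\n", compaction_worth_trying(4096, 2048, 9));	/* 1 */
	printf("%d\n", compaction_worth_trying(2500, 2048, 9));	/* 0 */
	return 0;
}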

> nr_to_reclaim is only 32 anyway. So my suggestion is to remove it and
> let the shrinker do its thing without interleaving compaction inside
> the shrinker, without feedback check if the compaction actually
> succeeded (maybe 100% of free ram is contiguous already), and then try
> compaction again outside of the shrinker interleaving it with the
> shrinker as usual if the watermarks aren't satisfied yet after
> shrinker freed nr_to_reclaim pages.
> 

I'll try it but again it busts the idea of try_to_compact_pages() doing an
optimistic compaction of a subset of pages. It also puts a bit of a hole in
the idea of developing lumpy compaction later because the outer allocation
loop is unsuitable and I'm not keen on the idea of putting reclaim logic
in mm/compaction.c

> I prefer we keep separated the job of freeing more pages from the job
> of compacting the single free pages into higher order pages. It's only
> 32 pages being freed we're talking about here so there is no need to call
> compaction more frequently

Compaction doesn't happen unless enough pages are free. Yes, I call into
compaction but it shouldn't do any heavy work. It made more sense to do
it that way than embed more compaction awareness into vmscan.c

> (if something we should increase
> nr_to_reclaim to 512 and to call compaction less frequently). If the
> interleaving of the caller isn't ok then fix it in the caller and also
> update the nr_to_reclaim, but I think keeping those separated is way
> cleaner and the mixture is unnecessary.
> 

I'll look closer at altering nr_to_reclaim instead of nr_to_scan. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path
  2010-11-18 18:21   ` Andrea Arcangeli
@ 2010-11-18 18:34     ` Mel Gorman
  2010-11-18 19:00       ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-18 18:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 07:21:06PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 17, 2010 at 04:22:45PM +0000, Mel Gorman wrote:
> > @@ -484,6 +486,7 @@ static unsigned long compact_zone_order(struct zone *zone,
> >  		.order = order,
> >  		.migratetype = allocflags_to_migratetype(gfp_mask),
> >  		.zone = zone,
> > +		.sync = false,
> >  	};
> >  	INIT_LIST_HEAD(&cc.freepages);
> >  	INIT_LIST_HEAD(&cc.migratepages);
> 
> I like this because I'm very anxious to avoid wait-I/O latencies
> introduced into hugepage allocations that I prefer to fail quickly and
> be handled later by khugepaged ;).
> 

As you can see from the graphs in the leader, it makes a big difference to
latency as well to avoid sync migration where possible.

> But I could have khugepaged call this with sync=true... so I'd need a
> __GFP_ flag that only khugepaged would use to notify compaction should
> be synchronous for khugepaged (not for the regular allocations in page
> faults). Can we do this through gfp_mask only?
> 

We could pass gfp flags in I guess and abuse __GFP_NO_KSWAPD (from the THP
series obviously)?
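
Something like this, purely as a sketch of the idea (it assumes the THP
series' __GFP_NO_KSWAPD flag is available; the helper name is invented):

/* THP page faults pass __GFP_NO_KSWAPD, so treat those as latency
 * sensitive and keep migration asynchronous; everyone else may use
 * synchronous migration. */
static bool compaction_may_sync(gfp_t gfp_mask)
{
	return !(gfp_mask & __GFP_NO_KSWAPD);
}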

> > @@ -500,6 +503,7 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
> >  		.order = order,
> >  		.migratetype = allocflags_to_migratetype(gfp_mask),
> >  		.zone = zone,
> > +		.sync = true,
> >  	};
> >  	INIT_LIST_HEAD(&cc.freepages);
> >  	INIT_LIST_HEAD(&cc.migratepages);
> 
> Is this intentional?

Yes, it's the "slower" path where we've already reclaim pages and more
willing to wait for the compaction to occur as the alternative is failing
the allocation.

> That inner compaction invocation is
> equivalent to the one interleaved with the shrinker tried before
> invoking the shrinker. So I don't see why they should differ (one sync
> and one async).
> 

The async one later in the series becomes very light with the heavier
work being done within reclaim if necessary.

> Anyway I'd prefer the inner invocation to be removed as a whole and to
> keep only going with the interleaving and to keep the two jobs of
> compaction and shrinking memory fully separated and to stick to the
> interleaving. If this reclaimcompact_zone_order helps maybe it means
> compact_zone_order isn't doing the right thing and we're hiding it by
> randomly calling it more frequently...
> 

I'll think about it more. I could just leave it at try_to_compact_pages
doing the zonelist scan although it's not immediately occurring to me how I
should decide between sync and async other than "async the first time and
sync after that". The allocator does not have the same "reclaim priority"
awareness that reclaim does.

> I can see a point however in doing:
> 
> compaction async
> shrink (may wait) (scan 500 pages, freed 32 pages)
> compaction sync (may wait)
> 
> to:
> 
> compaction async
> shrink (scan 32 pages, freed 0 pages)
> compaction sync (hugepage generated nobody noticed)
> shrink (scan 32 pages, freed 0 pages)
> compaction sync
> shrink (scan 32 pages, freed 0 pages)
> [..]
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages()
  2010-11-17 16:22 ` [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages() Mel Gorman
@ 2010-11-18 18:34   ` Andrea Arcangeli
  2010-11-18 18:50     ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 18:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, Nov 17, 2010 at 04:22:47PM +0000, Mel Gorman wrote:
> @@ -485,8 +500,8 @@ static unsigned long compact_zone_order(struct zone *zone,
>  		.nr_migratepages = 0,
>  		.order = order,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
> +		.migrate_fast_scan = true,
>  		.zone = zone,
> -		.sync = false,
>  	};
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);
> @@ -502,8 +517,8 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
>  		.nr_migratepages = 0,
>  		.order = order,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
> +		.migrate_fast_scan = false,
>  		.zone = zone,
> -		.sync = true,
>  	};

Same as for the previous feature (sync/async migrate) I'd prefer if
this was a __GFP_ flag (khugepaged will do the no-fast-scan version,
page fault will only run compaction in fast scan mode) and if we
removed the reclaimcompact_zone_order and we stick with the
interleaving of shrinker and try_to_compact_pages from the alloc_pages
caller, with no nesting of compaction inside the shrinker.

Another possibility would be to not have those as __GFP flags, and to
always do the first try_to_compact_pages with async+fast_scan, then
call the shrinker and then all next try_to_compact_pages would be
called with sync+no_fast_scan mode.
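
Roughly this shape, i.e. (every identifier below is a placeholder invented
for illustration, not the real allocator API):

/* one optimistic pass, then reclaim, then one thorough pass */
static struct page *alloc_high_order_sketch(struct zonelist *zl, int order,
					    gfp_t gfp_mask)
{
	struct page *page;

	/* optimistic: async migration + fast MOVABLE-only scan */
	page = compact_then_alloc(zl, order, gfp_mask, false, true);
	if (page)
		return page;

	/* free a few order-0 pages to give compaction room to work */
	reclaim_some_order0(zl, order, gfp_mask);

	/* thorough: sync migration + full zone scan before giving up */
	return compact_then_alloc(zl, order, gfp_mask, true, false);
}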

But I'd love it if we can further decrease the risk of a too-long
hugepage page fault before the normal page fallback, and to have a
__GFP_ flag for these two. Even the same __GFP flag could work for
both...

So my preference would be to nuke reclaimcompact_zone_order, only
stick to compact_zone_order and the current interleaving, and add a
__GFP_COMPACT_FAST used by the hugepmd page fault (that will enable
both async migrate and fast-scan). khugepaged and hugetlbfs won't use
__GFP_COMPACT_FAST.

I'm undecided if a __GFP_ flag is needed to differentiate the callers,
or if we should just run the first try_to_compact_pages in
"optimistic" mode by default.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start
  2010-11-17 16:22 ` [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start Mel Gorman
  2010-11-18  9:10   ` KAMEZAWA Hiroyuki
@ 2010-11-18 18:46   ` Andrea Arcangeli
  2010-11-19 11:08     ` Mel Gorman
  1 sibling, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 18:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Wed, Nov 17, 2010 at 04:22:48PM +0000, Mel Gorman wrote:
> +	if (!cc->migrate_pfn)
> +		cc->migrate_pfn = zone->zone_start_pfn;

wouldn't it remove a branch if the caller always set migrate_pfn?

> +	if (!cc->free_pfn) {
> +		cc->free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +		cc->free_pfn &= ~(pageblock_nr_pages-1);
> +	}

Who sets free_pfn to zero? Previously this was always initialized.

> @@ -523,7 +539,23 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);
>  
> -	return compact_zone(zone, &cc);
> +	/* Get a hint on where to start compacting from the LRU */
> +	anon_page = lru_to_page(&zone->lru[LRU_BASE + LRU_INACTIVE_ANON].list);
> +	file_page = lru_to_page(&zone->lru[LRU_BASE + LRU_INACTIVE_FILE].list);
> +	cc.migrate_pfn = min(page_to_pfn(anon_page), page_to_pfn(file_page));
> +	cc.migrate_pfn = ALIGN(cc.migrate_pfn, pageblock_nr_pages);
> +	start_migrate_pfn = cc.migrate_pfn;
> +
> +	ret = compact_zone(zone, &cc);
> +
> +	/* Restart migration from the start of zone if the hint did not work */
> +	if (!zone_watermark_ok(zone, cc.order, low_wmark_pages(zone), 0, 0)) {
> +		cc.migrate_pfn = 0;
> +		cc.abort_migrate_pfn = start_migrate_pfn;
> +		ret = compact_zone(zone, &cc);
> +	}
> +

I doubt it works ok if the list is empty... Maybe it's safer to
validate the pfn taken from the LRU against the zone pfn start/end
before setting it as the migrate_pfn.
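
Something like the below guard before trusting the hint, perhaps (sketch
only, the helper is invented; the fields are the ones the patch already
uses):

static unsigned long sanitise_migrate_hint(struct zone *zone, unsigned long pfn)
{
	unsigned long start = zone->zone_start_pfn;
	unsigned long end = start + zone->spanned_pages;

	/* A pfn outside the zone means the LRU hint was junk */
	if (pfn < start || pfn >= end)
		return start;

	/* Round to a pageblock boundary as the patch does with ALIGN() */
	return ALIGN(pfn, pageblock_nr_pages);
}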

Interestingly this heuristic slowed down the benchmark when it should lead
to the exact opposite thanks to saving some cpu. So I guess maybe it's
not worth it. I see it increases the ratio of compaction by a tiny
bit, but if a tiny bit of better compaction comes at the expense of
an increased runtime I don't like it and I'd drop it... It's not
making enough of a difference; further, we could extend it to check the
"second" page in the list and so on... so we can just go blind. All that
matters is that we use a clock algorithm and I guess this screws it up,
and this is why it leads to increased time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages()
  2010-11-18 18:34   ` Andrea Arcangeli
@ 2010-11-18 18:50     ` Mel Gorman
  2010-11-18 19:08       ` Andrea Arcangeli
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-18 18:50 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 07:34:48PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 17, 2010 at 04:22:47PM +0000, Mel Gorman wrote:
> > @@ -485,8 +500,8 @@ static unsigned long compact_zone_order(struct zone *zone,
> >  		.nr_migratepages = 0,
> >  		.order = order,
> >  		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.migrate_fast_scan = true,
> >  		.zone = zone,
> > -		.sync = false,
> >  	};
> >  	INIT_LIST_HEAD(&cc.freepages);
> >  	INIT_LIST_HEAD(&cc.migratepages);
> > @@ -502,8 +517,8 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
> >  		.nr_migratepages = 0,
> >  		.order = order,
> >  		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.migrate_fast_scan = false,
> >  		.zone = zone,
> > -		.sync = true,
> >  	};
> 
> Same as for the previous feature (sync/async migrate) I'd prefer if
> this was a __GFP_ flag (khugepaged will do the no-fast-scan version,
> page fault will only run compaction in fast scan mode) and if we

For THP in general, I think we can abuse __GFP_NO_KSWAPD. For other callers,
I'm not sure it's fair to push the responsibility of async/sync to them. We
don't do it for reclaim for example and I'd worry the wrong decisions would
be made or that they'd always select async for "performance" and then bitch
about an allocation failure.

> removed the reclaimcompact_zone_order and we stick with the
> interleaving of shrinker and try_to_compact_pages from the alloc_pages
> caller, with no nesting of compaction inside the shrinker.
> 
> Another possibility would be to not have those as __GFP flags, and to
> always do the first try_to_compact_pages with async+fast_scan, then
> call the shrinker and then all next try_to_compact_pages would be
> called with sync+no_fast_scan mode.
> 

I'll investigate this.

> But I'd love it if we can further decrease the risk of a too-long
> hugepage page fault before the normal page fallback, and to have a
> __GFP_ flag for these two. Even the same __GFP flag could work for
> both...
> 
> So my preference would be to nuke reclaimcompact_zone_order, only
> stick to compact_zone_order and the current interleaving, and add a
> __GFP_COMPACT_FAST used by the hugepmd page fault (that will enable
> both async migrate and fast-scan). khugepaged and hugetlbfs won't use
> __GFP_COMPACT_FAST.

My only whinge about the lack of reclaimcompact_zone_order is that it
makes it harder to even contemplate lumpy compaction in the future but
it could always be reintroduced if absolutely necessary.

> 
> I'm undecided if a __GFP_ flag is needed to differentiate the callers,
> or if we should just run the first try_to_compact_pages in
> "optimistic" mode by default.
> 

GFP flags would be my last preference. 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path
  2010-11-18 18:34     ` Mel Gorman
@ 2010-11-18 19:00       ` Andrea Arcangeli
  0 siblings, 0 replies; 34+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 19:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 06:34:38PM +0000, Mel Gorman wrote:
> On Thu, Nov 18, 2010 at 07:21:06PM +0100, Andrea Arcangeli wrote:
> > On Wed, Nov 17, 2010 at 04:22:45PM +0000, Mel Gorman wrote:
> > > @@ -484,6 +486,7 @@ static unsigned long compact_zone_order(struct zone *zone,
> > >  		.order = order,
> > >  		.migratetype = allocflags_to_migratetype(gfp_mask),
> > >  		.zone = zone,
> > > +		.sync = false,
> > >  	};
> > >  	INIT_LIST_HEAD(&cc.freepages);
> > >  	INIT_LIST_HEAD(&cc.migratepages);
> > 
> > I like this because I'm very anxious to avoid wait-I/O latencies
> > introduced into hugepage allocations that I prefer to fail quickly and
> > be handled later by khugepaged ;).
> > 
> 
> As you can see from the graphs in the leader, it makes a big difference to
> latency as well to avoid sync migration where possible.

Yep, amazing benchmarking work you did! Great job indeed.

I had already been thinking of this synchronous wait in migrate as
troublesome a few days ago, while reviewing the btrfs migration bug I
helped track down this week. It only triggered with THP because THP
exercises compaction, and in turn migration, much more often than
upstream; it's rare for upstream to see any order > 4 allocation that
would exercise compaction and trip over the btrfs fs corruption. In the
end it really had nothing to do with THP, as I expected.

> We could pass gfp flags in I guess and abuse __GFP_NO_KSWAPD (from the THP
> series obviously)?

That would work for me... :)

> Yes, it's the "slower" path where we've already reclaimed pages and are
> more willing to wait for the compaction to occur, as the alternative is
> failing the allocation.

I've noticed, which is why I think it's equivalent to invoking the
second try_to_compact_pages with (fast_scan=false, sync=true) (and the
first of course with (fast_scan=true, sync=false)).

> I'll think about it more. I could just leave it at try_to_compact_pages
> doing the zonelist scan although it's not immediately occurring to me how I
> should decide between sync and async other than "async the first time and
> sync after that". The allocator does not have the same "reclaim priority"
> awareness that reclaim does.

I think the "migrate async & fast scan first, migrate sync and full
scan later" is a simpler heuristic we can do and I expect it to work
fine and equivalent (if not better).
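
In other words, something like this; both helper names are hypothetical
stand-ins for the allocator slow path and the reclaim/compaction entry
points, so treat it as an illustration of the ordering only:

/* Illustration of the ordering only; both helpers are hypothetical. */
static bool compact_and_check_watermarks(bool sync, bool fast_scan);
static void reclaim_order0_batch(unsigned int order);

static bool try_high_order_allocation(unsigned int order)
{
	/* Pass 1: optimistic - asynchronous migration and the fast scan */
	if (compact_and_check_watermarks(false, true))
		return true;

	/* Reclaim a batch of order-0 pages before trying harder */
	reclaim_order0_batch(order);

	/* Pass 2: expensive - synchronous migration, full zone scan */
	return compact_and_check_watermarks(true, false);
}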

I'm undecided whether it's worth running the hugepage page fault with
"async & fast scan always" by abusing __GFP_NO_KSWAPD or by adding a
__GFP_COMPACT_FAST. Of course it would mostly only make a difference
if the hugepage allocation has to fail often (like 95% of RAM in
hugepages with slab spread over 10% of RAM), so that is a corner case
that not everybody experiences... Probably not worth it.

Increasing nr_to_reclaim to 1<<order only when the compaction_suitable
checks are not satisfied and compaction becomes a no-op may also be
worth investigating (as long as there are enough cond_resched() calls
inside those loops ;). But hey, I'm not sure it's really needed...
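
Something like the sketch below, perhaps; it assumes compaction_suitable()
returns COMPACT_SKIPPED when a zone does not have enough free pages for
compaction to run, as in the reclaim/compaction patches, and
SWAP_CLUSTER_MAX is the usual reclaim batch:

static unsigned long compaction_nr_to_reclaim(struct zone *zone, int order)
{
	/*
	 * If compaction would be skipped for lack of free memory, reclaim
	 * a full aligned block so that compaction becomes worthwhile;
	 * otherwise the usual small batch is enough for it to make
	 * forward progress.
	 */
	if (compaction_suitable(zone, order) == COMPACT_SKIPPED)
		return 1UL << order;

	return SWAP_CLUSTER_MAX;
}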

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages()
  2010-11-18 18:50     ` Mel Gorman
@ 2010-11-18 19:08       ` Andrea Arcangeli
  2010-11-19 11:16         ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrea Arcangeli @ 2010-11-18 19:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 06:50:46PM +0000, Mel Gorman wrote:
> For THP in general, I think we can abuse __GFP_NO_KSWAPD. For other callers,
> I'm not sure it's fair to push the responsibility of async/sync to them. We
> don't do it for reclaim for example and I'd worry the wrong decisions would
> be made or that they'd always select async for "performance" and then bitch
> about an allocation failure.

Ok, let's leave out the __GFP flag and stick with the simplest approach
for now, without alloc_pages-caller knowledge.

> My only whinge about the lack of reclaimcompact_zone_order is that it
> makes it harder to even contemplate lumpy compaction in the future but
> it could always be reintroduced if absolutely necessary.

Ok. I don't know the plan for lumpy compaction, and that's probably why
I didn't appreciate it...

So my preference as usual would be to remove lumpy. BTW, everything up
to and including patch 3 should work fine with THP and solve my problem
with lumpy, thanks!

> GFP flags would be my last preference. 

Yep. I'm probably just too paranoid about keeping the hugepage
allocation low-latency, because I know it's the only spot where THP may
actually introduce a regression for short-lived tasks if we do too much
work to create the hugepage. OTOH, even for a short-lived allocation,
on my Westmere a bzero(1g) runs 250% faster (not 50% faster like on the
older hardware I was using) just thanks to the page being huge, and I'm
talking about a super-short-lived allocation here (the troublesome case
if we spend too much time in compaction and reclaim before failing).
Plus it only makes a difference when hugepages are spread across the
whole system and the workload is still doing purely short-lived
allocations. So again, let's worry about the GFP flag later if
needed... this is already a huge latency improvement (very appreciated)
compared to current upstream even without a GFP flag ;) as your .ps
files clearly show.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-18  9:20         ` Mel Gorman
@ 2010-11-18 19:49           ` Andrew Morton
  2010-11-19 10:48             ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2010-11-18 19:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	KOSAKI Motohiro, Rik van Riel, linux-mm, linux-kernel

On Thu, 18 Nov 2010 09:20:44 +0000
Mel Gorman <mel@csn.ul.ie> wrote:

> > It's because migration depends on MMU.  But we should be able to make
> > a NOMMU version of migration that just does page cache, which is all
> > that is reclaimable on NOMMU anyway.
> > 
> 
> Conceivably, but I see little problem leaving them with lumpy reclaim.

I see a really big problem: we'll need to maintain lumpy reclaim for
ever!

We keep on piling in more and more stuff, we're getting less sure that
the old stuff is still effective.  It's becoming more and more
important to move some of our attention over to simplification, and
to rejustification of earlier decisions.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-18 19:49           ` Andrew Morton
@ 2010-11-19 10:48             ` Mel Gorman
  2010-11-19 12:43               ` Theodore Tso
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-19 10:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Andrea Arcangeli,
	KOSAKI Motohiro, Rik van Riel, linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 11:49:28AM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 09:20:44 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > It's because migration depends on MMU.  But we should be able to make
> > > a NOMMU version of migration that just does page cache, which is all
> > > that is reclaimable on NOMMU anyway.
> > > 
> > 
> > Conceivably, but I see little problem leaving them with lumpy reclaim.
> 
> I see a really big problem: we'll need to maintain lumpy reclaim for
> ever!
> 

At least as long as !CONFIG_COMPACTION exists. That will be a while; bear
in mind that CONFIG_COMPACTION is disabled by default (although I believe
at least some distros are enabling it). Maybe we should choose to deprecate
it in 2.6.40 and delete it at the infamous time of 2.6.42? That would give
ample time to iron out any issues that crop up with reclaim/compaction
(what this series has turned into).

Bear in mind that lumpy reclaim is heavily isolated these days. The logic
is almost entirely contained in isolate_lru_pages() in the block starting
with the comment "Attempt to take all pages in the order aligned region
surrounding the tag page". As disruptive as lumpy reclaim is, it's basically
just a linear scanner at the end of the day and there are a few examples of
that in the kernel. If we break it, it'll be obvious.
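
For reference, the window that linear scan covers is just the
order-aligned block around the tag page, roughly as below (an
illustrative helper, not the exact isolate_lru_pages() code):

/* Illustrative only: the order-aligned region lumpy reclaim scans
 * around the tag page taken from the LRU. */
static void lumpy_scan_window(struct page *tag_page, int order,
			      unsigned long *start_pfn,
			      unsigned long *end_pfn)
{
	unsigned long pfn = page_to_pfn(tag_page);
	unsigned long mask = (1UL << order) - 1;

	*start_pfn = pfn & ~mask;		/* round down to the order boundary */
	*end_pfn = *start_pfn + (1UL << order);	/* scan one aligned block */
}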

> We keep on piling in more and more stuff, we're getting less sure that
> the old stuff is still effective. It's becoming more and more
> important to move some of our attention over to simplification, and
> to rejustification of earlier decisions.
> 

I'm open to its ultimate deletion but think it's rash to do on day 1 of
reclaim/compaction. I do recognise that I might be entirely on my own with
this opinion though :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start
  2010-11-18 18:46   ` Andrea Arcangeli
@ 2010-11-19 11:08     ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-19 11:08 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 07:46:59PM +0100, Andrea Arcangeli wrote:
> On Wed, Nov 17, 2010 at 04:22:48PM +0000, Mel Gorman wrote:
> > +	if (!cc->migrate_pfn)
> > +		cc->migrate_pfn = zone->zone_start_pfn;
> 
> wouldn't it remove a branch if the caller always set migrate_pfn?
> 

If try_to_compact_pages() used it, it would migrate old pages and then
later enter direct reclaim where it reclaimed the pages it just
migrated. It was double work. That said, I neglected to check if
migration affects the age of pages. Offhand, I think it does, so the
problem wouldn't apply and the reasoning was flawed.

> > +	if (!cc->free_pfn) {
> > +		cc->free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > +		cc->free_pfn &= ~(pageblock_nr_pages-1);
> > +	}
> 
> Who sets free_pfn to zero? Previously this was always initialized.
> 

It's initialised on the stack; the designated initialiser zeroes any
field that is not explicitly set, so free_pfn starts at zero unless a
caller fills it in.

> > @@ -523,7 +539,23 @@ unsigned long reclaimcompact_zone_order(struct zone *zone,
> >  	INIT_LIST_HEAD(&cc.freepages);
> >  	INIT_LIST_HEAD(&cc.migratepages);
> >  
> > -	return compact_zone(zone, &cc);
> > +	/* Get a hint on where to start compacting from the LRU */
> > +	anon_page = lru_to_page(&zone->lru[LRU_BASE + LRU_INACTIVE_ANON].list);
> > +	file_page = lru_to_page(&zone->lru[LRU_BASE + LRU_INACTIVE_FILE].list);
> > +	cc.migrate_pfn = min(page_to_pfn(anon_page), page_to_pfn(file_page));
> > +	cc.migrate_pfn = ALIGN(cc.migrate_pfn, pageblock_nr_pages);
> > +	start_migrate_pfn = cc.migrate_pfn;
> > +
> > +	ret = compact_zone(zone, &cc);
> > +
> > +	/* Restart migration from the start of zone if the hint did not work */
> > +	if (!zone_watermark_ok(zone, cc.order, low_wmark_pages(zone), 0, 0)) {
> > +		cc.migrate_pfn = 0;
> > +		cc.abort_migrate_pfn = start_migrate_pfn;
> > +		ret = compact_zone(zone, &cc);
> > +	}
> > +
> 
> I doubt it works ok if the list is empty... Maybe it's safer to
> validate the migrate_pfn against the zone pfn start/end before
> setting it in the migrate_pfn.
> 

You're right, this could be unsafe.
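
If the hint survives at all, it would have to look more like the sketch
below; the helper name is made up and zone->lru_lock is ignored here for
brevity:

/* Sketch only: take the LRU hint just when both inactive lists are
 * non-empty and the resulting pfn lies inside the zone. */
static unsigned long lru_migrate_hint(struct zone *zone)
{
	struct list_head *anon = &zone->lru[LRU_BASE + LRU_INACTIVE_ANON].list;
	struct list_head *file = &zone->lru[LRU_BASE + LRU_INACTIVE_FILE].list;
	unsigned long pfn;

	if (list_empty(anon) || list_empty(file))
		return 0;	/* no hint, scan from the start of the zone */

	pfn = min(page_to_pfn(lru_to_page(anon)),
		  page_to_pfn(lru_to_page(file)));
	pfn = ALIGN(pfn, pageblock_nr_pages);

	/* Only trust the hint if it falls within the zone */
	if (pfn < zone->zone_start_pfn ||
	    pfn >= zone->zone_start_pfn + zone->spanned_pages)
		return 0;

	return pfn;
}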

> Interesting that this heuristic slowed down the benchmark; it should
> lead to the exact opposite thanks to saving some CPU. So I guess maybe
> it's not worth it. I see it increases the compaction ratio a tiny bit,
> but if slightly better compaction comes at the expense of increased
> runtime I don't like it and I'd drop it... It's not making enough of a
> difference; further, we could extend it to check the "second" page in
> the list and so on... so we may as well just go blind. All that matters
> is that we use a clock algorithm, and I guess this heuristic screws
> that up, which is why it leads to the increased time.
> 

The variation was within the noise but yes, maybe this is not such a great
idea and the figures are not very compelling. I'm going to drop it from the
series.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages()
  2010-11-18 19:08       ` Andrea Arcangeli
@ 2010-11-19 11:16         ` Mel Gorman
  0 siblings, 0 replies; 34+ messages in thread
From: Mel Gorman @ 2010-11-19 11:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: KOSAKI Motohiro, Andrew Morton, Rik van Riel, Johannes Weiner,
	linux-mm, linux-kernel

On Thu, Nov 18, 2010 at 08:08:39PM +0100, Andrea Arcangeli wrote:
> On Thu, Nov 18, 2010 at 06:50:46PM +0000, Mel Gorman wrote:
> > For THP in general, I think we can abuse __GFP_NO_KSWAPD. For other callers,
> > I'm not sure it's fair to push the responsibility of async/sync to them. We
> > don't do it for reclaim for example and I'd worry the wrong decisions would
> > be made or that they'd always select async for "performance" and then bitch
> > about an allocation failure.
> 
> Ok, let's leave out the __GFP flag and stick with the simplest approach
> for now, without alloc_pages-caller knowledge.
> 

Ok.

> > My only whinge about the lack of reclaimcompact_zone_order is that it
> > makes it harder to even contemplate lumpy compaction in the future but
> > it could always be reintroduced if absolutely necessary.
> 
> Ok. I don't know the plan for lumpy compaction, and that's probably why
> I didn't appreciate it...
> 

You're not a mind-reader :). What it would buy is a reduction in
scanning rates, but there are other means of getting that which should
be considered too.

> So my preference as usual would be to remove lumpy. BTW, everything up
> to and including patch 3 should work fine with THP and solve my problem
> with lumpy, thanks!
> 

Great. I'd still like to push the rest of the series if it can be shown the
latencies decrease each time. It'll reduce the motivation for introducing
GFP flags to avoid compaction overhead.

> > GFP flags would be my last preference. 
> 
> Yep. I'm probably just too paranoid about keeping the hugepage
> allocation low-latency, because I know it's the only spot where THP may
> actually introduce a regression for short-lived tasks if we do too much
> work to create the hugepage. OTOH, even for a short-lived allocation,
> on my Westmere a bzero(1g) runs 250% faster (not 50% faster like on the
> older hardware I was using) just thanks to the page being huge, and I'm
> talking about a super-short-lived allocation here (the troublesome case
> if we spend too much time in compaction and reclaim before failing).
> Plus it only makes a difference when hugepages are spread across the
> whole system and the workload is still doing purely short-lived
> allocations. So again, let's worry about the GFP flag later if
> needed...

Sounds like a plan.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-19 10:48             ` Mel Gorman
@ 2010-11-19 12:43               ` Theodore Tso
  2010-11-19 14:05                 ` Mel Gorman
  0 siblings, 1 reply; 34+ messages in thread
From: Theodore Tso @ 2010-11-19 12:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel, linux-mm,
	linux-kernel


On Nov 19, 2010, at 5:48 AM, Mel Gorman wrote:

> At least as long as !CONFIG_COMPACTION exists. That will be a while; bear
> in mind that CONFIG_COMPACTION is disabled by default (although I believe
> at least some distros are enabling it). Maybe we should choose to deprecate
> it in 2.6.40 and delete it at the infamous time of 2.6.42? That would give
> ample time to iron out any issues that crop up with reclaim/compaction
> (what this series has turned into).

How about making it the default before 2.6.40, as an initial step?

-Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-19 12:43               ` Theodore Tso
@ 2010-11-19 14:05                 ` Mel Gorman
  2010-11-19 15:45                   ` Ted Ts'o
  0 siblings, 1 reply; 34+ messages in thread
From: Mel Gorman @ 2010-11-19 14:05 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel, linux-mm,
	linux-kernel

On Fri, Nov 19, 2010 at 07:43:02AM -0500, Theodore Tso wrote:
> 
> On Nov 19, 2010, at 5:48 AM, Mel Gorman wrote:
> 
> > At least as long as !CONFIG_COMPACTION exists. That will be a while; bear
> > in mind that CONFIG_COMPACTION is disabled by default (although I believe
> > at least some distros are enabling it). Maybe we should choose to deprecate
> > it in 2.6.40 and delete it at the infamous time of 2.6.42? That would give
> > ample time to iron out any issues that crop up with reclaim/compaction
> > (what this series has turned into).
> 
> How about making it the default before 2.6.40, as an initial step?
> 

It'd be a reasonable way of ensuring it's being tested everywhere, not
just by those who are interested or who use distro kernel configs.
I guess we'd set it to "default y" in the same patch that adds the note
to feature-removal-schedule.txt.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations
  2010-11-19 14:05                 ` Mel Gorman
@ 2010-11-19 15:45                   ` Ted Ts'o
  0 siblings, 0 replies; 34+ messages in thread
From: Ted Ts'o @ 2010-11-19 15:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrea Arcangeli, KOSAKI Motohiro, Rik van Riel, linux-mm,
	linux-kernel

On Fri, Nov 19, 2010 at 02:05:32PM +0000, Mel Gorman wrote:
> > 
> > How about making it the default before 2.6.40, as an initial step?
> > 
> 
> It'd be a reasonable way of ensuring it's being tested everywhere, not
> just by those who are interested or who use distro kernel configs.
> I guess we'd set it to "default y" in the same patch that adds the note
> to feature-removal-schedule.txt.

I'd suggest doing it now (or soon, before 2.6.40), just to make sure
there aren't massive complaints about performance regressions, etc.,
and then deprecating it at, say, 2.6.42, and waiting 6-9 months
before removing it.  But I'm a bit more conservative about making
such changes.

(Said the person who has reluctantly agreed to keep the minixdf mount
option after we found users when we tried deprecating it.  :-)

       	     	      	    	    	  - Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2010-11-19 15:46 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-17 16:22 [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Mel Gorman
2010-11-17 16:22 ` [PATCH 1/8] mm: compaction: Add trace events for memory compaction activity Mel Gorman
2010-11-17 16:22 ` [PATCH 2/8] mm: vmscan: Convert lumpy_mode into a bitmask Mel Gorman
2010-11-17 16:22 ` [PATCH 3/8] mm: vmscan: Reclaim order-0 and use compaction instead of lumpy reclaim Mel Gorman
2010-11-18 18:09   ` Andrea Arcangeli
2010-11-18 18:30     ` Mel Gorman
2010-11-17 16:22 ` [PATCH 4/8] mm: migration: Allow migration to operate asynchronously and avoid synchronous compaction in the faster path Mel Gorman
2010-11-18 18:21   ` Andrea Arcangeli
2010-11-18 18:34     ` Mel Gorman
2010-11-18 19:00       ` Andrea Arcangeli
2010-11-17 16:22 ` [PATCH 5/8] mm: migration: Cleanup migrate_pages API by matching types for offlining and sync Mel Gorman
2010-11-17 16:22 ` [PATCH 6/8] mm: compaction: Perform a faster scan in try_to_compact_pages() Mel Gorman
2010-11-18 18:34   ` Andrea Arcangeli
2010-11-18 18:50     ` Mel Gorman
2010-11-18 19:08       ` Andrea Arcangeli
2010-11-19 11:16         ` Mel Gorman
2010-11-17 16:22 ` [PATCH 7/8] mm: compaction: Use the LRU to get a hint on where compaction should start Mel Gorman
2010-11-18  9:10   ` KAMEZAWA Hiroyuki
2010-11-18  9:28     ` Mel Gorman
2010-11-18 18:46   ` Andrea Arcangeli
2010-11-19 11:08     ` Mel Gorman
2010-11-17 16:22 ` [PATCH 8/8] mm: vmscan: Rename lumpy_mode to reclaim_mode Mel Gorman
2010-11-17 23:46 ` [PATCH 0/8] Use memory compaction instead of lumpy reclaim during high-order allocations Andrew Morton
2010-11-18  2:03   ` Rik van Riel
2010-11-18  8:12   ` Mel Gorman
2010-11-18  8:26     ` KAMEZAWA Hiroyuki
2010-11-18  8:38       ` Johannes Weiner
2010-11-18  9:20         ` Mel Gorman
2010-11-18 19:49           ` Andrew Morton
2010-11-19 10:48             ` Mel Gorman
2010-11-19 12:43               ` Theodore Tso
2010-11-19 14:05                 ` Mel Gorman
2010-11-19 15:45                   ` Ted Ts'o
2010-11-18  8:44       ` Mel Gorman
