[PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2

* [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
@ 2010-09-15 12:27 ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

This is v2 of a series to reduce some of the latencies seen in page reclaim
and to improve the efficiency a bit.  There are a number of changes in this
revision. The first is to drop the patches avoiding writeback from direct
reclaim again. Wu asked me to look at a large number of his patches and I felt
it was best to do that independent of this series which should be relatively
uncontroversial. The second big change is to wait_iff_congested(). There
were a few complaints that the avoidance heuristic was way too fuzzy and
so I tried following Andrew's suggestion to take note of the return value
of bdi_write_congested() in may_write_to_queue() to identify when a zone
is congested.

Changelog since V2
  o Reshuffle patches to order from least to most controversial
  o Drop the patches dealing with writeback avoidance. Wu is working
    on some patches that potentially collide with this area so it
    will be revisited later
  o Use BDI congestion feedback in wait_iff_congested() instead of
    making a determination based on number of pages currently being
    written back
  o Do not use lock_page in pageout path
  o Rebase to 2.6.36-rc4

Changelog since V1
  o Fix mis-named function in documentation
  o Added reviewed and acked bys

There have been numerous reports of stalls that pointed at the problem being
somewhere in the VM. There are multiple roots to the problems which means
dealing with any of the root problems in isolation is tricky to justify on
their own and they would still need integration testing. This patch series
puts together two different patch sets which in combination should tackle
some of the root causes of latency problems being reported.

Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.

Patch 2 accounts for time spent in congestion_wait.

Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive and
trashes the system somewhat. As SLUB uses high-order allocations, a large
cost incurred by lumpy reclaim will be noticeable. It was also reported
during transparent hugepage support testing that lumpy reclaim was trashing
the system and these patches should mitigate that problem without disabling
lumpy reclaim.

Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait().
wait_iff_congested() only sleeps if there is a BDI that is currently congested.

Patch 8 notes that any BDI being congested is not necessarily a problem
because there could be multiple BDIs of varying speeds and numberous zones. It
attempts to track when a zone being reclaimed contains many pages backed
by a congested BDI and if so, reclaimers wait on the congestion queue.

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86:    Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64:  PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to
keep this report short. Four kernels were tested each based on v2.6.36-rc4

traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3:   Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested

nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO

The tests run were as follows

kernbench
	compile-based benchmark. Smoke test performance

sysbench
	OLTP read-only benchmark. Will be re-run in the future as read-write

micro-mapped-file-stream
	This is a micro-benchmark from Johannes Weiner that accesses a
	large sparse-file through mmap(). It was configured to run in only
	single-CPU mode but can be indicative of how well page reclaim
	identifies suitable pages.

stress-highalloc
	Tries to allocate huge pages under heavy load.

kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was reclaim
activity but there were no difference of major interest between the kernels.

X86-64 micro-mapped-file-stream

                                      traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)

These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher because
we are entering direct reclaim more often as a result of not sleeping in
congestion. In itself, it's not necessarily a bad thing. It's easier to
get a view of what happened from the vmscan tracepoint report.

FTrace Reclaim Statistics: vmscan

                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                                443        273        513       1568 
Direct reclaim pages scanned                305968     280402     600825     957933 
Direct reclaim pages reclaimed               43503      19005      30327     117191 
Direct reclaim write file async I/O              0          0          0          0 
Direct reclaim write anon async I/O              0          3          4         12 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        187649     132338     191695     267701 
Kswapd wakeups                                   3          1          4          1 
Kswapd pages scanned                       4599269    4454162    4296815    3891906 
Kswapd pages reclaimed                     2295947    2428434    2399818    2319706 
Kswapd reclaim write file async I/O              1          0          1          1 
Kswapd reclaim write anon async I/O             59        187         41        222 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96 
Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19 

Total pages scanned                        4905237   4734564   4897640   4849839
Total pages reclaimed                      2339450   2447439   2430145   2436897
%age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%

What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the same
and the ratio of pages scanned to pages reclaimed is more or less the same. In
other words, while we are sleeping less, reclaim is not doing more work and
as direct reclaim and kswapd is awake for less time, it would appear to be doing less work.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                87        196         64          0 
Direct time   congest     waited            4604ms     4732ms     5420ms        0ms 
Direct full   congest     waited                72        145         53          0 
Direct number conditional waited                 0          0        324       1315 
Direct time   conditional waited               0ms        0ms        0ms        0ms 
Direct full   conditional waited                 0          0          0          0 
KSwapd number congest     waited                20         10         15          7 
KSwapd time   congest     waited            1264ms      536ms      884ms      284ms 
KSwapd full   congest     waited                10          4          6          2 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 

The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76

Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.

PPC64 micro-mapped-file-stream
pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)

Similar trends to x86-64. allocstalls are up but it's not necessarily bad.

FTrace Reclaim Statistics: vmscan
Direct reclaims                                977       2709       2098       5136 
Direct reclaim pages scanned                629825     963814    1063938    1711935 
Direct reclaim pages reclaimed               75550     242538     150904     387647 
Direct reclaim write file async I/O              0          0          0          2 
Direct reclaim write anon async I/O              0         10          0          4 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        392119    1201712     571935     571921 
Kswapd wakeups                                   3          2          3          3 
Kswapd pages scanned                       4601307    4128076    3912317    3377165 
Kswapd pages reclaimed                     2432523    2318797    2312673    2144616 
Kswapd reclaim write file async I/O             20          1          1          1 
Kswapd reclaim write anon async I/O             57        132         11        121 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88 
Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90 

Total pages scanned                        5231132   5091890   4976255   5089100
Total pages reclaimed                      2508073   2561335   2463577   2532263
%age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%

Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed remains
roughly similar. The downside is that kswapd and direct reclaim was awake
longer and for a larger percentage of the overall workload. It's possible
there were big differences in the amount of time spent reclaiming slab
pages between the different kernels which is plausible considering that
the micro tests runs after fsmark and sysbench.

Trace Reclaim Statistics: congestion_wait
Direct number congest     waited               845       1312        104          0 
Direct time   congest     waited           19416ms    26560ms     7544ms        0ms 
Direct full   congest     waited               745       1105         72          0 
Direct number conditional waited                 0          0       1322       2935 
Direct time   conditional waited               0ms        0ms       12ms      312ms 
Direct full   conditional waited                 0          0          0          3 
KSwapd number congest     waited                39        102         75         63 
KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms 
KSwapd full   congest     waited                20         48         46         25 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 

The vanilla kernel spent 20 seconds asleep in direct reclaim and only 312ms
asleep with the patches.  The time kswapd spent congest waited was also
reduced by a large factor.

MMTests Statistics: duration
ser/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88

With all patches applies, the completion times are very similar.

X86-64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)

Success figures across the board are broadly similar.

                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Direct reclaims                               1045        944        886        887 
Direct reclaim pages scanned                135091     119604     109382     101019 
Direct reclaim pages reclaimed               88599      47535      47863      46671 
Direct reclaim write file async I/O            494        283        465        280 
Direct reclaim write anon async I/O          29357      13710      16656      13462 
Direct reclaim write file sync I/O             154          2          2          3 
Direct reclaim write anon sync I/O           14594        571        509        561 
Wake kswapd requests                          7491        933        872        892 
Kswapd wakeups                                 814        778        731        780 
Kswapd pages scanned                       7290822   15341158   11916436   13703442 
Kswapd pages reclaimed                     3587336    3142496    3094392    3187151 
Kswapd reclaim write file async I/O          91975      32317      28022      29628 
Kswapd reclaim write anon async I/O        1992022     789307     829745     849769 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07 
Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82 

Total pages scanned                        7425913  15460762  12025818  13804461
Total pages reclaimed                      3675935   3190031   3142255   3233822
%age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
%age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
%age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%

Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be more
efficient and indeed, the time spent in direct reclaim is much reduced by
the full series and kswapd spends a little less time awake.

Overall, indications here are that allocations were
happening much faster and this can be seen with a graph of
the latency figures as the allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited              1333        204        169          4 
Direct time   congest     waited           78896ms     8288ms     7260ms      200ms 
Direct full   congest     waited               756         92         69          2 
Direct number conditional waited                 0          0         26        186 
Direct time   conditional waited               0ms        0ms        0ms     2504ms 
Direct full   conditional waited                 0          0          0         25 
KSwapd number congest     waited                 4        395        227        282 
KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms 
KSwapd full   congest     waited                 3        232         98        176 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 
KSwapd full   conditional waited               318          0        312          9 

Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not.  Overall the sleep imes are reduced though - from 79ish seconds to
about 19.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39

With the full series, the time to complete the tests are reduced by 30%

PPC64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)

Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.

FTrace Reclaim Statistics: vmscan
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
Direct reclaims                                499        505        564        509 
Direct reclaim pages scanned                223478      41898      51818      45605 
Direct reclaim pages reclaimed              137730      21148      27161      23455 
Direct reclaim write file async I/O            399        136        162        136 
Direct reclaim write anon async I/O          46977       2865       4686       3998 
Direct reclaim write file sync I/O              29          0          1          3 
Direct reclaim write anon sync I/O           31023        159        237        239 
Wake kswapd requests                           420        351        360        326 
Kswapd wakeups                                 185        294        249        277 
Kswapd pages scanned                      15703488   16392500   17821724   17598737 
Kswapd pages reclaimed                     5808466    2908858    3139386    3145435 
Kswapd reclaim write file async I/O         159938      18400      18717      13473 
Kswapd reclaim write anon async I/O        3467554     228957     322799     234278 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23 
Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88 

Total pages scanned                       15926966  16434398  17873542  17644342
Total pages reclaimed                      5946196   2930006   3166547   3168890
%age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
%age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
%age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%

While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct reclaim
and with kswapd are massively reduced, mostly by the lowlumpy patches.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited               725        303        126          3 
Direct time   congest     waited           45524ms     9180ms     5936ms      300ms 
Direct full   congest     waited               487        190         52          3 
Direct number conditional waited                 0          0        200        301 
Direct time   conditional waited               0ms        0ms        0ms     1904ms 
Direct full   conditional waited                 0          0          0         19 
KSwapd number congest     waited                 0          2         23          4 
KSwapd time   congest     waited               0ms      200ms      420ms      404ms 
KSwapd full   congest     waited                 0          2          2          4 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 

Not as dramatic a story here but the time spent asleep is reduced and we can still
see what wait_iff_congested is going to sleep when necessary.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85

The time to complete this test goes way down. With the full series, we are allocating
over twice the number of huge pages in 30% of the time and there is a corresponding
impact on the allocation latency graph available at.

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps

I think this series is ready for much wider testing. The lowlumpy patches in
particular should be relatively uncontroversial. While their largest impact
can be seen in the high order stress tests, they would also have an impact
if SLUB was configured (these tests are based on slab) and stalls in lumpy
reclaim could be partially responsible for some desktop stalling reports.

The congestion_wait avoidance stuff was controversial in v1 because the
heuristic used to avoid the wait was a bit shaky. I'm expecting that this
version is more predictable.

 .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++-
 include/linux/backing-dev.h                        |    2 +-
 include/linux/mmzone.h                             |    8 +
 include/trace/events/vmscan.h                      |   44 ++++-
 include/trace/events/writeback.h                   |   35 +++
 mm/backing-dev.c                                   |   66 ++++++-
 mm/page_alloc.c                                    |    4 +-
 mm/vmscan.c                                        |  226 ++++++++++++++------
 8 files changed, 341 insertions(+), 83 deletions(-)

^ permalink raw reply	[flat|nested] 59+ messages in thread