* [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
@ 2010-09-15 12:27 ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

This is v2 of a series to reduce some of the latencies seen in page reclaim
and to improve the efficiency a bit.  There are a number of changes in this
revision. The first is to drop the patches avoiding writeback from direct
reclaim again. Wu asked me to look at a large number of his patches and I felt
it was best to do that independently of this series, which should be relatively
uncontroversial. The second big change is to wait_iff_congested(). There
were a few complaints that the avoidance heuristic was way too fuzzy and
so I tried following Andrew's suggestion to take note of the return value
of bdi_write_congested() in may_write_to_queue() to identify when a zone
is congested.

Changelog since V2
  o Reshuffle patches to order from least to most controversial
  o Drop the patches dealing with writeback avoidance. Wu is working
    on some patches that potentially collide with this area so it
    will be revisited later
  o Use BDI congestion feedback in wait_iff_congested() instead of
    making a determination based on number of pages currently being
    written back
  o Do not use lock_page in pageout path
  o Rebase to 2.6.36-rc4

Changelog since V1
  o Fix mis-named function in documentation
  o Added Reviewed-by and Acked-by tags

There have been numerous reports of stalls that pointed at the problem being
somewhere in the VM. The problems have multiple root causes, which makes it
tricky to justify fixing any one of them in isolation, and the individual
fixes would still need integration testing. This patch series puts together
two different patch sets which, in combination, should tackle some of the
root causes of the latency problems being reported.

Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important result is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim. For
example, in the X86-64 figures below the traceonly kernel reclaims 2339450
of the 4905237 pages it scans, a scanned/reclaimed ratio of about 48%; the
higher that percentage, the less scanning reclaim does per page it frees.

Patch 2 accounts for time spent in congestion_wait.

Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive and
thrashes the system somewhat. As SLUB uses high-order allocations, a large
cost incurred by lumpy reclaim will be noticeable. It was also reported
during transparent hugepage support testing that lumpy reclaim was thrashing
the system and these patches should mitigate that problem without disabling
lumpy reclaim.

Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait().
wait_iff_congested() only sleeps if there is a BDI that is currently congested.

Patch 8 notes that any BDI being congested is not necessarily a problem
because there could be multiple BDIs of varying speeds and numerous zones. It
attempts to track when a zone being reclaimed contains many pages backed
by a congested BDI and if so, reclaimers wait on the congestion queue.
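
As a very rough illustration of how patches 7 and 8 fit together (a toy
userspace model only; the helper names are made up and this is not the
kernel implementation), the backoff decision amounts to something like:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the real state: patch 8 flags a zone when reclaim keeps
 * meeting pages backed by congested BDIs, patch 7 consults the block
 * layer's congestion feedback.  Both names here are hypothetical. */
static bool zone_flagged_congested;
static bool some_bdi_congested;

/* wait_iff_congested() idea: back off only when both conditions hold,
 * otherwise return to reclaim immediately instead of always sleeping. */
static int reclaim_backoff_ms(int timeout_ms)
{
	if (!zone_flagged_congested || !some_bdi_congested)
		return 0;
	return timeout_ms;	/* sleep as congestion_wait() would */
}

int main(void)
{
	some_bdi_congested = true;
	printf("%dms\n", reclaim_backoff_ms(100));	/* 0ms: zone never flagged */
	zone_flagged_congested = true;
	printf("%dms\n", reclaim_backoff_ms(100));	/* 100ms: back off */
	return 0;
}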

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86:    Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64:  PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20 (/proc/sys/vm/dirty_ratio). I'm just going to report for X86-64 and
PPC64 in a vague attempt to keep this report short. Four kernels were tested,
each based on v2.6.36-rc4:

traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3:   Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested

The tests run were as follows

kernbench
	compile-based benchmark. Smoke test performance

sysbench
	OLTP read-only benchmark. Will be re-run in the future as read-write

micro-mapped-file-stream
	This is a micro-benchmark from Johannes Weiner that accesses a
	large sparse file through mmap(). It was configured to run in only
	single-CPU mode but can be indicative of how well page reclaim
	identifies suitable pages. A minimal sketch of this kind of access
	pattern is shown after this list.

stress-highalloc
	Tries to allocate huge pages under heavy load.
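
To make the micro-mapped-file-stream description above a little more
concrete, the general shape of such a test can be sketched in a few lines
of C. This is only an illustration of the access pattern, not Johannes'
actual benchmark; the file name, the 8G size and the one-byte-per-page read
are arbitrary choices here:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* sparse file larger than RAM (assumes a 64-bit build) */
	const size_t size = 8UL << 30;
	long page = sysconf(_SC_PAGESIZE);
	volatile char sum = 0;
	size_t off;
	char *map;
	int fd;

	fd = open("sparse.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || ftruncate(fd, size) < 0) {
		perror("setup");
		return 1;
	}

	map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Stream through the mapping, touching one byte per page.  The page
	 * cache fills with clean zero-filled pages and reclaim has to keep
	 * finding pages to discard for the walk to continue. */
	for (off = 0; off < size; off += page)
		sum += map[off];
	(void)sum;

	munmap(map, size);
	close(fd);
	unlink("sparse.dat");
	return 0;
}

Because the file is sparse, every mapped page is a clean zero-filled page,
so an efficient reclaimer should be able to free them without scanning much
else; that is why the scanned/reclaimed ratio is the interesting figure for
this test.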

kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was reclaim
activity, but there were no differences of major interest between the kernels.

X86-64 micro-mapped-file-stream

                                      traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)

These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher because
we are entering direct reclaim more often as a result of not sleeping in
congestion_wait(). In itself, that's not necessarily a bad thing. It's easier to
get a view of what happened from the vmscan tracepoint report.
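
(For reference, these counters can be snapshotted from /proc/vmstat before
and after a test and the deltas compared; the reader below is just a sketch
of that, nothing more.)

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val;

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	/* Print only the reclaim-related counters reported above. */
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (strncmp(name, "pgalloc_", 8) == 0 ||
		    strncmp(name, "pgsteal_", 8) == 0 ||
		    strncmp(name, "pgscan_", 7) == 0 ||
		    strcmp(name, "pageoutrun") == 0 ||
		    strcmp(name, "allocstall") == 0)
			printf("%s %llu\n", name, val);
	}
	fclose(f);
	return 0;
}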

FTrace Reclaim Statistics: vmscan

                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                                443        273        513       1568 
Direct reclaim pages scanned                305968     280402     600825     957933 
Direct reclaim pages reclaimed               43503      19005      30327     117191 
Direct reclaim write file async I/O              0          0          0          0 
Direct reclaim write anon async I/O              0          3          4         12 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        187649     132338     191695     267701 
Kswapd wakeups                                   3          1          4          1 
Kswapd pages scanned                       4599269    4454162    4296815    3891906 
Kswapd pages reclaimed                     2295947    2428434    2399818    2319706 
Kswapd reclaim write file async I/O              1          0          1          1 
Kswapd reclaim write anon async I/O             59        187         41        222 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96 
Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19 

Total pages scanned                        4905237   4734564   4897640   4849839
Total pages reclaimed                      2339450   2447439   2430145   2436897
%age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%

What is interesting here for the congestion-avoidance kernels in particular is
that while direct reclaim scans more pages, the overall number of pages scanned
remains much the same and the ratio of pages scanned to pages reclaimed is more
or less the same. In other words, while we are sleeping less, reclaim is not
doing more work and, as direct reclaim and kswapd are awake for less time, they
would appear to be doing less work.

FTrace Reclaim Statistics: congestion_wait
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct number congest     waited                87        196         64          0 
Direct time   congest     waited            4604ms     4732ms     5420ms        0ms 
Direct full   congest     waited                72        145         53          0 
Direct number conditional waited                 0          0        324       1315 
Direct time   conditional waited               0ms        0ms        0ms        0ms 
Direct full   conditional waited                 0          0          0          0 
KSwapd number congest     waited                20         10         15          7 
KSwapd time   congest     waited            1264ms      536ms      884ms      284ms 
KSwapd full   congest     waited                10          4          6          2 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 

The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76

Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.

PPC64 micro-mapped-file-stream
pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)

Similar trends to x86-64. allocstalls are up but it's not necessarily bad.

FTrace Reclaim Statistics: vmscan
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                                977       2709       2098       5136 
Direct reclaim pages scanned                629825     963814    1063938    1711935 
Direct reclaim pages reclaimed               75550     242538     150904     387647 
Direct reclaim write file async I/O              0          0          0          2 
Direct reclaim write anon async I/O              0         10          0          4 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        392119    1201712     571935     571921 
Kswapd wakeups                                   3          2          3          3 
Kswapd pages scanned                       4601307    4128076    3912317    3377165 
Kswapd pages reclaimed                     2432523    2318797    2312673    2144616 
Kswapd reclaim write file async I/O             20          1          1          1 
Kswapd reclaim write anon async I/O             57        132         11        121 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88 
Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90 

Total pages scanned                        5231132   5091890   4976255   5089100
Total pages reclaimed                      2508073   2561335   2463577   2532263
%age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%

Again, the trend is similar: the congestion_wait changes mean that direct
reclaim scans more pages, but the overall number of pages scanned, while
slightly reduced, is very similar. The scanned/reclaimed ratio also remains
roughly the same. The downside is that kswapd and direct reclaim were awake
longer and for a larger percentage of the overall workload. It's possible
there were big differences in the amount of time spent reclaiming slab
pages between the different kernels, which is plausible considering that
the micro test runs after fsmark and sysbench.

FTrace Reclaim Statistics: congestion_wait
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct number congest     waited               845       1312        104          0 
Direct time   congest     waited           19416ms    26560ms     7544ms        0ms 
Direct full   congest     waited               745       1105         72          0 
Direct number conditional waited                 0          0       1322       2935 
Direct time   conditional waited               0ms        0ms       12ms      312ms 
Direct full   conditional waited                 0          0          0          3 
KSwapd number congest     waited                39        102         75         63 
KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms 
KSwapd full   congest     waited                20         48         46         25 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 

The vanilla kernel spent 20 seconds asleep in direct reclaim and only 312ms
asleep with the patches.  The time kswapd spent congest waited was also
reduced by a large factor.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88

With all patches applied, the completion times are very similar.


X86-64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)

Success figures across the board are broadly similar.

                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                               1045        944        886        887 
Direct reclaim pages scanned                135091     119604     109382     101019 
Direct reclaim pages reclaimed               88599      47535      47863      46671 
Direct reclaim write file async I/O            494        283        465        280 
Direct reclaim write anon async I/O          29357      13710      16656      13462 
Direct reclaim write file sync I/O             154          2          2          3 
Direct reclaim write anon sync I/O           14594        571        509        561 
Wake kswapd requests                          7491        933        872        892 
Kswapd wakeups                                 814        778        731        780 
Kswapd pages scanned                       7290822   15341158   11916436   13703442 
Kswapd pages reclaimed                     3587336    3142496    3094392    3187151 
Kswapd reclaim write file async I/O          91975      32317      28022      29628 
Kswapd reclaim write anon async I/O        1992022     789307     829745     849769 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07 
Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82 

Total pages scanned                        7425913  15460762  12025818  13804461
Total pages reclaimed                      3675935   3190031   3142255   3233822
%age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
%age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
%age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%

Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ratio, there is an expectation that IO would be more
efficient and indeed, the time spent in direct reclaim is much reduced by
the full series and kswapd spends a little less time awake.

Overall, indications here are that allocations were
happening much faster and this can be seen with a graph of
the latency figures as the allocations were taking place:
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps

FTrace Reclaim Statistics: congestion_wait
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct number congest     waited              1333        204        169          4 
Direct time   congest     waited           78896ms     8288ms     7260ms      200ms 
Direct full   congest     waited               756         92         69          2 
Direct number conditional waited                 0          0         26        186 
Direct time   conditional waited               0ms        0ms        0ms     2504ms 
Direct full   conditional waited                 0          0          0         25 
KSwapd number congest     waited                 4        395        227        282 
KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms 
KSwapd full   congest     waited                 3        232         98        176 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 
KSwapd full   conditional waited               318          0        312          9 


Overall, the time spent sleeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. The sleep times are reduced though - from roughly 79 seconds to
about 19.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39

With the full series, the time to complete the tests is reduced by about 30%.

PPC64 STRESS-HIGHALLOC
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)

Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.

FTrace Reclaim Statistics: vmscan
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims                                499        505        564        509 
Direct reclaim pages scanned                223478      41898      51818      45605 
Direct reclaim pages reclaimed              137730      21148      27161      23455 
Direct reclaim write file async I/O            399        136        162        136 
Direct reclaim write anon async I/O          46977       2865       4686       3998 
Direct reclaim write file sync I/O              29          0          1          3 
Direct reclaim write anon sync I/O           31023        159        237        239 
Wake kswapd requests                           420        351        360        326 
Kswapd wakeups                                 185        294        249        277 
Kswapd pages scanned                      15703488   16392500   17821724   17598737 
Kswapd pages reclaimed                     5808466    2908858    3139386    3145435 
Kswapd reclaim write file async I/O         159938      18400      18717      13473 
Kswapd reclaim write anon async I/O        3467554     228957     322799     234278 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23 
Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88 

Total pages scanned                       15926966  16434398  17873542  17644342
Total pages reclaimed                      5946196   2930006   3166547   3168890
%age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
%age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
%age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%

While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct reclaim
and with kswapd awake is massively reduced, mostly by the lowlumpy patches.

FTrace Reclaim Statistics: congestion_wait
                                traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct number congest     waited               725        303        126          3 
Direct time   congest     waited           45524ms     9180ms     5936ms      300ms 
Direct full   congest     waited               487        190         52          3 
Direct number conditional waited                 0          0        200        301 
Direct time   conditional waited               0ms        0ms        0ms     1904ms 
Direct full   conditional waited                 0          0          0         19 
KSwapd number congest     waited                 0          2         23          4 
KSwapd time   congest     waited               0ms      200ms      420ms      404ms 
KSwapd full   congest     waited                 0          2          2          4 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0          0 


Not as dramatic a story here but the time spent asleep is reduced and we can
still see that wait_iff_congested() goes to sleep when necessary.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85

The time to complete this test goes way down. With the full series, we are allocating
over twice the number of huge pages in 30% of the time and there is a corresponding
impact on the allocation latency graph available at:

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps

I think this series is ready for much wider testing. The lowlumpy patches in
particular should be relatively uncontroversial. While their largest impact
can be seen in the high order stress tests, they would also have an impact
if SLUB was configured (these tests are based on slab) and stalls in lumpy
reclaim could be partially responsible for some desktop stalling reports.

The congestion_wait avoidance stuff was controversial in v1 because the
heuristic used to avoid the wait was a bit shaky. I'm expecting that this
version is more predictable.

 .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++-
 include/linux/backing-dev.h                        |    2 +-
 include/linux/mmzone.h                             |    8 +
 include/trace/events/vmscan.h                      |   44 ++++-
 include/trace/events/writeback.h                   |   35 +++
 mm/backing-dev.c                                   |   66 ++++++-
 mm/page_alloc.c                                    |    4 +-
 mm/vmscan.c                                        |  226 ++++++++++++++------
 8 files changed, 341 insertions(+), 83 deletions(-)



* [PATCH 1/8] tracing, vmscan: Add trace events for LRU list shrinking
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

This patch adds a trace event for shrink_inactive_list() and updates the
sample postprocessing script appropriately. It can be used to determine
how many pages were reclaimed and, for non-lumpy reclaim, where exactly the
pages were reclaimed from.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++++++++++++-----
 include/trace/events/vmscan.h                      |   42 ++++++++++++++++++++
 mm/vmscan.c                                        |    6 +++
 3 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 1b55146..b3e73dd 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -46,7 +46,7 @@ use constant HIGH_KSWAPD_LATENCY		=> 20;
 use constant HIGH_KSWAPD_REWAKEUP		=> 21;
 use constant HIGH_NR_SCANNED			=> 22;
 use constant HIGH_NR_TAKEN			=> 23;
-use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_RECLAIMED			=> 24;
 use constant HIGH_NR_CONTIG_DIRTY		=> 25;
 
 my %perprocesspid;
@@ -58,11 +58,13 @@ my $opt_read_procstat;
 my $total_wakeup_kswapd;
 my ($total_direct_reclaim, $total_direct_nr_scanned);
 my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_nr_reclaimed);
 my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
 my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
 my ($total_kswapd_nr_scanned, $total_kswapd_wake);
 my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
 my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
+my ($total_kswapd_nr_reclaimed);
 
 # Catch sigint and exit on request
 my $sigint_report = 0;
@@ -104,7 +106,7 @@ my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
 my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
 my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
 my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
-my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
 my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
 my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
 
@@ -203,8 +205,8 @@ $regex_lru_shrink_inactive = generate_traceevent_regex(
 			"vmscan/mm_vmscan_lru_shrink_inactive",
 			$regex_lru_shrink_inactive_default,
 			"nid", "zid",
-			"lru",
-			"nr_scanned", "nr_reclaimed", "priority");
+			"nr_scanned", "nr_reclaimed", "priority",
+			"flags");
 $regex_lru_shrink_active = generate_traceevent_regex(
 			"vmscan/mm_vmscan_lru_shrink_active",
 			$regex_lru_shrink_active_default,
@@ -375,6 +377,16 @@ EVENT_PROCESS:
 			my $nr_contig_dirty = $7;
 			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
 			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") {
+			$details = $5;
+			if ($details !~ /$regex_lru_shrink_inactive/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_shrink_inactive/o\n";
+				next;
+			}
+			my $nr_reclaimed = $4;
+			$perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed;
 		} elsif ($tracepoint eq "mm_vmscan_writepage") {
 			$details = $5;
 			if ($details !~ /$regex_writepage/o) {
@@ -464,8 +476,8 @@ sub dump_stats {
 
 	# Print out process activity
 	printf("\n");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s  %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s  %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Rclmed",  "Sync-IO", "ASync-IO",  "Stalled");
 	foreach $process_pid (keys %stats) {
 
 		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
@@ -475,6 +487,7 @@ sub dump_stats {
 		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
 		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
 		$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
 		$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
 		$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
@@ -489,11 +502,12 @@ sub dump_stats {
 			$index++;
 		}
 
-		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+		printf("%-" . $max_strlen . "s %8d %10d   %8u %8u  %8u %8u %8.3f",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
 			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{HIGH_NR_RECLAIMED},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
 			$this_reclaim_delay / 1000);
@@ -529,8 +543,8 @@ sub dump_stats {
 
 	# Print out kswapd activity
 	printf("\n");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed",  "Sync-IO", "ASync-IO");
 	foreach $process_pid (keys %stats) {
 
 		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
@@ -539,16 +553,18 @@ sub dump_stats {
 
 		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
 		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
 		$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
 		$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
 		$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
 		$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
-		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+		printf("%-" . $max_strlen . "s %8d %10d   %8u %8u  %8i %8u",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
 			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{HIGH_NR_RECLAIMED},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
 
@@ -579,6 +595,7 @@ sub dump_stats {
 	print "\nSummary\n";
 	print "Direct reclaims:     			$total_direct_reclaim\n";
 	print "Direct reclaim pages scanned:		$total_direct_nr_scanned\n";
+	print "Direct reclaim pages reclaimed:		$total_direct_nr_reclaimed\n";
 	print "Direct reclaim write file sync I/O:	$total_direct_writepage_file_sync\n";
 	print "Direct reclaim write anon sync I/O:	$total_direct_writepage_anon_sync\n";
 	print "Direct reclaim write file async I/O:	$total_direct_writepage_file_async\n";
@@ -588,6 +605,7 @@ sub dump_stats {
 	print "\n";
 	print "Kswapd wakeups:				$total_kswapd_wake\n";
 	print "Kswapd pages scanned:			$total_kswapd_nr_scanned\n";
+	print "Kswapd pages reclaimed:			$total_kswapd_nr_reclaimed\n";
 	print "Kswapd reclaim write file sync I/O:	$total_kswapd_writepage_file_sync\n";
 	print "Kswapd reclaim write anon sync I/O:	$total_kswapd_writepage_anon_sync\n";
 	print "Kswapd reclaim write file async I/O:	$total_kswapd_writepage_file_async\n";
@@ -612,6 +630,7 @@ sub aggregate_perprocesspid() {
 		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
 		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED};
 		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
 		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
 		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 370aa5a..ecf9521 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -10,6 +10,7 @@
 
 #define RECLAIM_WB_ANON		0x0001u
 #define RECLAIM_WB_FILE		0x0002u
+#define RECLAIM_WB_MIXED	0x0010u
 #define RECLAIM_WB_SYNC		0x0004u
 #define RECLAIM_WB_ASYNC	0x0008u
 
@@ -17,6 +18,7 @@
 	(flags) ? __print_flags(flags, "|",			\
 		{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
 		{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"},	\
+		{RECLAIM_WB_MIXED,	"RECLAIM_WB_MIXED"},	\
 		{RECLAIM_WB_SYNC,	"RECLAIM_WB_SYNC"},	\
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
@@ -26,6 +28,12 @@
 	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
+#define trace_shrink_flags(file, sync) ( \
+	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
+	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	)
+
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 	TP_PROTO(int nid),
@@ -269,6 +277,40 @@ TRACE_EVENT(mm_vmscan_writepage,
 		show_reclaim_flags(__entry->reclaim_flags))
 );
 
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+
+	TP_PROTO(int nid, int zid,
+			unsigned long nr_scanned, unsigned long nr_reclaimed,
+			int priority, int reclaim_flags),
+
+	TP_ARGS(nid, zid, nr_scanned, nr_reclaimed, priority, reclaim_flags),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, zid)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_reclaimed)
+		__field(int, priority)
+		__field(int, reclaim_flags)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->zid = zid;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_reclaimed = nr_reclaimed;
+		__entry->priority = priority;
+		__entry->reclaim_flags = reclaim_flags;
+	),
+
+	TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+		__entry->nid, __entry->zid,
+		__entry->nr_scanned, __entry->nr_reclaimed,
+		__entry->priority,
+		show_reclaim_flags(__entry->reclaim_flags))
+);
+
+
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..652650f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,6 +1359,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+
+	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
+		zone_idx(zone),
+		nr_scanned, nr_reclaimed,
+		priority,
+		trace_shrink_flags(file, sc->lumpy_reclaim_mode));
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 2/8] writeback: Account for time spent congestion_waited
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

There is strong evidence to indicate that a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a tracepoint
for congestion_wait() to record when it was called, how long the timeout
was and how long it actually slept.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/trace/events/writeback.h |   28 ++++++++++++++++++++++++++++
 mm/backing-dev.c                 |    5 +++++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..275d477 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,34 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+DECLARE_EVENT_CLASS(writeback_congest_waited_template,
+
+	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+	TP_ARGS(usec_timeout, usec_delayed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	usec_timeout	)
+		__field(	unsigned int,	usec_delayed	)
+	),
+
+	TP_fast_assign(
+		__entry->usec_timeout	= usec_timeout;
+		__entry->usec_delayed	= usec_delayed;
+	),
+
+	TP_printk("usec_timeout=%u usec_delayed=%u",
+			__entry->usec_timeout,
+			__entry->usec_delayed)
+);
+
+DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
+
+	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+	TP_ARGS(usec_timeout, usec_delayed)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c2bf86f..e891794 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -762,12 +762,17 @@ EXPORT_SYMBOL(set_bdi_congested);
 long congestion_wait(int sync, long timeout)
 {
 	long ret;
+	unsigned long start = jiffies;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
+
+	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
+					jiffies_to_usecs(jiffies - start));
+
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 3/8] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

congestion_wait() means "wait until queue congestion is cleared".  However,
synchronous lumpy reclaim does not need this congestion_wait() because
shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback(),
which provides the necessary waiting.
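
As a condensed, illustrative sketch of the stall path this refers to (not the
exact shrink_inactive_list() code):

	/* Check if we should synchronously wait for writeback */
	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
		/*
		 * congestion_wait(BLK_RW_ASYNC, HZ/10) used to sit here; it is
		 * redundant because the synchronous pass below already
		 * throttles the caller.
		 */
		nr_reclaimed += shrink_page_list(&page_list, sc,
						 PAGEOUT_IO_SYNC);
		/*
		 * shrink_page_list(PAGEOUT_IO_SYNC) calls
		 * wait_on_page_writeback() on each page still under
		 * writeback, so the caller waits on the IO it issued.
		 */
	}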

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/vmscan.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 652650f..e8b5224 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,8 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
-
 		/*
 		 * The attempt at page out may have made some
 		 * of the pages active, mark them inactive again.
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 4/8] vmscan: Narrow the scenarios in which lumpy reclaim uses synchronous reclaim
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as

  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask doesn't have GFP_IO or GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() started IO but it has not finished

During lumpy reclaim, all of these failures result in entering synchronous
lumpy reclaim, but this can be unnecessary.  In cases (2), (3), (5), (6), (7)
and (8), there is no point retrying.  This patch causes lumpy reclaim to abort
when it is known it will fail.

Case (9) is more interesting. The current behavior is:
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again

So it is a meaningless waste of time. However, just skipping page reclaim is
also not good because, for example, allocating a huge page on x86 needs 512
pages (2MB with a 4KB base page), which can be more dirty pages than the
queue congestion threshold (~=128).

After this patch, pageout() behaves as follows (a condensed sketch follows
the list):

 - If order > PAGE_ALLOC_COSTLY_ORDER
	always ignore queue congestion.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip writing the page and disable lumpy reclaim.
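
As a condensed sketch of that decision (illustrative only; the real changes
are spread across the hunks below):

	/* Inside pageout(), after this patch: */
	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
		/*
		 * Queue is congested and sc->order is at most
		 * PAGE_ALLOC_COSTLY_ORDER: skip the write and give up on
		 * lumpy reclaim for this pass.
		 */
		disable_lumpy_reclaim_mode(sc);
		return PAGE_KEEP;
	}

	/*
	 * may_write_to_queue() itself returns 1 when
	 * sc->order > PAGE_ALLOC_COSTLY_ORDER, i.e. queue congestion is
	 * ignored for these large allocations.
	 */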

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/trace/events/vmscan.h |    6 +-
 mm/vmscan.c                   |  120 +++++++++++++++++++++++++---------------
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index ecf9521..c255fcc 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -25,13 +25,13 @@
 
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
 #define trace_shrink_flags(file, sync) ( \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
 			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
 	)
 
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8b5224..b352b92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+enum lumpy_mode {
+	LUMPY_MODE_NONE,
+	LUMPY_MODE_ASYNC,
+	LUMPY_MODE_SYNC,
+};
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -82,7 +88,7 @@ struct scan_control {
 	 * Intend to reclaim enough contenious memory rather than to reclaim
 	 * enough amount memory. I.e, it's the mode for high order allocation.
 	 */
-	bool lumpy_reclaim_mode;
+	enum lumpy_mode lumpy_reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -265,6 +271,36 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	return ret;
 }
 
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+				   bool sync)
+{
+	enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+
+	/*
+	 * Some reclaim have alredy been failed. No worth to try synchronous
+	 * lumpy reclaim.
+	 */
+	if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+		return;
+
+	/*
+	 * If we need a large contiguous chunk of memory, or have
+	 * trouble getting a small set of contiguous pages, we
+	 * will reclaim both active and inactive pages.
+	 */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		sc->lumpy_reclaim_mode = mode;
+	else if (sc->order && priority < DEF_PRIORITY - 2)
+		sc->lumpy_reclaim_mode = mode;
+	else
+		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
+static void disable_lumpy_reclaim_mode(struct scan_control *sc)
+{
+	sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
 static inline int is_page_cache_freeable(struct page *page)
 {
 	/*
@@ -275,7 +311,8 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi,
+			      struct scan_control *sc)
 {
 	if (current->flags & PF_SWAPWRITE)
 		return 1;
@@ -283,6 +320,10 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
 		return 1;
 	if (bdi == current->backing_dev_info)
 		return 1;
+
+	/* lumpy reclaim for hugepage often need a lot of write */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
 	return 0;
 }
 
@@ -307,12 +348,6 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
-/* Request for sync pageout. */
-enum pageout_io {
-	PAGEOUT_IO_ASYNC,
-	PAGEOUT_IO_SYNC,
-};
-
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -330,7 +365,7 @@ typedef enum {
  * Calls ->writepage().
  */
 static pageout_t pageout(struct page *page, struct address_space *mapping,
-						enum pageout_io sync_writeback)
+			 struct scan_control *sc)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -366,8 +401,10 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
+		disable_lumpy_reclaim_mode(sc);
 		return PAGE_KEEP;
+	}
 
 	if (clear_page_dirty_for_io(page)) {
 		int res;
@@ -394,7 +431,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		 * direct reclaiming a large contiguous area and the
 		 * first attempt to free a range of pages fails.
 		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		if (PageWriteback(page) &&
+		    sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
 			wait_on_page_writeback(page);
 
 		if (!PageWriteback(page)) {
@@ -402,7 +440,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
-			trace_reclaim_flags(page, sync_writeback));
+			trace_reclaim_flags(page, sc->lumpy_reclaim_mode));
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -580,7 +618,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim_mode)
+	if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -644,8 +682,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+				      struct scan_control *sc)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -694,10 +731,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+			if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
+			    may_enter_fs)
 				wait_on_page_writeback(page);
-			else
-				goto keep_locked;
+			else {
+				unlock_page(page);
+				goto keep_lumpy;
+			}
 		}
 
 		references = page_check_references(page, sc);
@@ -751,14 +791,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
-			switch (pageout(page, mapping, sync_writeback)) {
+			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
 				goto activate_locked;
 			case PAGE_SUCCESS:
-				if (PageWriteback(page) || PageDirty(page))
+				if (PageWriteback(page))
+					goto keep_lumpy;
+				if (PageDirty(page))
 					goto keep;
+
 				/*
 				 * A synchronous write - probably a ramdisk.  Go
 				 * ahead and try to reclaim the page.
@@ -841,6 +884,7 @@ cull_mlocked:
 			try_to_free_swap(page);
 		unlock_page(page);
 		putback_lru_page(page);
+		disable_lumpy_reclaim_mode(sc);
 		continue;
 
 activate_locked:
@@ -853,6 +897,8 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
+		disable_lumpy_reclaim_mode(sc);
+keep_lumpy:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
@@ -1253,7 +1299,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 		return false;
 
 	/* Only stall on lumpy reclaim */
-	if (!sc->lumpy_reclaim_mode)
+	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
 		return false;
 
 	/* If we have relaimed everything on the isolated list, no stall */
@@ -1298,15 +1344,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			return SWAP_CLUSTER_MAX;
 	}
 
-
+	set_lumpy_reclaim_mode(priority, sc, false);
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode ?
-				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, 0, file);
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
@@ -1318,8 +1364,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode ?
-				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, sc->mem_cgroup,
 			0, file);
 		/*
@@ -1337,7 +1383,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1348,7 +1394,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+		set_lumpy_reclaim_mode(priority, sc, true);
+		nr_reclaimed += shrink_page_list(&page_list, sc);
 	}
 
 	local_irq_disable();
@@ -1725,21 +1772,6 @@ out:
 	}
 }
 
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
-{
-	/*
-	 * If we need a large contiguous chunk of memory, or have
-	 * trouble getting a small set of contiguous pages, we
-	 * will reclaim both active and inactive pages.
-	 */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		sc->lumpy_reclaim_mode = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
-		sc->lumpy_reclaim_mode = 1;
-	else
-		sc->lumpy_reclaim_mode = 0;
-}
-
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -1754,8 +1786,6 @@ static void shrink_zone(int priority, struct zone *zone,
 
 	get_scan_count(zone, sc, nr, priority);
 
-	set_lumpy_reclaim_mode(priority, sc);
-
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 4/8] vmscan: Narrow the scenarios lumpy reclaim uses synchrounous reclaim
@ 2010-09-15 12:27   ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as

  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() start IO, but not finished

When lumpy reclaim, all of failure result in entering synchronous lumpy
reclaim but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and
(8), there is no point retrying.  This patch causes lumpy reclaim to abort
when it is known it will fail.

Case (9) is more interesting. current behavior is,
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again

So, it's meaningless time wasting. However, just skipping page reclaim is
also not a good as as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).

After this patch, pageout() behaves as follows;

 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	skip write page and disable lumpy reclaim.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/trace/events/vmscan.h |    6 +-
 mm/vmscan.c                   |  120 +++++++++++++++++++++++++---------------
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index ecf9521..c255fcc 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -25,13 +25,13 @@
 
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
 #define trace_shrink_flags(file, sync) ( \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
 			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
 	)
 
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e8b5224..b352b92 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+enum lumpy_mode {
+	LUMPY_MODE_NONE,
+	LUMPY_MODE_ASYNC,
+	LUMPY_MODE_SYNC,
+};
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -82,7 +88,7 @@ struct scan_control {
 	 * Intend to reclaim enough contenious memory rather than to reclaim
 	 * enough amount memory. I.e, it's the mode for high order allocation.
 	 */
-	bool lumpy_reclaim_mode;
+	enum lumpy_mode lumpy_reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -265,6 +271,36 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	return ret;
 }
 
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+				   bool sync)
+{
+	enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+
+	/*
+	 * Some reclaim have alredy been failed. No worth to try synchronous
+	 * lumpy reclaim.
+	 */
+	if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+		return;
+
+	/*
+	 * If we need a large contiguous chunk of memory, or have
+	 * trouble getting a small set of contiguous pages, we
+	 * will reclaim both active and inactive pages.
+	 */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		sc->lumpy_reclaim_mode = mode;
+	else if (sc->order && priority < DEF_PRIORITY - 2)
+		sc->lumpy_reclaim_mode = mode;
+	else
+		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
+static void disable_lumpy_reclaim_mode(struct scan_control *sc)
+{
+	sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
 static inline int is_page_cache_freeable(struct page *page)
 {
 	/*
@@ -275,7 +311,8 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi,
+			      struct scan_control *sc)
 {
 	if (current->flags & PF_SWAPWRITE)
 		return 1;
@@ -283,6 +320,10 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
 		return 1;
 	if (bdi == current->backing_dev_info)
 		return 1;
+
+	/* lumpy reclaim for hugepage often need a lot of write */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
 	return 0;
 }
 
@@ -307,12 +348,6 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
-/* Request for sync pageout. */
-enum pageout_io {
-	PAGEOUT_IO_ASYNC,
-	PAGEOUT_IO_SYNC,
-};
-
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -330,7 +365,7 @@ typedef enum {
  * Calls ->writepage().
  */
 static pageout_t pageout(struct page *page, struct address_space *mapping,
-						enum pageout_io sync_writeback)
+			 struct scan_control *sc)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -366,8 +401,10 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
+		disable_lumpy_reclaim_mode(sc);
 		return PAGE_KEEP;
+	}
 
 	if (clear_page_dirty_for_io(page)) {
 		int res;
@@ -394,7 +431,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		 * direct reclaiming a large contiguous area and the
 		 * first attempt to free a range of pages fails.
 		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		if (PageWriteback(page) &&
+		    sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
 			wait_on_page_writeback(page);
 
 		if (!PageWriteback(page)) {
@@ -402,7 +440,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
-			trace_reclaim_flags(page, sync_writeback));
+			trace_reclaim_flags(page, sc->lumpy_reclaim_mode));
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -580,7 +618,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim_mode)
+	if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -644,8 +682,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+				      struct scan_control *sc)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -694,10 +731,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+			if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
+			    may_enter_fs)
 				wait_on_page_writeback(page);
-			else
-				goto keep_locked;
+			else {
+				unlock_page(page);
+				goto keep_lumpy;
+			}
 		}
 
 		references = page_check_references(page, sc);
@@ -751,14 +791,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
-			switch (pageout(page, mapping, sync_writeback)) {
+			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
 				goto activate_locked;
 			case PAGE_SUCCESS:
-				if (PageWriteback(page) || PageDirty(page))
+				if (PageWriteback(page))
+					goto keep_lumpy;
+				if (PageDirty(page))
 					goto keep;
+
 				/*
 				 * A synchronous write - probably a ramdisk.  Go
 				 * ahead and try to reclaim the page.
@@ -841,6 +884,7 @@ cull_mlocked:
 			try_to_free_swap(page);
 		unlock_page(page);
 		putback_lru_page(page);
+		disable_lumpy_reclaim_mode(sc);
 		continue;
 
 activate_locked:
@@ -853,6 +897,8 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
+		disable_lumpy_reclaim_mode(sc);
+keep_lumpy:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
@@ -1253,7 +1299,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 		return false;
 
 	/* Only stall on lumpy reclaim */
-	if (!sc->lumpy_reclaim_mode)
+	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
 		return false;
 
 	/* If we have relaimed everything on the isolated list, no stall */
@@ -1298,15 +1344,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			return SWAP_CLUSTER_MAX;
 	}
 
-
+	set_lumpy_reclaim_mode(priority, sc, false);
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode ?
-				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, 0, file);
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
@@ -1318,8 +1364,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode ?
-				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, sc->mem_cgroup,
 			0, file);
 		/*
@@ -1337,7 +1383,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1348,7 +1394,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+		set_lumpy_reclaim_mode(priority, sc, true);
+		nr_reclaimed += shrink_page_list(&page_list, sc);
 	}
 
 	local_irq_disable();
@@ -1725,21 +1772,6 @@ out:
 	}
 }
 
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
-{
-	/*
-	 * If we need a large contiguous chunk of memory, or have
-	 * trouble getting a small set of contiguous pages, we
-	 * will reclaim both active and inactive pages.
-	 */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		sc->lumpy_reclaim_mode = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
-		sc->lumpy_reclaim_mode = 1;
-	else
-		sc->lumpy_reclaim_mode = 0;
-}
-
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -1754,8 +1786,6 @@ static void shrink_zone(int priority, struct zone *zone,
 
 	get_scan_count(zone, sc, nr, priority);
 
-	set_lumpy_reclaim_mode(priority, sc);
-
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
-- 
1.7.1
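
The hunk above introduces two distinct "keep" exits from the page loop. A
condensed view of the resulting label block (an editorial sketch of the code
after this patch, not an additional change):

	keep_locked:
		unlock_page(page);
	keep:
		/* giving up on this page also gives up on lumpy reclaim */
		disable_lumpy_reclaim_mode(sc);
	keep_lumpy:
		/* keep the page but leave the lumpy reclaim mode untouched */
		list_add(&page->lru, &ret_pages);
		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));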


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 5/8] vmscan: Remove dead code in shrink_inactive_list()
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

After synchronous lumpy reclaim, the page_list is guaranteed not to contain
active pages because page activation in shrink_page_list() disables lumpy
reclaim. Remove the dead code.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/vmscan.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b352b92..00075f3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1332,7 +1332,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
-	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
@@ -1387,13 +1386,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
-
 		set_lumpy_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, sc);
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 6/8] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
neighbouring pages of the eviction page. The neighbour search does not stop
even when a neighbour cannot be isolated, which is wasted effort because
lumpy reclaim can no longer result in a successful higher-order allocation.
This patch stops the PFN-based neighbour search when an isolation fails and
moves on to the next block.
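
After this change, the neighbour scan looks roughly like the following (a
simplified sketch of the hunk below; the lumpy statistics and unlikely()
annotations are omitted, and start_pfn/nr_taken are assumed to come from the
surrounding function):

	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
		struct page *cursor_page = pfn_to_page(pfn);

		/* any failure now aborts the whole neighbour search */
		if (page_zone_id(cursor_page) != zone_id)
			break;
		if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
		    !PageSwapCache(cursor_page))
			break;

		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
			list_move(&cursor_page->lru, dst);
			nr_taken++;
		} else {
			/* an already-freed page is merely skipped */
			if (!page_count(cursor_page))
				continue;
			break;
		}
	}

	/* breaking out early means the contiguous block cannot be formed */
	if (pfn < end_pfn)
		nr_lumpy_failed++;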

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/vmscan.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 00075f3..2836913 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1052,7 +1052,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 			/* Check that we have not crossed a zone boundary. */
 			if (unlikely(page_zone_id(cursor_page) != zone_id))
-				continue;
+				break;
 
 			/*
 			 * If we don't have enough swap space, reclaiming of
@@ -1060,8 +1060,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			 * pointless.
 			 */
 			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
-					!PageSwapCache(cursor_page))
-				continue;
+			    !PageSwapCache(cursor_page))
+				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
 				list_move(&cursor_page->lru, dst);
@@ -1072,11 +1072,16 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 					nr_lumpy_dirty++;
 				scan++;
 			} else {
-				if (mode == ISOLATE_BOTH &&
-						page_count(cursor_page))
-					nr_lumpy_failed++;
+				/* the page is freed already. */
+				if (!page_count(cursor_page))
+					continue;
+				break;
 			}
 		}
+
+		/* If we break out of the loop above, lumpy reclaim failed */
+		if (pfn < end_pfn)
+			nr_lumpy_failed++;
 	}
 
 	*scanned = scan;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 7/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

If congestion_wait() is called with no BDI congested, the caller will sleep
for the full timeout and this may be an unnecessary sleep. This patch adds
wait_iff_congested(), which checks for congestion and only sleeps if a BDI is
congested; otherwise, it calls cond_resched() to ensure the caller is not
hogging the CPU for longer than its quota, but it will not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but after
this patch, it'll just call schedule() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.
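
As a rough illustration of the new call (an editorial sketch; the helper and
its name are hypothetical, only wait_iff_congested() itself comes from this
patch), the return value tells a reclaimer whether any time was actually
spent in the call:

	#include <linux/backing-dev.h>

	/* back off only while some BDI is genuinely congested */
	static bool throttle_if_congested(void)
	{
		long timeout = HZ / 10;
		long remaining;

		/*
		 * Sleeps for up to @timeout only if a BDI is congested;
		 * with no congestion this boils down to cond_resched().
		 */
		remaining = wait_iff_congested(BLK_RW_ASYNC, timeout);

		/* remaining == timeout implies the call did not sleep */
		return remaining != timeout;
	}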

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/backing-dev.h      |    2 +-
 include/trace/events/writeback.h |    7 +++++
 mm/backing-dev.c                 |   54 ++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c                  |    4 +-
 4 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..72bb510 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -285,7 +285,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-
+long wait_iff_congested(int sync, long timeout);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 275d477..eeaf1f5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
+
+	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+	TP_ARGS(usec_timeout, usec_delayed)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e891794..3caf679 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -727,6 +727,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
+static atomic_t nr_bdi_congested[2];
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -734,7 +735,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	clear_bit(bit, &bdi->state);
+	if (test_and_clear_bit(bit, &bdi->state))
+		atomic_dec(&nr_bdi_congested[sync]);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
@@ -746,7 +748,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	enum bdi_state bit;
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	set_bit(bit, &bdi->state);
+	if (!test_and_set_bit(bit, &bdi->state))
+		atomic_inc(&nr_bdi_congested[sync]);
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
@@ -777,3 +780,50 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/**
+ * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
+ * @sync: SYNC or ASYNC IO
+ * @timeout: timeout in jiffies
+ *
+ * In the event of a congested backing_dev (any backing_dev), this waits for up
+ * to @timeout jiffies for either a BDI to exit congestion of the given @sync
+ * queue.
+ *
+ * If there is no congestion, then cond_resched() is called to yield the
+ * processor if necessary but otherwise does not sleep.
+ *
+ * The return value is 0 if the sleep is for the full timeout. Otherwise,
+ * it is the number of jiffies that were still remaining when the function
+ * returned. return_value == timeout implies the function did not sleep.
+ */
+long wait_iff_congested(int sync, long timeout)
+{
+	long ret;
+	unsigned long start = jiffies;
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	/* If there is no congestion, yield if necessary instead of sleeping */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
+		cond_resched();
+
+		/* In case we scheduled, work out time remaining */
+		ret = timeout - (jiffies - start);
+		if (ret < 0)
+			ret = 0;
+
+		goto out;
+	}
+
+	/* Sleep until uncongested or a write happens */
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = io_schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+
+out:
+	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
+					jiffies_to_usecs(jiffies - start));
+
+	return ret;
+}
+EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a8cfa9c..9b66c75 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1906,7 +1906,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			wait_iff_congested(BLK_RW_ASYNC, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -2094,7 +2094,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		wait_iff_congested(BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-15 12:27   ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-15 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Mel Gorman

If wait_iff_congested() is called with no BDI congested, the function simply
calls cond_resched(). In the event there is significant writeback happening
in the zone that is being reclaimed, this can be a poor decision as reclaim
would succeed once writeback was completed. Without any backoff logic,
younger clean pages can be reclaimed resulting in more reclaim overall and
poor performance.

This patch tracks how many pages backed by a congested BDI were found during
scanning. If all the dirty pages encountered on a list isolated from the
LRU belong to a congested BDI, the zone is marked congested until the zone
reaches the high watermark.  wait_iff_congested() then checks both the
number of congested BDIs and whether the current zone is one that has
encountered congestion recently; if so, it will sleep on the congestion
queue. Otherwise it will call cond_resched() to yield the processor if
necessary.

The end result is that waiting on the congestion queue is avoided when it
is not necessary, but when significant congestion is being encountered,
reclaimers and page allocators will back off.
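
Put differently, the sleep decision in the wait_iff_congested() change below
boils down to the following predicate (an editorial condensation as it would
read inside mm/backing-dev.c; should_sleep_on_congestion() is not a real
function):

	/* sleep only if some BDI is congested AND this zone saw congestion */
	static bool should_sleep_on_congestion(struct zone *zone, int sync)
	{
		return atomic_read(&nr_bdi_congested[sync]) != 0 &&
		       zone_is_reclaim_congested(zone);
	}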

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/backing-dev.h |    2 +-
 include/linux/mmzone.h      |    8 ++++
 mm/backing-dev.c            |   23 ++++++++----
 mm/page_alloc.c             |    4 +-
 mm/vmscan.c                 |   83 +++++++++++++++++++++++++++++++++++++------
 5 files changed, 98 insertions(+), 22 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 72bb510..f1b402a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -285,7 +285,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
+long wait_iff_congested(struct zone *zone, int sync, long timeout);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..747384a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -421,6 +421,9 @@ struct zone {
 typedef enum {
 	ZONE_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
 	ZONE_OOM_LOCKED,		/* zone is in OOM killer zonelist */
+	ZONE_CONGESTED,			/* zone has many dirty pages backed by
+					 * a congested BDI
+					 */
 } zone_flags_t;
 
 static inline void zone_set_flag(struct zone *zone, zone_flags_t flag)
@@ -438,6 +441,11 @@ static inline void zone_clear_flag(struct zone *zone, zone_flags_t flag)
 	clear_bit(flag, &zone->flags);
 }
 
+static inline int zone_is_reclaim_congested(const struct zone *zone)
+{
+	return test_bit(ZONE_CONGESTED, &zone->flags);
+}
+
 static inline int zone_is_reclaim_locked(const struct zone *zone)
 {
 	return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3caf679..c34df85 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -782,29 +782,36 @@ EXPORT_SYMBOL(congestion_wait);
 
 /**
  * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
+ * @zone: A zone to check if it is heavily congested
  * @sync: SYNC or ASYNC IO
  * @timeout: timeout in jiffies
  *
- * In the event of a congested backing_dev (any backing_dev), this waits for up
- * to @timeout jiffies for either a BDI to exit congestion of the given @sync
- * queue.
+ * In the event of a congested backing_dev (any backing_dev) and the given
+ * @zone has experienced recent congestion, this waits for up to @timeout
+ * jiffies for either a BDI to exit congestion of the given @sync queue
+ * or a write to complete.
  *
- * If there is no congestion, then cond_resched() is called to yield the
- * processor if necessary but otherwise does not sleep.
+ * In the absense of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
  * returned. return_value == timeout implies the function did not sleep.
  */
-long wait_iff_congested(int sync, long timeout)
+long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	/* If there is no congestion, yield if necessary instead of sleeping */
-	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
+	/*
+	 * If there is no congestion, or heavy congestion is not being
+	 * encountered in the current zone, yield if necessary instead
+	 * of sleeping on the congestion queue
+	 */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
+			!zone_is_reclaim_congested(zone)) {
 		cond_resched();
 
 		/* In case we scheduled, work out time remaining */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9b66c75..64c9c76 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1906,7 +1906,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			wait_iff_congested(BLK_RW_ASYNC, HZ/50);
+			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -2094,7 +2094,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		wait_iff_congested(BLK_RW_ASYNC, HZ/50);
+		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2836913..5ef6294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -311,20 +311,30 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi,
+enum bdi_queue_status {
+	QUEUEWRITE_DENIED,
+	QUEUEWRITE_CONGESTED,
+	QUEUEWRITE_ALLOWED,
+};
+
+static enum bdi_queue_status may_write_to_queue(struct backing_dev_info *bdi,
 			      struct scan_control *sc)
 {
+	enum bdi_queue_status ret = QUEUEWRITE_DENIED;
+
 	if (current->flags & PF_SWAPWRITE)
-		return 1;
+		return QUEUEWRITE_ALLOWED;
 	if (!bdi_write_congested(bdi))
-		return 1;
+		return QUEUEWRITE_ALLOWED;
+	else
+		ret = QUEUEWRITE_CONGESTED;
 	if (bdi == current->backing_dev_info)
-		return 1;
+		return QUEUEWRITE_ALLOWED;
 
 	/* lumpy reclaim for hugepage often need a lot of write */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		return 1;
-	return 0;
+		return QUEUEWRITE_ALLOWED;
+	return ret;
 }
 
 /*
@@ -352,6 +362,8 @@ static void handle_write_error(struct address_space *mapping,
 typedef enum {
 	/* failed to write page out, page is locked */
 	PAGE_KEEP,
+	/* failed to write page out due to congestion, page is locked */
+	PAGE_KEEP_CONGESTED,
 	/* move page to the active list, page is locked */
 	PAGE_ACTIVATE,
 	/* page has been sent to the disk successfully, page is unlocked */
@@ -401,9 +413,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
+	switch (may_write_to_queue(mapping->backing_dev_info, sc)) {
+	case QUEUEWRITE_CONGESTED:
+		return PAGE_KEEP_CONGESTED;
+	case QUEUEWRITE_DENIED:
 		disable_lumpy_reclaim_mode(sc);
 		return PAGE_KEEP;
+	case QUEUEWRITE_ALLOWED:
+		;
 	}
 
 	if (clear_page_dirty_for_io(page)) {
@@ -682,11 +699,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
+				      struct zone *zone,
 				      struct scan_control *sc)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
+	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -706,6 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
+		VM_BUG_ON(page_zone(page) != zone);
 
 		sc->nr_scanned++;
 
@@ -783,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			nr_dirty++;
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -792,6 +815,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/* Page is dirty, try to write it out here */
 			switch (pageout(page, mapping, sc)) {
+			case PAGE_KEEP_CONGESTED:
+				nr_congested++;
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
@@ -903,6 +928,15 @@ keep_lumpy:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	/*
+	 * Tag a zone as congested if all the dirty pages encountered were
+	 * backed by a congested BDI. In this case, reclaimers should just
+	 * back off and wait for congestion to clear because further reclaim
+	 * will encounter the same problem
+	 */
+	if (nr_dirty == nr_congested)
+		zone_set_flag(zone, ZONE_CONGESTED);
+
 	free_page_list(&free_pages);
 
 	list_splice(&ret_pages, page_list);
@@ -1387,12 +1421,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_lumpy_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, sc);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
 	}
 
 	local_irq_disable();
@@ -1940,8 +1974,26 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
-		    priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		    priority < DEF_PRIORITY - 2) {
+			struct zone *active_zone = NULL;
+			unsigned long max_writeback = 0;
+			for_each_zone_zonelist(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask)) {
+				unsigned long writeback;
+
+				/* Initialise for first zone */
+				if (active_zone == NULL)
+					active_zone = zone;
+
+				writeback = zone_page_state(zone, NR_WRITEBACK);
+				if (writeback > max_writeback) {
+					max_writeback = writeback;
+					active_zone = zone;
+				}
+			}
+
+			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
+		}
 	}
 
 out:
@@ -2251,6 +2303,15 @@ loop_again:
 				if (!zone_watermark_ok(zone, order,
 					    min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
+			} else {
+				/*
+				 * If a zone reaches its high watermark,
+				 * consider it to be no longer congested. It's
+				 * possible there are dirty pages backed by
+				 * congested BDIs but as pressure is relieved,
+				 * spectulatively avoid congestion waits
+				 */
+				zone_clear_flag(zone, ZONE_CONGESTED);
 			}
 
 		}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs
  2010-09-15 12:27   ` Mel Gorman
@ 2010-09-16  7:59     ` Minchan Kim
  -1 siblings, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2010-09-16  7:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Wed, Sep 15, 2010 at 01:27:50PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDI congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> wait_iff_congested(), which checks for congestion and only sleeps if a BDI is
> congested; otherwise, it calls cond_resched() to ensure the caller is not
> hogging the CPU for longer than its quota, but it will not sleep.
> 
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.

I was confused by the kswapd you mentioned.
This patch affects only direct reclaim.
Please complete the description, for example:

"This patch affects direct reclaimers to reduce stalls"
Otherwise, looks good to me.

> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-15 12:27   ` Mel Gorman
@ 2010-09-16  8:13     ` Minchan Kim
  -1 siblings, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2010-09-16  8:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Wed, Sep 15, 2010 at 01:27:51PM +0100, Mel Gorman wrote:
> If wait_iff_congested() is called with no BDI congested, the function simply
> calls cond_resched(). In the event there is significant writeback happening
> in the zone that is being reclaimed, this can be a poor decision as reclaim
> would succeed once writeback was completed. Without any backoff logic,
> younger clean pages can be reclaimed resulting in more reclaim overall and
> poor performance.

I agree. 

> 
> This patch tracks how many pages backed by a congested BDI were found during
> scanning. If all the dirty pages encountered on a list isolated from the
> LRU belong to a congested BDI, the zone is marked congested until the zone

I am not sure this works well.
We only have to meet the condition once, yet we back off until the high
watermark is reached (e.g. 32 isolated dirty pages == 32 pages on a congested
BDI). My first impression is that this is rather _aggressive_.

How about doing more checking? For example, if the pattern above repeats
beyond some threshold, we could regard the zone as congested, and if the
pattern is not repeated for some period, we could regard the zone as no
longer congested (a rough sketch follows below).
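
Something like the following, purely as an illustration (congested_batches
would be a new per-zone field, the threshold is arbitrary, and this would sit
where shrink_page_list() currently compares nr_dirty with nr_congested):

	#define ZONE_CONGESTION_THRESHOLD	3	/* made-up value */

	if (nr_dirty && nr_dirty == nr_congested) {
		/* only tag the zone after several all-congested batches */
		if (++zone->congested_batches >= ZONE_CONGESTION_THRESHOLD)
			zone_set_flag(zone, ZONE_CONGESTED);
	} else {
		/* a clean batch resets the hysteresis and clears the tag */
		zone->congested_batches = 0;
		zone_clear_flag(zone, ZONE_CONGESTED);
	}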

> reaches the high watermark.  wait_iff_congested() then checks both the
> number of congested BDIs and whether the current zone is one that has
> encountered congestion recently; if so, it will sleep on the congestion
> queue. Otherwise it will call cond_resched() to yield the processor if
> necessary.
> 
> The end result is that waiting on the congestion queue is avoided when it
> is not necessary, but when significant congestion is being encountered,
> reclaimers and page allocators will back off.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/backing-dev.h |    2 +-
>  include/linux/mmzone.h      |    8 ++++
>  mm/backing-dev.c            |   23 ++++++++----
>  mm/page_alloc.c             |    4 +-
>  mm/vmscan.c                 |   83 +++++++++++++++++++++++++++++++++++++------
>  5 files changed, 98 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 72bb510..f1b402a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> +static enum bdi_queue_status may_write_to_queue(struct backing_dev_info *bdi,

<snip>

>  			      struct scan_control *sc)
>  {
> +	enum bdi_queue_status ret = QUEUEWRITE_DENIED;
> +
>  	if (current->flags & PF_SWAPWRITE)
> -		return 1;
> +		return QUEUEWRITE_ALLOWED;
>  	if (!bdi_write_congested(bdi))
> -		return 1;
> +		return QUEUEWRITE_ALLOWED;
> +	else
> +		ret = QUEUEWRITE_CONGESTED;
>  	if (bdi == current->backing_dev_info)
> -		return 1;
> +		return QUEUEWRITE_ALLOWED;
>  
>  	/* lumpy reclaim for hugepage often need a lot of write */
>  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> -		return 1;
> -	return 0;
> +		return QUEUEWRITE_ALLOWED;
> +	return ret;
>  }

The function can never actually return QUEUEWRITE_DENIED: every path that
reaches the final return has already set ret to QUEUEWRITE_CONGESTED. That
affects whether disable_lumpy_reclaim_mode() is ever called from pageout().

>  
>  /*
> @@ -352,6 +362,8 @@ static void handle_write_error(struct address_space *mapping,
>  typedef enum {
>  	/* failed to write page out, page is locked */
>  	PAGE_KEEP,
> +	/* failed to write page out due to congestion, page is locked */
> +	PAGE_KEEP_CONGESTED,
>  	/* move page to the active list, page is locked */
>  	PAGE_ACTIVATE,
>  	/* page has been sent to the disk successfully, page is unlocked */
> @@ -401,9 +413,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
>  	}
>  	if (mapping->a_ops->writepage == NULL)
>  		return PAGE_ACTIVATE;
> -	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
> +	switch (may_write_to_queue(mapping->backing_dev_info, sc)) {
> +	case QUEUEWRITE_CONGESTED:
> +		return PAGE_KEEP_CONGESTED;
> +	case QUEUEWRITE_DENIED:
>  		disable_lumpy_reclaim_mode(sc);
>  		return PAGE_KEEP;
> +	case QUEUEWRITE_ALLOWED:
> +		;
>  	}
>  
>  	if (clear_page_dirty_for_io(page)) {
> @@ -682,11 +699,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> +				      struct zone *zone,
>  				      struct scan_control *sc)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
>  	int pgactivate = 0;
> +	unsigned long nr_dirty = 0;
> +	unsigned long nr_congested = 0;
>  	unsigned long nr_reclaimed = 0;
>  
>  	cond_resched();
> @@ -706,6 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			goto keep;
>  
>  		VM_BUG_ON(PageActive(page));
> +		VM_BUG_ON(page_zone(page) != zone);
>  
>  		sc->nr_scanned++;
>  
> @@ -783,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		if (PageDirty(page)) {
> +			nr_dirty++;
> +
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> @@ -792,6 +815,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  			/* Page is dirty, try to write it out here */
>  			switch (pageout(page, mapping, sc)) {
> +			case PAGE_KEEP_CONGESTED:
> +				nr_congested++;
>  			case PAGE_KEEP:
>  				goto keep_locked;
>  			case PAGE_ACTIVATE:
> @@ -903,6 +928,15 @@ keep_lumpy:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	/*
> +	 * Tag a zone as congested if all the dirty pages encountered were
> +	 * backed by a congested BDI. In this case, reclaimers should just
> +	 * back off and wait for congestion to clear because further reclaim
> +	 * will encounter the same problem
> +	 */
> +	if (nr_dirty == nr_congested)
> +		zone_set_flag(zone, ZONE_CONGESTED);
> +
>  	free_page_list(&free_pages);
>  
>  	list_splice(&ret_pages, page_list);
> @@ -1387,12 +1421,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	nr_reclaimed = shrink_page_list(&page_list, sc);
> +	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
>  
>  	/* Check if we should syncronously wait for writeback */
>  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>  		set_lumpy_reclaim_mode(priority, sc, true);
> -		nr_reclaimed += shrink_page_list(&page_list, sc);
> +		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>  	}
>  
>  	local_irq_disable();
> @@ -1940,8 +1974,26 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  
>  		/* Take a nap, wait for some writeback to complete */
>  		if (!sc->hibernation_mode && sc->nr_scanned &&
> -		    priority < DEF_PRIORITY - 2)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		    priority < DEF_PRIORITY - 2) {
> +			struct zone *active_zone = NULL;
> +			unsigned long max_writeback = 0;
> +			for_each_zone_zonelist(zone, z, zonelist,
> +					gfp_zone(sc->gfp_mask)) {
> +				unsigned long writeback;
> +
> +				/* Initialise for first zone */
> +				if (active_zone == NULL)
> +					active_zone = zone;
> +
> +				writeback = zone_page_state(zone, NR_WRITEBACK);
> +				if (writeback > max_writeback) {
> +					max_writeback = writeback;
> +					active_zone = zone;
> +				}
> +			}
> +
> +			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> +		}

Other places only consider the preferred zone.
What is the rationale for picking the zone with the most writeback out of the
whole zonelist when calling wait_iff_congested()?
The max-writeback zone may be backed by a much slower BDI that this process is
not even writing to, so from this process's point of view it can cause random
stalls.
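
For illustration only, a minimal sketch of backing off against the preferred
zone instead (not part of this series; it assumes the wait_iff_congested()
signature introduced here and uses a NULL nodemask for simplicity):

		/* Take a nap, wait for some writeback to complete */
		if (!sc->hibernation_mode && sc->nr_scanned &&
		    priority < DEF_PRIORITY - 2) {
			struct zone *preferred_zone;

			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
						NULL, &preferred_zone);
			if (preferred_zone)
				wait_iff_congested(preferred_zone,
						BLK_RW_ASYNC, HZ/10);
		}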

>  	}
>  
>  out:
> @@ -2251,6 +2303,15 @@ loop_again:
>  				if (!zone_watermark_ok(zone, order,
>  					    min_wmark_pages(zone), end_zone, 0))
>  					has_under_min_watermark_zone = 1;
> +			} else {
> +				/*
> +				 * If a zone reaches its high watermark,
> +				 * consider it to be no longer congested. It's
> +				 * possible there are dirty pages backed by
> +				 * congested BDIs but as pressure is relieved,
> +				 * spectulatively avoid congestion waits
> +				 */
> +				zone_clear_flag(zone, ZONE_CONGESTED);
>  			}
>  
>  		}
> -- 
> 1.7.1
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 7/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs
  2010-09-16  7:59     ` Minchan Kim
@ 2010-09-16  8:23       ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-16  8:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Thu, Sep 16, 2010 at 04:59:49PM +0900, Minchan Kim wrote:
> On Wed, Sep 15, 2010 at 01:27:50PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDI congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested else, it calls cond_resched() to ensure the caller is not hogging
> > the CPU longer than its quota but otherwise will not sleep.
> > 
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> 
> I confused due to kswapd you mentioned.
> This patch affects only direct reclaim.
> Please, complete the description. 
> 

My bad, when the description was first written, both were affected and I
neglected to correct the description. I'm still debating with myself as
to whether the kswapd congestion_wait() should be wait_iff_congested()
or not.

Thanks

> "This patch affects direct reclaimer to reduce stall"
> Otherwise, looks good to me. 
> 
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-16  8:13     ` Minchan Kim
@ 2010-09-16  9:18       ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-16  9:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Thu, Sep 16, 2010 at 05:13:38PM +0900, Minchan Kim wrote:
> On Wed, Sep 15, 2010 at 01:27:51PM +0100, Mel Gorman wrote:
> > If wait_iff_congested() is called with no BDI congested, the function simply
> > calls cond_resched(). In the event there is significant writeback happening
> > in the zone that is being reclaimed, this can be a poor decision as reclaim
> > would succeed once writeback was completed. Without any backoff logic,
> > younger clean pages can be reclaimed resulting in more reclaim overall and
> > poor performance.
> 
> I agree. 
> 
> > 
> > This patch tracks how many pages backed by a congested BDI were found during
> > scanning. If all the dirty pages encountered on a list isolated from the
> > LRU belong to a congested BDI, the zone is marked congested until the zone
> 
> I am not sure it works well. 

Check the completion times for the micro-mapped-file-stream benchmark in
the leader mail. Backing off like this is faster overall for some
workloads.

> We just met the condition once but we backoff it until high watermark.

Reaching the high watermark is treated as a sign that the pressure has been relieved.

> (ex, 32 isolated dirty pages == 32 pages on congestioned bdi)
> First impression is rather _aggressive_.
> 

Yes, it is. I intended to start with something quite aggressive that is
close to existing behaviour and then experiment with alternatives.

For example, I considered clearing zone congestion only when nr_bdi_congested
drops to 0. This would be less aggressive in terms of congestion waiting but
it is further from today's behaviour. I felt it would be best to introduce
wait_iff_congested() in one kernel cycle and leave any larger deviation from
congestion_wait() to a later cycle.
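
Purely as a sketch of that alternative (not code from this series;
bdi_any_congested() is a made-up helper that would have to be exported from
wherever the nr_bdi_congested counter ends up living):

	/* clear the flag as soon as no BDI is congested at all */
	if (!bdi_any_congested(BLK_RW_ASYNC))
		zone_clear_flag(zone, ZONE_CONGESTED);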

> How about more checking?
> For example, if above pattern continues repeately above some threshold,
> we can regard "zone is congested" and then if the pattern isn't repeated 
> during some threshold, we can regard "zone isn't congested any more.".
> 

I also considered these options and got stuck on what the "some
threshold" should be and how to record the history. Should it be recorded on a
per-BDI basis, for example? I think all these questions can be answered
but they belong in a different cycle.
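
For concreteness, one possible shape of the suggested hysteresis, purely
hypothetical (the streak fields do not exist in struct zone and the threshold
is arbitrary):

#define ZONE_CONGESTION_THRESHOLD 3	/* arbitrary for this sketch */

	/* at the end of shrink_page_list(), replacing the single check */
	if (nr_dirty && nr_dirty == nr_congested) {
		zone->uncongested_streak = 0;
		if (++zone->congested_streak >= ZONE_CONGESTION_THRESHOLD)
			zone_set_flag(zone, ZONE_CONGESTED);
	} else {
		zone->congested_streak = 0;
		if (++zone->uncongested_streak >= ZONE_CONGESTION_THRESHOLD)
			zone_clear_flag(zone, ZONE_CONGESTED);
	}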

> > reaches the high watermark.  wait_iff_congested() then checks both the
> > number of congested BDIs and if the current zone is one that has encounted
> > congestion recently, it will sleep on the congestion queue. Otherwise it
> > will call cond_reched() to yield the processor if necessary.
> > 
> > The end result is that waiting on the congestion queue is avoided when
> > necessary but when significant congestion is being encountered,
> > reclaimers and page allocators will back off.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/backing-dev.h |    2 +-
> >  include/linux/mmzone.h      |    8 ++++
> >  mm/backing-dev.c            |   23 ++++++++----
> >  mm/page_alloc.c             |    4 +-
> >  mm/vmscan.c                 |   83 +++++++++++++++++++++++++++++++++++++------
> >  5 files changed, 98 insertions(+), 22 deletions(-)
> > 
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 72bb510..f1b402a 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > +static enum bdi_queue_status may_write_to_queue(struct backing_dev_info *bdi,
> 
> <snip>
> 
> >  			      struct scan_control *sc)
> >  {
> > +	enum bdi_queue_status ret = QUEUEWRITE_DENIED;
> > +
> >  	if (current->flags & PF_SWAPWRITE)
> > -		return 1;
> > +		return QUEUEWRITE_ALLOWED;
> >  	if (!bdi_write_congested(bdi))
> > -		return 1;
> > +		return QUEUEWRITE_ALLOWED;
> > +	else
> > +		ret = QUEUEWRITE_CONGESTED;
> >  	if (bdi == current->backing_dev_info)
> > -		return 1;
> > +		return QUEUEWRITE_ALLOWED;
> >  
> >  	/* lumpy reclaim for hugepage often need a lot of write */
> >  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> > -		return 1;
> > -	return 0;
> > +		return QUEUEWRITE_ALLOWED;
> > +	return ret;
> >  }
> 
> The function can't return QUEUEXXX_DENIED.
> It can affect disable_lumpy_reclaim. 
> 

Yes, but that change was made in "vmscan: Narrow the scenarios lumpy
reclaim uses synchrounous reclaim". Maybe I am misunderstanding your
objection.

> >  
> >  /*
> > @@ -352,6 +362,8 @@ static void handle_write_error(struct address_space *mapping,
> >  typedef enum {
> >  	/* failed to write page out, page is locked */
> >  	PAGE_KEEP,
> > +	/* failed to write page out due to congestion, page is locked */
> > +	PAGE_KEEP_CONGESTED,
> >  	/* move page to the active list, page is locked */
> >  	PAGE_ACTIVATE,
> >  	/* page has been sent to the disk successfully, page is unlocked */
> > @@ -401,9 +413,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> >  	}
> >  	if (mapping->a_ops->writepage == NULL)
> >  		return PAGE_ACTIVATE;
> > -	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
> > +	switch (may_write_to_queue(mapping->backing_dev_info, sc)) {
> > +	case QUEUEWRITE_CONGESTED:
> > +		return PAGE_KEEP_CONGESTED;
> > +	case QUEUEWRITE_DENIED:
> >  		disable_lumpy_reclaim_mode(sc);
> >  		return PAGE_KEEP;
> > +	case QUEUEWRITE_ALLOWED:
> > +		;
> >  	}
> >  
> >  	if (clear_page_dirty_for_io(page)) {
> > @@ -682,11 +699,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> > +				      struct zone *zone,
> >  				      struct scan_control *sc)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> >  	int pgactivate = 0;
> > +	unsigned long nr_dirty = 0;
> > +	unsigned long nr_congested = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> >  	cond_resched();
> > @@ -706,6 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			goto keep;
> >  
> >  		VM_BUG_ON(PageActive(page));
> > +		VM_BUG_ON(page_zone(page) != zone);
> >  
> >  		sc->nr_scanned++;
> >  
> > @@ -783,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		}
> >  
> >  		if (PageDirty(page)) {
> > +			nr_dirty++;
> > +
> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > @@ -792,6 +815,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  
> >  			/* Page is dirty, try to write it out here */
> >  			switch (pageout(page, mapping, sc)) {
> > +			case PAGE_KEEP_CONGESTED:
> > +				nr_congested++;
> >  			case PAGE_KEEP:
> >  				goto keep_locked;
> >  			case PAGE_ACTIVATE:
> > @@ -903,6 +928,15 @@ keep_lumpy:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> >  
> > +	/*
> > +	 * Tag a zone as congested if all the dirty pages encountered were
> > +	 * backed by a congested BDI. In this case, reclaimers should just
> > +	 * back off and wait for congestion to clear because further reclaim
> > +	 * will encounter the same problem
> > +	 */
> > +	if (nr_dirty == nr_congested)
> > +		zone_set_flag(zone, ZONE_CONGESTED);
> > +
> >  	free_page_list(&free_pages);
> >  
> >  	list_splice(&ret_pages, page_list);
> > @@ -1387,12 +1421,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, sc);
> > +	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
> >  
> >  	/* Check if we should syncronously wait for writeback */
> >  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> >  		set_lumpy_reclaim_mode(priority, sc, true);
> > -		nr_reclaimed += shrink_page_list(&page_list, sc);
> > +		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> >  	}
> >  
> >  	local_irq_disable();
> > @@ -1940,8 +1974,26 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >  
> >  		/* Take a nap, wait for some writeback to complete */
> >  		if (!sc->hibernation_mode && sc->nr_scanned &&
> > -		    priority < DEF_PRIORITY - 2)
> > -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +		    priority < DEF_PRIORITY - 2) {
> > +			struct zone *active_zone = NULL;
> > +			unsigned long max_writeback = 0;
> > +			for_each_zone_zonelist(zone, z, zonelist,
> > +					gfp_zone(sc->gfp_mask)) {
> > +				unsigned long writeback;
> > +
> > +				/* Initialise for first zone */
> > +				if (active_zone == NULL)
> > +					active_zone = zone;
> > +
> > +				writeback = zone_page_state(zone, NR_WRITEBACK);
> > +				if (writeback > max_writeback) {
> > +					max_writeback = writeback;
> > +					active_zone = zone;
> > +				}
> > +			}
> > +
> > +			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> > +		}
> 
> Other place just considers preferred zone. 
> What is the rationale that consider max writeback zone in all zone of zonelist to 
> call wait_iff_congeested?

Initially, it was because the wait_iff_congested() heuristic was based on
writeback, not zone congestion.  This time around, it was because I wanted
the trigger for the congestion wait to be aggressive enough to improve on the
existing behaviour without straying too far from it.

> Maybe max writeback zone can be much slow bdi but this process could be not related
> to the bdi. It can make random stall by point of view of this proces.
> 

Fair point, I will retest using the preferred zone.

> >  	}
> >  
> >  out:
> > @@ -2251,6 +2303,15 @@ loop_again:
> >  				if (!zone_watermark_ok(zone, order,
> >  					    min_wmark_pages(zone), end_zone, 0))
> >  					has_under_min_watermark_zone = 1;
> > +			} else {
> > +				/*
> > +				 * If a zone reaches its high watermark,
> > +				 * consider it to be no longer congested. It's
> > +				 * possible there are dirty pages backed by
> > +				 * congested BDIs but as pressure is relieved,
> > +				 * spectulatively avoid congestion waits
> > +				 */
> > +				zone_clear_flag(zone, ZONE_CONGESTED);
> >  			}
> >  
> >  		}
> > -- 
> > 1.7.1
> > 
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-16  9:18       ` Mel Gorman
@ 2010-09-16 14:11         ` Minchan Kim
  -1 siblings, 0 replies; 59+ messages in thread
From: Minchan Kim @ 2010-09-16 14:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Thu, Sep 16, 2010 at 10:18:24AM +0100, Mel Gorman wrote:
> On Thu, Sep 16, 2010 at 05:13:38PM +0900, Minchan Kim wrote:
> > On Wed, Sep 15, 2010 at 01:27:51PM +0100, Mel Gorman wrote:
> > > If wait_iff_congested() is called with no BDI congested, the function simply
> > > calls cond_resched(). In the event there is significant writeback happening
> > > in the zone that is being reclaimed, this can be a poor decision as reclaim
> > > would succeed once writeback was completed. Without any backoff logic,
> > > younger clean pages can be reclaimed resulting in more reclaim overall and
> > > poor performance.
> > 
> > I agree. 
> > 
> > > 
> > > This patch tracks how many pages backed by a congested BDI were found during
> > > scanning. If all the dirty pages encountered on a list isolated from the
> > > LRU belong to a congested BDI, the zone is marked congested until the zone
> > 
> > I am not sure it works well. 
> 
> Check the competion times for the micro-mapped-file-stream benchmark in
> the leader mail. Backing off like this is faster overall for some
> workloads.
> 
> > We just met the condition once but we backoff it until high watermark.
> 
> Reaching the high watermark is considered to be a relieving of pressure.
> 
> > (ex, 32 isolated dirty pages == 32 pages on congestioned bdi)
> > First impression is rather _aggressive_.
> > 
> 
> Yes, it is. I intended to start with something quite aggressive that is
> close to existing behaviour and then experiment with alternatives.

Agree. 

> 
> For example, I considered clearing zone congestion when but nr_bdi_congested
> drops to 0. This would be less aggressive in terms of congestion waiting but
> it is further from todays behaviour. I felt it would be best to introduce
> wait_iff_congested() in one kernel cycle but wait to a later cycle to deviate
> a lot from congestion_wait().

Fair enough. 

> 
> > How about more checking?
> > For example, if above pattern continues repeately above some threshold,
> > we can regard "zone is congested" and then if the pattern isn't repeated 
> > during some threshold, we can regard "zone isn't congested any more.".
> > 
> 
> I also considered these options and got stuck at what the "some
> threshold" is and how to record the history. Should it be recorded on a
> per BDI basis for example? I think all these questions can be answered
> but should be in a different cycle.
> 
> > > reaches the high watermark.  wait_iff_congested() then checks both the
> > > number of congested BDIs and if the current zone is one that has encounted
> > > congestion recently, it will sleep on the congestion queue. Otherwise it
> > > will call cond_reched() to yield the processor if necessary.
> > > 
> > > The end result is that waiting on the congestion queue is avoided when
> > > necessary but when significant congestion is being encountered,
> > > reclaimers and page allocators will back off.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  include/linux/backing-dev.h |    2 +-
> > >  include/linux/mmzone.h      |    8 ++++
> > >  mm/backing-dev.c            |   23 ++++++++----
> > >  mm/page_alloc.c             |    4 +-
> > >  mm/vmscan.c                 |   83 +++++++++++++++++++++++++++++++++++++------
> > >  5 files changed, 98 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > > index 72bb510..f1b402a 100644
> > > --- a/include/linux/backing-dev.h
> > > +++ b/include/linux/backing-dev.h
> > > +static enum bdi_queue_status may_write_to_queue(struct backing_dev_info *bdi,
> > 
> > <snip>
> > 
> > >  			      struct scan_control *sc)
> > >  {
> > > +	enum bdi_queue_status ret = QUEUEWRITE_DENIED;
> > > +
> > >  	if (current->flags & PF_SWAPWRITE)
> > > -		return 1;
> > > +		return QUEUEWRITE_ALLOWED;
> > >  	if (!bdi_write_congested(bdi))
> > > -		return 1;
> > > +		return QUEUEWRITE_ALLOWED;
> > > +	else
> > > +		ret = QUEUEWRITE_CONGESTED;
> > >  	if (bdi == current->backing_dev_info)
> > > -		return 1;
> > > +		return QUEUEWRITE_ALLOWED;
> > >  
> > >  	/* lumpy reclaim for hugepage often need a lot of write */
> > >  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> > > -		return 1;
> > > -	return 0;
> > > +		return QUEUEWRITE_ALLOWED;
> > > +	return ret;
> > >  }
> > 
> > The function can't return QUEUEXXX_DENIED.
> > It can affect disable_lumpy_reclaim. 
> > 
> 
> Yes, but that change was made in "vmscan: Narrow the scenarios lumpy
> reclaim uses synchrounous reclaim". Maybe I am misunderstanding your
> objection.

I mean that the current may_write_to_queue() can never return QUEUEWRITE_DENIED.
What is its role, then?

In addition, we don't need disable_lumpy_reclaim_mode() in pageout().
That's because both PAGE_KEEP and PAGE_KEEP_CONGESTED go to keep_locked,
which ends up calling disable_lumpy_reclaim_mode() anyway.
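
Roughly the fall-through being described, paraphrased rather than quoted
(the exact code in the series may differ slightly):

keep_locked:
		unlock_page(page);
keep:
		/* PAGE_KEEP and PAGE_KEEP_CONGESTED both end up here */
		disable_lumpy_reclaim_mode(sc);
keep_lumpy:
		list_add(&page->lru, &ret_pages);
		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));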

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-16 14:11         ` Minchan Kim
@ 2010-09-16 15:18           ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-16 15:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

> > > <snip>
> > > 
> > > >  			      struct scan_control *sc)
> > > >  {
> > > > +	enum bdi_queue_status ret = QUEUEWRITE_DENIED;
> > > > +
> > > >  	if (current->flags & PF_SWAPWRITE)
> > > > -		return 1;
> > > > +		return QUEUEWRITE_ALLOWED;
> > > >  	if (!bdi_write_congested(bdi))
> > > > -		return 1;
> > > > +		return QUEUEWRITE_ALLOWED;
> > > > +	else
> > > > +		ret = QUEUEWRITE_CONGESTED;
> > > >  	if (bdi == current->backing_dev_info)
> > > > -		return 1;
> > > > +		return QUEUEWRITE_ALLOWED;
> > > >  
> > > >  	/* lumpy reclaim for hugepage often need a lot of write */
> > > >  	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> > > > -		return 1;
> > > > -	return 0;
> > > > +		return QUEUEWRITE_ALLOWED;
> > > > +	return ret;
> > > >  }
> > > 
> > > The function can't return QUEUEXXX_DENIED.
> > > It can affect disable_lumpy_reclaim. 
> > > 
> > 
> > Yes, but that change was made in "vmscan: Narrow the scenarios lumpy
> > reclaim uses synchrounous reclaim". Maybe I am misunderstanding your
> > objection.
> 
> I means current may_write_to_queue never returns QUEUEWRITE_DENIED.
> What's the role of it?
> 

As of now, there is little point because QUEUEWRITE_CONGESTED implies denied. I was
allowing for the possibility of distinguishing between these cases in the future,
depending on what happened with wait_iff_congested(). I will drop it for simplicity
and reintroduce it if and when there is a distinction between
denied and congested.

> In addition, we don't need disable_lumpy_reclaim_mode() in pageout.
> That's because both PAGE_KEEP and PAGE_KEEP_CONGESTED go to keep_locked
> and calls disable_lumpy_reclaim_mode at last. 
> 

True, good spot.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-09-15 12:27 ` Mel Gorman
@ 2010-09-16 22:28   ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2010-09-16 22:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, 15 Sep 2010 13:27:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> This is v2 of a series to reduce some of the latencies seen in page reclaim
> and to improve the efficiency a bit.

epic changelog!

>
> ...
>
> The tests run were as follows
> 
> kernbench
> 	compile-based benchmark. Smoke test performance
> 
> sysbench
> 	OLTP read-only benchmark. Will be re-run in the future as read-write
> 
> micro-mapped-file-stream
> 	This is a micro-benchmark from Johannes Weiner that accesses a
> 	large sparse-file through mmap(). It was configured to run in only
> 	single-CPU mode but can be indicative of how well page reclaim
> 	identifies suitable pages.
> 
> stress-highalloc
> 	Tries to allocate huge pages under heavy load.
> 
> kernbench, iozone and sysbench did not report any performance regression
> on any machine. sysbench did pressure the system lightly and there was reclaim
> activity but there were no difference of major interest between the kernels.
> 
> X86-64 micro-mapped-file-stream
> 
>                                       traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
> pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
> pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
> pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
> pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
> pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
> pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
> pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
> pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
> pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
> allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
> 
> These are based on the raw figures taken from /proc/vmstat. It's a rough
> measure of reclaim activity. Note that allocstall counts are higher because
> we are entering direct reclaim more often as a result of not sleeping in
> congestion. In itself, it's not necessarily a bad thing. It's easier to
> get a view of what happened from the vmscan tracepoint report.
> 
> FTrace Reclaim Statistics: vmscan
> 
>                                 traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims                                443        273        513       1568 
> Direct reclaim pages scanned                305968     280402     600825     957933 
> Direct reclaim pages reclaimed               43503      19005      30327     117191 
> Direct reclaim write file async I/O              0          0          0          0 
> Direct reclaim write anon async I/O              0          3          4         12 
> Direct reclaim write file sync I/O               0          0          0          0 
> Direct reclaim write anon sync I/O               0          0          0          0 
> Wake kswapd requests                        187649     132338     191695     267701 
> Kswapd wakeups                                   3          1          4          1 
> Kswapd pages scanned                       4599269    4454162    4296815    3891906 
> Kswapd pages reclaimed                     2295947    2428434    2399818    2319706 
> Kswapd reclaim write file async I/O              1          0          1          1 
> Kswapd reclaim write anon async I/O             59        187         41        222 
> Kswapd reclaim write file sync I/O               0          0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0          0 
> Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96 
> Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19 
> 
> Total pages scanned                        4905237   4734564   4897640   4849839
> Total pages reclaimed                      2339450   2447439   2430145   2436897
> %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
> %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
> Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
> 
> What is interesting here for nocongest in particular is that while direct
> reclaim scans more pages, the overall number of pages scanned remains the same
> and the ratio of pages scanned to pages reclaimed is more or less the same. In
> other words, while we are sleeping less, reclaim is not doing more work and
> as direct reclaim and kswapd are awake for less time, they would appear to be doing less work.

Yes, I think the reclaimed/scanned ratio (what I call "reclaim
efficiency") is a key metric.

50% is low!  What's the testcase here? micro-mapped-file-stream?
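
As an aside for anyone reproducing these figures: the "%age total pages
scanned/reclaimed" rows above are pages reclaimed divided by pages scanned,
e.g. 2339450 / 4905237 = 47.69% for the traceonly kernel. A minimal userspace
sketch of the same calculation from a /proc/vmstat snapshot follows; it is
illustrative only (not part of MMTests), and the reports above compare
before/after snapshots over a test run rather than raw totals.

        /* Sketch: sum the pgscan_* and pgsteal_* counters from /proc/vmstat
         * and report reclaimed/scanned as a percentage. */
        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                char name[64];
                unsigned long long val, scanned = 0, reclaimed = 0;
                FILE *fp = fopen("/proc/vmstat", "r");

                if (!fp)
                        return 1;

                while (fscanf(fp, "%63s %llu", name, &val) == 2) {
                        if (!strncmp(name, "pgscan_", 7))
                                scanned += val;
                        else if (!strncmp(name, "pgsteal_", 8))
                                reclaimed += val;
                }
                fclose(fp);

                printf("reclaim efficiency: %.2f%%\n",
                       scanned ? 100.0 * reclaimed / scanned : 0.0);
                return 0;
        }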

It's strange that the "total pages reclaimed" increased a little.  Just
a measurement glitch?

> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                87        196         64          0 
> Direct time   congest     waited            4604ms     4732ms     5420ms        0ms 
> Direct full   congest     waited                72        145         53          0 
> Direct number conditional waited                 0          0        324       1315 
> Direct time   conditional waited               0ms        0ms        0ms        0ms 
> Direct full   conditional waited                 0          0          0          0 
> KSwapd number congest     waited                20         10         15          7 
> KSwapd time   congest     waited            1264ms      536ms      884ms      284ms 
> KSwapd full   congest     waited                10          4          6          2 
> KSwapd number conditional waited                 0          0          0          0 
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
> KSwapd full   conditional waited                 0          0          0          0 
> 
> The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
> all asleep with the patches.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
> Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76

Is that user time plus system time?  If so, why didn't user+sys equal
elapsed in the we-never-slept-in-congestion-wait() case?  Because the
test's CPU got stolen by kswapd perhaps?

> Overall, the tests completed faster. It is interesting to note that backing off further
> when a zone is congested and not just a BDI was more efficient overall.
> 
> PPC64 micro-mapped-file-stream
> pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
> pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
> pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
> pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
> pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
> allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
> 
> ...
>
> 
> X86-64 STRESS-HIGHALLOC
>                 traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
> Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
> Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
> At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
> 
> Success figures across the board are broadly similar.
> 
>                 traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
> Direct reclaims                               1045        944        886        887 
> Direct reclaim pages scanned                135091     119604     109382     101019 
> Direct reclaim pages reclaimed               88599      47535      47863      46671 
> Direct reclaim write file async I/O            494        283        465        280 
> Direct reclaim write anon async I/O          29357      13710      16656      13462 
> Direct reclaim write file sync I/O             154          2          2          3 
> Direct reclaim write anon sync I/O           14594        571        509        561 
> Wake kswapd requests                          7491        933        872        892 
> Kswapd wakeups                                 814        778        731        780 
> Kswapd pages scanned                       7290822   15341158   11916436   13703442 
> Kswapd pages reclaimed                     3587336    3142496    3094392    3187151 
> Kswapd reclaim write file async I/O          91975      32317      28022      29628 
> Kswapd reclaim write anon async I/O        1992022     789307     829745     849769 
> Kswapd reclaim write file sync I/O               0          0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0          0 
> Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07 
> Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82 
> 
> Total pages scanned                        7425913  15460762  12025818  13804461
> Total pages reclaimed                      3675935   3190031   3142255   3233822
> %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
> %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
> %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
> Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
> Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
> 
> Scanned/reclaimed ratios again look good with big improvements in
> efficiency. The Scanned/written ratios also look much improved. With a
> better scanned/written ratio, there is an expectation that IO would be more
> efficient and indeed, the time spent in direct reclaim is much reduced by
> the full series and kswapd spends a little less time awake.

Wait.  The reclaim efficiency got *worse*, didn't it?  To reclaim
3,xxx,xxx pages, the number of pages we had to scan went from 7,xxx,xxx
up to 13,xxx,xxx?

>
> ...
>
> I think this series is ready for much wider testing. The lowlumpy patches in
> particular should be relatively uncontroversial. While their largest impact
> can be seen in the high order stress tests, they would also have an impact
> if SLUB was configured (these tests are based on slab) and stalls in lumpy
> reclaim could be partially responsible for some desktop stalling reports.

slub sucks :(

Is this patchset likely to have any impact on the "hey my net driver
couldn't do an order 3 allocation" reports?  I guess not.

>
> ...
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-15 12:27   ` Mel Gorman
@ 2010-09-16 22:28     ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2010-09-16 22:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Wed, 15 Sep 2010 13:27:51 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> If wait_iff_congested() is called with no BDI congested, the function simply
> calls cond_resched(). In the event there is significant writeback happening
> in the zone that is being reclaimed, this can be a poor decision as reclaim
> would succeed once writeback was completed. Without any backoff logic,
> younger clean pages can be reclaimed resulting in more reclaim overall and
> poor performance.

This is because cond_resched() is a no-op, and we skip around the
under-writeback pages and go off and look further along the LRU for
younger clean pages, yes?
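
For readers following the thread, the behaviour being described boils down to
something like the sketch below. This is not the code from the series:
nr_bdi_congested[], congestion_wait() and cond_resched() are existing
interfaces, but zone_is_reclaim_congested() and the exact structure are
assumptions made here purely for illustration.

        long wait_iff_congested(struct zone *zone, int sync, long timeout)
        {
                /*
                 * No BDI is congested, or the zone being reclaimed was not
                 * tagged as congested: there is nothing useful to wait for,
                 * so just yield the CPU instead of sleeping.
                 */
                if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
                    !zone_is_reclaim_congested(zone)) {
                        cond_resched();
                        return 0;
                }

                /* Otherwise back off as congestion_wait() would */
                return congestion_wait(sync, timeout);
        }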

> This patch tracks how many pages backed by a congested BDI were found during
> scanning. If all the dirty pages encountered on a list isolated from the
> LRU belong to a congested BDI, the zone is marked congested until the zone
> reaches the high watermark.

High watermark, or low watermark?

The terms are rather ambiguous so let's avoid them.  Maybe "full"
watermark and "empty"?

>
> ...
>
> @@ -706,6 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			goto keep;
>  
>  		VM_BUG_ON(PageActive(page));
> +		VM_BUG_ON(page_zone(page) != zone);

?

>  		sc->nr_scanned++;
>  
>
> ...
>
> @@ -903,6 +928,15 @@ keep_lumpy:
>  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
>  	}
>  
> +	/*
> +	 * Tag a zone as congested if all the dirty pages encountered were
> +	 * backed by a congested BDI. In this case, reclaimers should just
> +	 * back off and wait for congestion to clear because further reclaim
> +	 * will encounter the same problem
> +	 */
> +	if (nr_dirty == nr_congested)
> +		zone_set_flag(zone, ZONE_CONGESTED);

The implicit "100%" there is a magic number.  hrm.
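
To make that condition concrete, the accounting amounts to something like the
fragment below. This is a simplification written for illustration, not the
hunk from the patch: the real counting happens inline as shrink_page_list()
processes each page, and only bdi_write_congested(), page_mapping() and the
zone_set_flag() call quoted above are taken from existing interfaces.

        unsigned long nr_dirty = 0, nr_congested = 0;
        struct page *page;

        /* page_list is the batch of pages isolated from the LRU */
        list_for_each_entry(page, page_list, lru) {
                struct address_space *mapping = page_mapping(page);

                if (!PageDirty(page))
                        continue;

                nr_dirty++;
                if (mapping && bdi_write_congested(mapping->backing_dev_info))
                        nr_congested++;
        }

        /* Every dirty page seen was backed by a congested BDI */
        if (nr_dirty == nr_congested)
                zone_set_flag(zone, ZONE_CONGESTED);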



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-09-16 22:28   ` Andrew Morton
@ 2010-09-17  7:52     ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-17  7:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Sep 16, 2010 at 03:28:04PM -0700, Andrew Morton wrote:
> On Wed, 15 Sep 2010 13:27:43 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This is v2 of a series to reduce some of the latencies seen in page reclaim
> > and to improve the efficiency a bit.
> 
> epic changelog!
> 

Thanks

> >
> > ...
> >
> > The tests run were as follows
> > 
> > kernbench
> > 	compile-based benchmark. Smoke test performance
> > 
> > sysbench
> > 	OLTP read-only benchmark. Will be re-run in the future as read-write
> > 
> > micro-mapped-file-stream
> > 	This is a micro-benchmark from Johannes Weiner that accesses a
> > 	large sparse-file through mmap(). It was configured to run in only
> > 	single-CPU mode but can be indicative of how well page reclaim
> > 	identifies suitable pages.
> > 
> > stress-highalloc
> > 	Tries to allocate huge pages under heavy load.
> > 
> > kernbench, iozone and sysbench did not report any performance regression
> > on any machine. sysbench did pressure the system lightly and there was reclaim
> > activity but there was no difference of major interest between the kernels.
> > 
> > X86-64 micro-mapped-file-stream
> > 
> >                                       traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
> > pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
> > pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
> > pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
> > pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
> > pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
> > pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
> > pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
> > pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
> > pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
> > allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
> > 
> > These are based on the raw figures taken from /proc/vmstat. It's a rough
> > measure of reclaim activity. Note that allocstall counts are higher because
> > we are entering direct reclaim more often as a result of not sleeping in
> > congestion. In itself, it's not necessarily a bad thing. It's easier to
> > get a view of what happened from the vmscan tracepoint report.
> > 
> > FTrace Reclaim Statistics: vmscan
> > 
> >                                 traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> > Direct reclaims                                443        273        513       1568 
> > Direct reclaim pages scanned                305968     280402     600825     957933 
> > Direct reclaim pages reclaimed               43503      19005      30327     117191 
> > Direct reclaim write file async I/O              0          0          0          0 
> > Direct reclaim write anon async I/O              0          3          4         12 
> > Direct reclaim write file sync I/O               0          0          0          0 
> > Direct reclaim write anon sync I/O               0          0          0          0 
> > Wake kswapd requests                        187649     132338     191695     267701 
> > Kswapd wakeups                                   3          1          4          1 
> > Kswapd pages scanned                       4599269    4454162    4296815    3891906 
> > Kswapd pages reclaimed                     2295947    2428434    2399818    2319706 
> > Kswapd reclaim write file async I/O              1          0          1          1 
> > Kswapd reclaim write anon async I/O             59        187         41        222 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> > Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96 
> > Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19 
> > 
> > Total pages scanned                        4905237   4734564   4897640   4849839
> > Total pages reclaimed                      2339450   2447439   2430145   2436897
> > %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
> > %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> > %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> > Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
> > Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
> > 
> > What is interesting here for nocongest in particular is that while direct
> > reclaim scans more pages, the overall number of pages scanned remains the same
> > and the ratio of pages scanned to pages reclaimed is more or less the same. In
> > other words, while we are sleeping less, reclaim is not doing more work and
> > as direct reclaim and kswapd are awake for less time, they would appear to be doing less work.
> 
> Yes, I think the reclaimed/scanned ratio (what I call "reclaim
> efficiency") is a key metric.
> 

Indeed.

> 50% is low!  What's the testcase here? micro-mapped-file-stream?
> 

It's a streaming write workload Johannes posted at
http://linux--kernel.googlegroups.com/attach/922930ad782c993f/mapped-file-stream.c?gda=C9ZmZUYAAAC7YRbTg15qnVftAVpdAUbEdtSiuVqDFQ7IygxgoOgCJibbrMllVnGRuK4kFCYFogdx40jamwa1UURqDcgHarKEE-Ea7GxYMt0t6nY0uV5FIQ&part=2
He considered it to be a somewhat adverse workload for reclaim.
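
For anyone wanting to recreate that kind of pressure, a very rough sketch of
such a workload is below. It is not Johannes' actual program (see the URL
above for that); the file name, size and the read-only access pattern are
guesses. The point is simply to stream once through an mmap()ed sparse file
much larger than RAM so the page cache fills with use-once pages.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t size = 8UL << 30;        /* make this larger than RAM */
                long pagesize = sysconf(_SC_PAGESIZE);
                int fd = open("sparse-stream.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
                unsigned long sum = 0;
                size_t off;
                char *map;

                if (fd < 0 || ftruncate(fd, size) < 0)
                        return 1;

                map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
                if (map == MAP_FAILED)
                        return 1;

                /* Touch every page once, front to back */
                for (off = 0; off < size; off += pagesize)
                        sum += map[off];

                printf("sum %lu\n", sum);
                munmap(map, size);
                close(fd);
                unlink("sparse-stream.dat");
                return 0;
        }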

> It's strange that the "total pages reclaimed" increased a little.  Just
> a measurement glitch?
> 

Probably not a glitch but the measurements are system-wide. Depending on
the starting state of the system when the benchmark ran, there will be
slightly different scanning numbers.

> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest     waited                87        196         64          0 
> > Direct time   congest     waited            4604ms     4732ms     5420ms        0ms 
> > Direct full   congest     waited                72        145         53          0 
> > Direct number conditional waited                 0          0        324       1315 
> > Direct time   conditional waited               0ms        0ms        0ms        0ms 
> > Direct full   conditional waited                 0          0          0          0 
> > KSwapd number congest     waited                20         10         15          7 
> > KSwapd time   congest     waited            1264ms      536ms      884ms      284ms 
> > KSwapd full   congest     waited                10          4          6          2 
> > KSwapd number conditional waited                 0          0          0          0 
> > KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
> > KSwapd full   conditional waited                 0          0          0          0 
> > 
> > The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
> > all asleep with the patches.
> > 
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
> > Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
> 
> Is that user time plus system time? 

Yes.

> If so, why didn't user+sys equal
> elapsed in the we-never-slept-in-congestion-wait() case?  Because the
> test's CPU got stolen by kswapd perhaps?
> 

One possibility. The other is IO wait time. I'll think about it some
more. I'm afraid this mail is a bit rushed because I'm about to leave
for a wedding. I won't be back online until Monday.
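
On the user+sys versus elapsed point, the gap is time the test spent off-CPU,
whether blocked on IO, sleeping in congestion_wait(), or preempted while
kswapd and other tasks used the CPU. A minimal sketch of that accounting,
purely illustrative and not part of MMTests:

        #include <stdio.h>
        #include <sys/resource.h>
        #include <sys/time.h>
        #include <time.h>

        static double tv(struct timeval t)
        {
                return t.tv_sec + t.tv_usec / 1e6;
        }

        int main(void)
        {
                struct timespec start, end;
                struct rusage ru;
                double elapsed, cpu;

                clock_gettime(CLOCK_MONOTONIC, &start);
                /* ... run the benchmark workload here ... */
                clock_gettime(CLOCK_MONOTONIC, &end);
                getrusage(RUSAGE_SELF, &ru);    /* RUSAGE_CHILDREN if forked */

                elapsed = (end.tv_sec - start.tv_sec) +
                          (end.tv_nsec - start.tv_nsec) / 1e9;
                cpu = tv(ru.ru_utime) + tv(ru.ru_stime);
                printf("user+sys %.2fs elapsed %.2fs off-cpu %.2fs\n",
                       cpu, elapsed, elapsed - cpu);
                return 0;
        }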

> > Overall, the tests completed faster. It is interesting to note that backing off further
> > when a zone is congested and not just a BDI was more efficient overall.
> > 
> > PPC64 micro-mapped-file-stream
> > pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
> > pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
> > pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
> > pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
> > pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
> > allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
> > 
> > ...
> >
> > 
> > X86-64 STRESS-HIGHALLOC
> >                 traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
> > Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
> > Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
> > At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
> > 
> > Success figures across the board are broadly similar.
> > 
> >                 traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
> > Direct reclaims                               1045        944        886        887 
> > Direct reclaim pages scanned                135091     119604     109382     101019 
> > Direct reclaim pages reclaimed               88599      47535      47863      46671 
> > Direct reclaim write file async I/O            494        283        465        280 
> > Direct reclaim write anon async I/O          29357      13710      16656      13462 
> > Direct reclaim write file sync I/O             154          2          2          3 
> > Direct reclaim write anon sync I/O           14594        571        509        561 
> > Wake kswapd requests                          7491        933        872        892 
> > Kswapd wakeups                                 814        778        731        780 
> > Kswapd pages scanned                       7290822   15341158   11916436   13703442 
> > Kswapd pages reclaimed                     3587336    3142496    3094392    3187151 
> > Kswapd reclaim write file async I/O          91975      32317      28022      29628 
> > Kswapd reclaim write anon async I/O        1992022     789307     829745     849769 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> > Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07 
> > Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82 
> > 
> > Total pages scanned                        7425913  15460762  12025818  13804461
> > Total pages reclaimed                      3675935   3190031   3142255   3233822
> > %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
> > %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
> > %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
> > Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
> > Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
> > 
> > Scanned/reclaimed ratios again look good with big improvements in
> > efficiency. The Scanned/written ratios also look much improved. With a
> > better scanned/written ratio, there is an expectation that IO would be more
> > efficient and indeed, the time spent in direct reclaim is much reduced by
> > the full series and kswapd spends a little less time awake.
> 
> Wait.  The reclaim efficiency got *worse*, didn't it?  To reclaim
> 3,xxx,xxx pages, the number of pages we had to scan went from 7,xxx,xxx
> up to 13,xxx,xxx?
> 

Arguably, yes. The biggest change here is due to lumpy reclaim giving up
on a whole range of pages when one of them fails to reclaim. An impact of
this is that it ends up scanning more to find a suitable contiguous range
of pages, because it no longer keeps stupidly retrying the same
unreclaimable page. So, it looks worse from a scanning/reclaim perspective
but it's more sensible behaviour (and it finishes faster).

Similarly, when reclaimers are no longer unnecessarily sleeping, they
have more time to be scanning pushing up the rates slightly. The
allocation success rates are slightly higher which might be a reflection
of the higher scanning.

The reclaim efficiency is improved by the later two patches again and
while not as good as the "vanilla" kernel, that only has good efficiency
figures because it's grinding on the same useless pages chewing up CPU
time. Overall, it's still better behaviour.

> >
> > ...
> >
> > I think this series is ready for much wider testing. The lowlumpy patches in
> > particular should be relatively uncontroversial. While their largest impact
> > can be seen in the high order stress tests, they would also have an impact
> > if SLUB was configured (these tests are based on slab) and stalls in lumpy
> > reclaim could be partially responsible for some desktop stalling reports.
> 
> slub sucks :(
> 
> Is this patchset likely to have any impact on the "hey my net driver
> couldn't do an order 3 allocation" reports?  I guess not.
> 

Some, actually. Direct reclaimers and kswapd are not going to waste as
much time trying to reclaim those order-3 pages so there will be less
stalling and kswapd might keep ahead of the rush of allocators.

Sorry I won't get the chance to respond to other mails for the next few
days. Have to hit the road.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
@ 2010-09-17  7:52     ` Mel Gorman
  0 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-17  7:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Sep 16, 2010 at 03:28:04PM -0700, Andrew Morton wrote:
> On Wed, 15 Sep 2010 13:27:43 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This is v2 of a series to reduce some of the latencies seen in page reclaim
> > and to improve the efficiency a bit.
> 
> epic changelog!
> 

Thanks

> >
> > ...
> >
> > The tests run were as follows
> > 
> > kernbench
> > 	compile-based benchmark. Smoke test performance
> > 
> > sysbench
> > 	OLTP read-only benchmark. Will be re-run in the future as read-write
> > 
> > micro-mapped-file-stream
> > 	This is a micro-benchmark from Johannes Weiner that accesses a
> > 	large sparse-file through mmap(). It was configured to run in only
> > 	single-CPU mode but can be indicative of how well page reclaim
> > 	identifies suitable pages.
> > 
> > stress-highalloc
> > 	Tries to allocate huge pages under heavy load.
> > 
> > kernbench, iozone and sysbench did not report any performance regression
> > on any machine. sysbench did pressure the system lightly and there was reclaim
> > activity but there was no difference of major interest between the kernels.
> > 
> > X86-64 micro-mapped-file-stream
> > 
> >                                       traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
> > pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
> > pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
> > pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
> > pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
> > pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
> > pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
> > pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
> > pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
> > pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
> > allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
> > 
> > These are based on the raw figures taken from /proc/vmstat. It's a rough
> > measure of reclaim activity. Note that allocstall counts are higher because
> > we are entering direct reclaim more often as a result of not sleeping in
> > congestion. In itself, it's not necessarily a bad thing. It's easier to
> > get a view of what happened from the vmscan tracepoint report.
> > 
> > FTrace Reclaim Statistics: vmscan
> > 
> >                                 traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> > Direct reclaims                                443        273        513       1568 
> > Direct reclaim pages scanned                305968     280402     600825     957933 
> > Direct reclaim pages reclaimed               43503      19005      30327     117191 
> > Direct reclaim write file async I/O              0          0          0          0 
> > Direct reclaim write anon async I/O              0          3          4         12 
> > Direct reclaim write file sync I/O               0          0          0          0 
> > Direct reclaim write anon sync I/O               0          0          0          0 
> > Wake kswapd requests                        187649     132338     191695     267701 
> > Kswapd wakeups                                   3          1          4          1 
> > Kswapd pages scanned                       4599269    4454162    4296815    3891906 
> > Kswapd pages reclaimed                     2295947    2428434    2399818    2319706 
> > Kswapd reclaim write file async I/O              1          0          1          1 
> > Kswapd reclaim write anon async I/O             59        187         41        222 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> > Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96 
> > Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19 
> > 
> > Total pages scanned                        4905237   4734564   4897640   4849839
> > Total pages reclaimed                      2339450   2447439   2430145   2436897
> > %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
> > %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> > %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> > Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
> > Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
> > 
> > What is interesting here for nocongest in particular is that while direct
> > reclaim scans more pages, the overall number of pages scanned remains the same
> > and the ratio of pages scanned to pages reclaimed is more or less the same. In
> > other words, while we are sleeping less, reclaim is not doing more work and
> > as direct reclaim and kswapd are awake for less time, they would appear to be doing less work.
> 
> Yes, I think the reclaimed/scanned ratio (what I call "reclaim
> efficiency") is a key metric.
> 

Indeed.

> 50% is low!  What's the testcase here? micro-mapped-file-stream?
> 

It's a streaming write workload Johannes posted at
http://linux--kernel.googlegroups.com/attach/922930ad782c993f/mapped-file-stream.c?gda=C9ZmZUYAAAC7YRbTg15qnVftAVpdAUbEdtSiuVqDFQ7IygxgoOgCJibbrMllVnGRuK4kFCYFogdx40jamwa1UURqDcgHarKEE-Ea7GxYMt0t6nY0uV5FIQ&part=2
He considered it to be a somewhat adverse workload for reclaim.

> It's strange that the "total pages reclaimed" increased a little.  Just
> a measurement glitch?
> 

Probably not a glitch but the measurements are system-wide. Depending on
the starting state of the system when the benchmark ran, there will be
slightly different scanning numbers.

> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest     waited                87        196         64          0 
> > Direct time   congest     waited            4604ms     4732ms     5420ms        0ms 
> > Direct full   congest     waited                72        145         53          0 
> > Direct number conditional waited                 0          0        324       1315 
> > Direct time   conditional waited               0ms        0ms        0ms        0ms 
> > Direct full   conditional waited                 0          0          0          0 
> > KSwapd number congest     waited                20         10         15          7 
> > KSwapd time   congest     waited            1264ms      536ms      884ms      284ms 
> > KSwapd full   congest     waited                10          4          6          2 
> > KSwapd number conditional waited                 0          0          0          0 
> > KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
> > KSwapd full   conditional waited                 0          0          0          0 
> > 
> > The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
> > all asleep with the patches.
> > 
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
> > Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
> 
> Is that user time plus system time? 

Yes.

> If so, why didn't user+sys equal
> elapsed in the we-never-slept-in-congestion-wait() case?  Because the
> test's CPU got stolen by kswapd perhaps?
> 

One possibility. The other is IO wait time. I'll think about it some
more. I'm afraid this mail is a bit rushed because I'm about to leave
for a wedding. I won't be back online until Monday.

> > Overall, the tests completed faster. It is interesting to note that backing off further
> > when a zone is congested and not just a BDI was more efficient overall.
> > 
> > PPC64 micro-mapped-file-stream
> > pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
> > pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
> > pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
> > pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
> > pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> > pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
> > allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
> > 
> > ...
> >
> > 
> > X86-64 STRESS-HIGHALLOC
> >                 traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
> > Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
> > Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
> > At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
> > 
> > Success figures across the board are broadly similar.
> > 
> >                 traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3waitwriteback-v2r4
> > Direct reclaims                               1045        944        886        887 
> > Direct reclaim pages scanned                135091     119604     109382     101019 
> > Direct reclaim pages reclaimed               88599      47535      47863      46671 
> > Direct reclaim write file async I/O            494        283        465        280 
> > Direct reclaim write anon async I/O          29357      13710      16656      13462 
> > Direct reclaim write file sync I/O             154          2          2          3 
> > Direct reclaim write anon sync I/O           14594        571        509        561 
> > Wake kswapd requests                          7491        933        872        892 
> > Kswapd wakeups                                 814        778        731        780 
> > Kswapd pages scanned                       7290822   15341158   11916436   13703442 
> > Kswapd pages reclaimed                     3587336    3142496    3094392    3187151 
> > Kswapd reclaim write file async I/O          91975      32317      28022      29628 
> > Kswapd reclaim write anon async I/O        1992022     789307     829745     849769 
> > Kswapd reclaim write file sync I/O               0          0          0          0 
> > Kswapd reclaim write anon sync I/O               0          0          0          0 
> > Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07 
> > Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82 
> > 
> > Total pages scanned                        7425913  15460762  12025818  13804461
> > Total pages reclaimed                      3675935   3190031   3142255   3233822
> > %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
> > %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
> > %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
> > Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
> > Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
> > 
> > Scanned/reclaimed ratios again look good with big improvements in
> > efficiency. The Scanned/written ratios also look much improved. With a
> > better scanned/written ratio, there is an expectation that IO would be more
> > efficient and indeed, the time spent in direct reclaim is much reduced by
> > the full series and kswapd spends a little less time awake.
> 
> Wait.  The reclaim efficiency got *worse*, didn't it?  To reclaim
> 3,xxx,xxx pages, the number of pages we had to scan went from 7,xxx,xxx
> up to 13,xxx,xxx?
> 

Arguably, yes. The biggest change here is due to lumpy reclaim giving up
on a range of pages when one of them fails to reclaim. An impact of this
is that it ends up scanning more for a suitable contiguous range of pages
because it no longer keeps retrying the same unreclaimable page. So it
looks worse from a scanning/reclaim perspective but it's more sensible
behaviour (and finishes faster).

Similarly, when reclaimers are no longer sleeping unnecessarily, they
have more time to scan, pushing up the rates slightly. The allocation
success rates are slightly higher, which might be a reflection of the
higher scanning.

The reclaim efficiency is improved again by the latter two patches and,
while not as good as the "vanilla" kernel, that kernel only has good
efficiency figures because it is grinding on the same useless pages and
chewing up CPU time. Overall, it's still better behaviour.

> >
> > ...
> >
> > I think this series is ready for much wider testing. The lowlumpy patches in
> > particular should be relatively uncontroversial. While their largest impact
> > can be seen in the high order stress tests, they would also have an impact
> > if SLUB was configured (these tests are based on slab) and stalls in lumpy
> > reclaim could be partially responsible for some desktop stalling reports.
> 
> slub sucks :(
> 
> Is this patchset likely to have any impact on the "hey my net driver
> couldn't do an order 3 allocation" reports?  I guess not.
> 

Some, actually. Direct reclaimers and kswapd are not going to waste as
much time trying to reclaim those order-3 pages, so there will be less
stalling and kswapd might keep ahead of the rush of allocators.

Sorry I won't get the chance to respond to other mails for the next few
days. Have to hit the road.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-16 22:28     ` Andrew Morton
@ 2010-09-20  9:52       ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-20  9:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Thu, Sep 16, 2010 at 03:28:10PM -0700, Andrew Morton wrote:
> On Wed, 15 Sep 2010 13:27:51 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > If wait_iff_congested() is called with no BDI congested, the function simply
> > calls cond_resched(). In the event there is significant writeback happening
> > in the zone that is being reclaimed, this can be a poor decision as reclaim
> > would succeed once writeback was completed. Without any backoff logic,
> > younger clean pages can be reclaimed resulting in more reclaim overall and
> > poor performance.
> 
> This is because cond_resched() is a no-op,

It can be a no-op, surely, but there is an expectation that it will sometimes schedule.

> and we skip around the
> under-writeback pages and go off and look further along the LRU for
> younger clean pages, yes?
> 

Yes.

> > This patch tracks how many pages backed by a congested BDI were found during
> > scanning. If all the dirty pages encountered on a list isolated from the
> > LRU belong to a congested BDI, the zone is marked congested until the zone
> > reaches the high watermark.
> 
> High watermark, or low watermark?
> 

High watermark. The check is made by kswapd.

> The terms are rather ambiguous so let's avoid them.  Maybe "full"
> watermark and "empty"?
> 

Unfortunately they are ambiguous to me. I know what the high watermark
is but not what the full or empty watermarks are.

> >
> > ...
> >
> > @@ -706,6 +726,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			goto keep;
> >  
> >  		VM_BUG_ON(PageActive(page));
> > +		VM_BUG_ON(page_zone(page) != zone);
> 
> ?
> 

It should not be the case that pages from multiple zones exist on the list
passed to shrink_page_list(). Let's say someone broke that assumption in the
future: which zone should be marked congested? There is no way to know, so
let's catch the bug if the assumption is ever broken.

> >  		sc->nr_scanned++;
> >  
> >
> > ...
> >
> > @@ -903,6 +928,15 @@ keep_lumpy:
> >  		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> >  	}
> >  
> > +	/*
> > +	 * Tag a zone as congested if all the dirty pages encountered were
> > +	 * backed by a congested BDI. In this case, reclaimers should just
> > +	 * back off and wait for congestion to clear because further reclaim
> > +	 * will encounter the same problem
> > +	 */
> > +	if (nr_dirty == nr_congested)
> > +		zone_set_flag(zone, ZONE_CONGESTED);
> 
> The implicit "100%" there is a magic number.  hrm.
> 

It is, but any other value for that number would be very specific to a
workload or a machine. A sysctl would have to be maintained and I
couldn't convince myself that anyone could do something sensible with
the value.

Rather than introducing a new tunable for this, I was toying over the weekend
with the idea of tracking the scanned/reclaimed ratio within the scan control -
possibly on a per-zone basis but more likely globally. When this ratio drops
below a given threshold, start increasing the time it backs off for, up to a
maximum of HZ/10. There are a lot of details to iron out but it's possibly a
better long-term direction than adding a tunable for this implicit magic number
because it would adapt to what is happening in the current workload.
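
Something like the following rough, untested sketch is the shape of it. The
scanned/reclaimed counters in scan_control are made up for illustration, they
do not exist today:

static long reclaim_backoff_timeout(struct scan_control *sc)
{
	/* Hypothetical counters, not in the current scan_control */
	unsigned long scanned = sc->nr_scanned_total;
	unsigned long reclaimed = sc->nr_reclaimed_total;
	unsigned long efficiency;

	if (!scanned)
		return 0;

	/* Percentage of scanned pages that were actually reclaimed */
	efficiency = reclaimed * 100 / scanned;

	/* Efficiency is still reasonable, no need to back off at all */
	if (efficiency >= 50)
		return 0;

	/* Back off for longer as efficiency drops, capped at HZ/10 */
	return min_t(long, HZ / 10, (HZ / 10) * (50 - efficiency) / 50);
}

A reclaimer would then sleep for reclaim_backoff_timeout(sc) jiffies instead
of a flat HZ/10. The 50% threshold is as arbitrary as the implicit 100% being
discussed; it is only there to show the shape of the idea.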

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone fix
  2010-09-15 12:27   ` Mel Gorman
@ 2010-09-20 13:05     ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-20 13:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

Based on feedback from Minchan Kim, I updated the patch
writeback-do-not-sleep-on-the-congestion-queue-if-there-are-no-congested-bdis-or-if-significant-congestion-is-not-being-encountered-in-the-current-zone.patch
currently in the mm tree in the following manner

1. Deleted the bdi_queue_status enum until such point as we distinguish
   between being unable to write to the IO queue and it being congested
2. Direct reclaimers consider congestion in the first zone of the zonelist.
   In the mm version of the patch, it scanned for the zone with the most
   pages in writeback. That made more sense for an earlier version of
   wait_iff_congested().

Tests did not show any significant difference. This patch should be
merged with
writeback-do-not-sleep-on-the-congestion-queue-if-there-are-no-congested-bdis-or-if-significant-congestion-is-not-being-encountered-in-the-current-zone.patch

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   57 ++++++++++++---------------------------------------------
 1 files changed, 12 insertions(+), 45 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ef6294..aaf03ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -311,30 +311,20 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-enum bdi_queue_status {
-	QUEUEWRITE_DENIED,
-	QUEUEWRITE_CONGESTED,
-	QUEUEWRITE_ALLOWED,
-};
-
-static enum bdi_queue_status may_write_to_queue(struct backing_dev_info *bdi,
+static int may_write_to_queue(struct backing_dev_info *bdi,
 			      struct scan_control *sc)
 {
-	enum bdi_queue_status ret = QUEUEWRITE_DENIED;
-
 	if (current->flags & PF_SWAPWRITE)
-		return QUEUEWRITE_ALLOWED;
+		return 1;
 	if (!bdi_write_congested(bdi))
-		return QUEUEWRITE_ALLOWED;
-	else
-		ret = QUEUEWRITE_CONGESTED;
+		return 1;
 	if (bdi == current->backing_dev_info)
-		return QUEUEWRITE_ALLOWED;
+		return 1;
 
 	/* lumpy reclaim for hugepage often need a lot of write */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		return QUEUEWRITE_ALLOWED;
-	return ret;
+		return 1;
+	return 0;
 }
 
 /*
@@ -362,8 +352,6 @@ static void handle_write_error(struct address_space *mapping,
 typedef enum {
 	/* failed to write page out, page is locked */
 	PAGE_KEEP,
-	/* failed to write page out due to congestion, page is locked */
-	PAGE_KEEP_CONGESTED,
 	/* move page to the active list, page is locked */
 	PAGE_ACTIVATE,
 	/* page has been sent to the disk successfully, page is unlocked */
@@ -413,15 +401,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	switch (may_write_to_queue(mapping->backing_dev_info, sc)) {
-	case QUEUEWRITE_CONGESTED:
-		return PAGE_KEEP_CONGESTED;
-	case QUEUEWRITE_DENIED:
-		disable_lumpy_reclaim_mode(sc);
+	if (!may_write_to_queue(mapping->backing_dev_info, sc))
 		return PAGE_KEEP;
-	case QUEUEWRITE_ALLOWED:
-		;
-	}
 
 	if (clear_page_dirty_for_io(page)) {
 		int res;
@@ -815,9 +796,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/* Page is dirty, try to write it out here */
 			switch (pageout(page, mapping, sc)) {
-			case PAGE_KEEP_CONGESTED:
-				nr_congested++;
 			case PAGE_KEEP:
+				nr_congested++;
 				goto keep_locked;
 			case PAGE_ACTIVATE:
 				goto activate_locked;
@@ -1975,24 +1955,11 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
 		    priority < DEF_PRIORITY - 2) {
-			struct zone *active_zone = NULL;
-			unsigned long max_writeback = 0;
-			for_each_zone_zonelist(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask)) {
-				unsigned long writeback;
-
-				/* Initialise for first zone */
-				if (active_zone == NULL)
-					active_zone = zone;
-
-				writeback = zone_page_state(zone, NR_WRITEBACK);
-				if (writeback > max_writeback) {
-					max_writeback = writeback;
-					active_zone = zone;
-				}
-			}
+			struct zone *preferred_zone;
 
-			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
+			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
+							NULL, &preferred_zone);
+			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
 		}
 	}
 


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-20  9:52       ` Mel Gorman
@ 2010-09-21 21:44         ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2010-09-21 21:44 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Mon, 20 Sep 2010 10:52:39 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> > > This patch tracks how many pages backed by a congested BDI were found during
> > > scanning. If all the dirty pages encountered on a list isolated from the
> > > LRU belong to a congested BDI, the zone is marked congested until the zone
> > > reaches the high watermark.
> > 
> > High watermark, or low watermark?
> > 
> 
> High watermark. The check is made by kswapd.
> 
> > The terms are rather ambiguous so let's avoid them.  Maybe "full"
> > watermark and "empty"?
> > 
> 
> Unfortunately they are ambiguous to me. I know what the high watermark
> is but not what the full or empty watermarks are.

Really.  So what's the "high" watermark?  From the above text I'm
thinking that you mean the high watermark is when the queue has a small
number of requests and the low watermark is when the queue has a large
number of requests.

I'd have thought that this is backwards: the "high" watermark is when
the queue has a large (ie: high) number of requests.

A problem.  How do we fix it?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-21 21:44         ` Andrew Morton
@ 2010-09-21 22:10           ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-09-21 22:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, Sep 21, 2010 at 02:44:13PM -0700, Andrew Morton wrote:
> On Mon, 20 Sep 2010 10:52:39 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > > > This patch tracks how many pages backed by a congested BDI were found during
> > > > scanning. If all the dirty pages encountered on a list isolated from the
> > > > LRU belong to a congested BDI, the zone is marked congested until the zone
> > > > reaches the high watermark.
> > > 
> > > High watermark, or low watermark?
> > > 
> > 
> > High watermark. The check is made by kswapd.
> > 
> > > The terms are rather ambiguous so let's avoid them.  Maybe "full"
> > > watermark and "empty"?
> > > 
> > 
> > Unfortunately they are ambiguous to me. I know what the high watermark
> > is but not what the full or empty watermarks are.
> 
> Really.  So what's the "high" watermark? 

The high watermark is the point where kswapd goes back to sleep because
enough pages have been reclaimed. It's a proxy measure for memory pressure.
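
Roughly speaking, once kswapd sees the zone back over its high watermark it
considers the zone balanced and that is the point where the congestion flag
would be cleared. Very loosely, and not the literal balance_pgdat() code:

	if (zone_watermark_ok(zone, order, high_wmark_pages(zone), 0, 0)) {
		/* Enough free pages again, the pressure is off */
		zone_clear_flag(zone, ZONE_CONGESTED);
	}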

> From the above text I'm
> thinking that you mean the high watermark is when the queue has a small
> number of requests and the low watermark is when the queue has a large
> number of requests.
> 

I was expecting "zone reaches the high watermark" to be the clue that I was
talking about zone watermarks and not an IO queue, but the wording could be better.


> I'd have thought that this is backwards: the "high" watermark is when
> the queue has a large (ie: high) number of requests.
> 
> A problem.  How do we fix it?
> 

I will try and clarify. How about this as a replacement paragraph?

==== CUT HERE ====
This patch tracks how many pages backed by a congested BDI were found
during scanning. If all the dirty pages isolated from the LRU are
backed by a congested BDI, the zone is marked congested. A zone is marked
uncongested when enough pages have been freed for the zone's high watermark
to be reached, indicating that the zone is no longer under memory
pressure. wait_iff_congested() checks whether there are any congested BDIs
and, if so, whether the current zone is marked congested. If both conditions
are met, the caller sleeps on the congestion queue. Otherwise it calls
cond_resched() to yield the processor if necessary.
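
As an aside, and not part of the replacement text above, the flow described
boils down to something like the sketch below. The helper names approximate
what the patch adds, and the real function also times how long it slept for
the tracepoint:

long wait_iff_congested(struct zone *zone, int sync, long timeout)
{
	/* No congested BDI, or this zone is not flagged congested: yield */
	if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
	    !zone_is_reclaim_congested(zone)) {
		cond_resched();
		return 0;
	}

	/* Otherwise sleep on the congestion queue as congestion_wait() does */
	return congestion_wait(sync, timeout);
}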

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
  2010-09-21 22:10           ` Mel Gorman
@ 2010-09-21 22:24             ` Andrew Morton
  -1 siblings, 0 replies; 59+ messages in thread
From: Andrew Morton @ 2010-09-21 22:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Tue, 21 Sep 2010 23:10:08 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Sep 21, 2010 at 02:44:13PM -0700, Andrew Morton wrote:
> > On Mon, 20 Sep 2010 10:52:39 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > > > This patch tracks how many pages backed by a congested BDI were found during
> > > > > scanning. If all the dirty pages encountered on a list isolated from the
> > > > > LRU belong to a congested BDI, the zone is marked congested until the zone
> > > > > reaches the high watermark.
> > > > 
> > > > High watermark, or low watermark?
> > > > 
> > > 
> > > High watermark. The check is made by kswapd.
> > > 
> > > > The terms are rather ambiguous so let's avoid them.  Maybe "full"
> > > > watermark and "empty"?
> > > > 
> > > 
> > > Unfortunately they are ambiguous to me. I know what the high watermark
> > > is but not what the full or empty watermarks are.
> > 
> > Really.  So what's the "high" watermark? 
> 
> The high watermark is the point where kswapd goes back to sleep because
> enough pages have been reclaimed. It's a proxy measure for memory pressure.
> 
> > From the above text I'm
> > thinking that you mean the high watermark is when the queue has a small
> > number of requests and the low watermark is when the queue has a large
> > number of requests.
> > 
> 
> I was expecting "zone reaches the high watermark" was the clue that I was
> talking about zone watermarks and not an IO queue but it could be better.

It was more a rant about general terminology than about one specific case.

> I will try and clarify. How about this as a replacement paragraph?

Works for me, thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-09-15 12:27 ` Mel Gorman
  (?)
@ 2010-10-14 15:28   ` Christian Ehrhardt
  -1 siblings, 0 replies; 59+ messages in thread
From: Christian Ehrhardt @ 2010-10-14 15:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

Seeing the patches Mel sent a few weeks ago, I realized that this series might be at least partially related to my reports in 1Q 2010 - so I ran my testcase on a few kernels to provide you with some more backing data.

Results are always the average of three iozone runs, as iozone is known to be somewhat noisy - especially when affected by the issue I am trying to show here.
As discussed in detail in older threads, the setup uses 16 disks and scales the number of concurrent iozone processes.
Processes are evenly distributed so that there is always one process per disk.
In the past we reported 40% to 80% degradation for the sequential read case based on 2.6.32, which can still be seen.
What we found was that page cache allocations with the GFP_COLD flag loop for a long time between try_to_free, get_page and reclaim: because free makes some progress each time, the GFP_COLD allocations keep looping and retrying.
In addition, my case had no writes at all, which forced congestion_wait to wait the full timeout every time.
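
For reference, the loop I mean is roughly the following, heavily simplified
from the 2.6.32-era __alloc_pages_slowpath() and not the literal code:

retry:
	page = get_page_from_freelist(...);	/* try the free lists again */
	if (page)
		return page;

	page = __alloc_pages_direct_reclaim(..., &did_some_progress);
	if (page)
		return page;

	/* Direct reclaim freed something, so the allocation is retried... */
	if (did_some_progress &&
	    should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
		/* ...but with no writes in flight this sleeps the full timeout */
		congestion_wait(BLK_RW_ASYNC, HZ/50);
		goto retry;
	}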

Kernel (git)                   4          8         16   deviation #16 case                           comment
linux-2.6.30              902694    1396073    1892624                 base                              base
linux-2.6.32              752008     990425     932938               -50.7%     impact as reported in 1Q 2010
linux-2.6.35               63532      71573      64083               -96.6%                    got even worse
linux-2.6.35.6            176485     174442     212102               -88.8%  fixes useful, but still far away
linux-2.6.36-rc4-trace    119683     188997     187012               -90.1%                         still bad 
linux-2.6.36-rc4-fix      884431    1114073    1470659               -22.3%            Mel's fixes help a lot!

So much for the case that I used when I reported the issue earlier this year.
The short summary is that the patch series from Mel helps a lot for my test case.

So I guess, Mel, you now want some traces of the last two cases, right?
Could you give me some minimal advice on what exactly you would need and how to capture it?

In addition, it worked fine, so you can add both tags however you like.
Reported-by: <ehrhardt@linux.vnet.ibm.com>
Tested-by: <ehrhardt@linux.vnet.ibm.com>

Note: it might be worth mentioning that the write case has improved a lot since 2.6.30.
Not directly related to the read degradations, but improvements of up to 150% (write) and 272% (rewrite).
Therefore not everything is bad :-)

Any further comments or questions?

Christian

On 09/15/2010 02:27 PM, Mel Gorman wrote:
> This is v2 of a series to reduce some of the latencies seen in page reclaim
> and to improve the efficiency a bit.  There are a number of changes in this
> revision. The first is to drop the patches avoiding writeback from direct
> reclaim again. Wu asked me to look at a large number of his patches and I felt
> it was best to do that independent of this series which should be relatively
> uncontroversial. The second big change is to wait_iff_congested(). There
> were a few complaints that the avoidance heuristic was way too fuzzy and
> so I tried following Andrew's suggestion to take note of the return value
> of bdi_write_congested() in may_write_to_queue() to identify when a zone
> is congested.
> 
> Changelog since V2
>    o Reshuffle patches to order from least to most controversial
>    o Drop the patches dealing with writeback avoidance. Wu is working
>      on some patches that potentially collide with this area so it
>      will be revisited later
>    o Use BDI congestion feedback in wait_iff_congested() instead of
>      making a determination based on number of pages currently being
>      written back
>    o Do not use lock_page in pageout path
>    o Rebase to 2.6.36-rc4
> 
> Changelog since V1
>    o Fix mis-named function in documentation
>    o Added reviewed and acked bys
> 
> There have been numerous reports of stalls that pointed at the problem being
> somewhere in the VM. There are multiple roots to the problems which means
> dealing with any of the root problems in isolation is tricky to justify on
> their own and they would still need integration testing. This patch series
> puts together two different patch sets which in combination should tackle
> some of the root causes of latency problems being reported.
> 
> Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
> most important results is being able to calculate the scanning/reclaim
> ratio as a measure of the amount of work being done by page reclaim.
> 
> Patch 2 accounts for time spent in congestion_wait.
> 
> Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
> this series. It has been noted that lumpy reclaim is far too aggressive and
> trashes the system somewhat. As SLUB uses high-order allocations, a large
> cost incurred by lumpy reclaim will be noticeable. It was also reported
> during transparent hugepage support testing that lumpy reclaim was trashing
> the system and these patches should mitigate that problem without disabling
> lumpy reclaim.
> 
> Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait().
> wait_iff_congested() only sleeps if there is a BDI that is currently congested.
> 
> Patch 8 notes that any BDI being congested is not necessarily a problem
> because there could be multiple BDIs of varying speeds and numberous zones. It
> attempts to track when a zone being reclaimed contains many pages backed
> by a congested BDI and if so, reclaimers wait on the congestion queue.
> 
> I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
> machine had 3G of RAM and the CPUs were
> 
> X86:    Intel P4 2-core
> X86-64: AMD Phenom 4-core
> PPC64:  PPC970MP
> 
> Each used a single disk and the onboard IO controller. Dirty ratio was left
> at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to
> keep this report short. Four kernels were tested each based on v2.6.36-rc4
> 
> traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
> lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
> waitcongest-v2r3:   Patches 1-7 to only wait on congestion
> waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
> 
> nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
> nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
> 
> The tests run were as follows
> 
> kernbench
> 	compile-based benchmark. Smoke test performance
> 
> sysbench
> 	OLTP read-only benchmark. Will be re-run in the future as read-write
> 
> micro-mapped-file-stream
> 	This is a micro-benchmark from Johannes Weiner that accesses a
> 	large sparse-file through mmap(). It was configured to run in only
> 	single-CPU mode but can be indicative of how well page reclaim
> 	identifies suitable pages.
> 
> stress-highalloc
> 	Tries to allocate huge pages under heavy load.
> 
> kernbench, iozone and sysbench did not report any performance regression
> on any machine. sysbench did pressure the system lightly and there was reclaim
> activity but there was no difference of major interest between the kernels.
> 
> X86-64 micro-mapped-file-stream
> 
>                                        traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
> pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
> pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
> pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
> pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
> pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
> pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
> pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
> pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
> pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
> allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
> 
> These are based on the raw figures taken from /proc/vmstat. It's a rough
> measure of reclaim activity. Note that allocstall counts are higher because
> we are entering direct reclaim more often as a result of not sleeping in
> congestion. In itself, it's not necessarily a bad thing. It's easier to
> get a view of what happened from the vmscan tracepoint report.
> 
> FTrace Reclaim Statistics: vmscan
> 
>                                  traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims                                443        273        513       1568
> Direct reclaim pages scanned                305968     280402     600825     957933
> Direct reclaim pages reclaimed               43503      19005      30327     117191
> Direct reclaim write file async I/O              0          0          0          0
> Direct reclaim write anon async I/O              0          3          4         12
> Direct reclaim write file sync I/O               0          0          0          0
> Direct reclaim write anon sync I/O               0          0          0          0
> Wake kswapd requests                        187649     132338     191695     267701
> Kswapd wakeups                                   3          1          4          1
> Kswapd pages scanned                       4599269    4454162    4296815    3891906
> Kswapd pages reclaimed                     2295947    2428434    2399818    2319706
> Kswapd reclaim write file async I/O              1          0          1          1
> Kswapd reclaim write anon async I/O             59        187         41        222
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
> Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
> 
> Total pages scanned                        4905237   4734564   4897640   4849839
> Total pages reclaimed                      2339450   2447439   2430145   2436897
> %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
> %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
> Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
> 
> What is interesting here for nocongest in particular is that while direct
> reclaim scans more pages, the overall number of pages scanned remains the same
> and the ratio of pages scanned to pages reclaimed is more or less the same. In
> other words, while we are sleeping less, reclaim is not doing more work and
> as direct reclaim and kswapd are awake for less time, they would appear to be doing less work.
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                87        196         64          0
> Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
> Direct full   congest     waited                72        145         53          0
> Direct number conditional waited                 0          0        324       1315
> Direct time   conditional waited               0ms        0ms        0ms        0ms
> Direct full   conditional waited                 0          0          0          0
> KSwapd number congest     waited                20         10         15          7
> KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
> KSwapd full   congest     waited                10          4          6          2
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> 
> The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
> all asleep with the patches.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
> Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
> 
> Overall, the tests completed faster. It is interesting to note that backing off further
> when a zone is congested and not just a BDI was more efficient overall.
> 
> PPC64 micro-mapped-file-stream
> pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
> pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
> pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
> pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
> pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
> allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
> 
> Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
> 
> FTrace Reclaim Statistics: vmscan
> Direct reclaims                                977       2709       2098       5136
> Direct reclaim pages scanned                629825     963814    1063938    1711935
> Direct reclaim pages reclaimed               75550     242538     150904     387647
> Direct reclaim write file async I/O              0          0          0          2
> Direct reclaim write anon async I/O              0         10          0          4
> Direct reclaim write file sync I/O               0          0          0          0
> Direct reclaim write anon sync I/O               0          0          0          0
> Wake kswapd requests                        392119    1201712     571935     571921
> Kswapd wakeups                                   3          2          3          3
> Kswapd pages scanned                       4601307    4128076    3912317    3377165
> Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
> Kswapd reclaim write file async I/O             20          1          1          1
> Kswapd reclaim write anon async I/O             57        132         11        121
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
> Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
> 
> Total pages scanned                        5231132   5091890   4976255   5089100
> Total pages reclaimed                      2508073   2561335   2463577   2532263
> %age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
> %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
> Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
> 
> Again, a similar trend: the congestion_wait changes mean that direct
> reclaim scans more pages, but the overall number of pages scanned, while
> slightly reduced, is very similar. The ratio of scanning/reclaimed remains
> roughly similar. The downside is that kswapd and direct reclaim were awake
> longer and for a larger percentage of the overall workload. It's possible
> there were big differences in the amount of time spent reclaiming slab
> pages between the different kernels, which is plausible considering that
> the micro test runs after fsmark and sysbench.
> 
> Trace Reclaim Statistics: congestion_wait
> Direct number congest     waited               845       1312        104          0
> Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
> Direct full   congest     waited               745       1105         72          0
> Direct number conditional waited                 0          0       1322       2935
> Direct time   conditional waited               0ms        0ms       12ms      312ms
> Direct full   conditional waited                 0          0          0          3
> KSwapd number congest     waited                39        102         75         63
> KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
> KSwapd full   congest     waited                20         48         46         25
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> 
> The vanilla kernel spent 20 seconds asleep in direct reclaim and only 312ms
> asleep with the patches.  The time kswapd spent congest waited was also
> reduced by a large factor.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)        26.58     28.05      26.9     28.47
> Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
> 
> With all patches applies, the completion times are very similar.
> 
> 
> X86-64 STRESS-HIGHALLOC
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
> Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
> Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
> At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
> 
> Success figures across the board are broadly similar.
> 
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims                               1045        944        886        887
> Direct reclaim pages scanned                135091     119604     109382     101019
> Direct reclaim pages reclaimed               88599      47535      47863      46671
> Direct reclaim write file async I/O            494        283        465        280
> Direct reclaim write anon async I/O          29357      13710      16656      13462
> Direct reclaim write file sync I/O             154          2          2          3
> Direct reclaim write anon sync I/O           14594        571        509        561
> Wake kswapd requests                          7491        933        872        892
> Kswapd wakeups                                 814        778        731        780
> Kswapd pages scanned                       7290822   15341158   11916436   13703442
> Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
> Kswapd reclaim write file async I/O          91975      32317      28022      29628
> Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
> Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
> 
> Total pages scanned                        7425913  15460762  12025818  13804461
> Total pages reclaimed                      3675935   3190031   3142255   3233822
> %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
> %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
> %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
> Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
> Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
> 
> Scanned/reclaimed ratios again look good, with big improvements in
> efficiency. The scanned/written ratios also look much improved. With a
> better scanned/written ratio, the expectation is that IO would be more
> efficient and indeed, the time spent in direct reclaim is much reduced by
> the full series and kswapd spends a little less time awake.
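For reference, the percentages in those summary rows can be reproduced
directly from the totals above; taking the traceonly and lowlumpy columns:

  scanned/reclaimed:  3675935 / 7425913 = 49.50%    3190031 / 15460762 = 20.63%
  scanned/written:    2128596 / 7425913 = 28.66%    (written = sum of the
                      eight "write ... I/O" rows: 494 + 29357 + 154 + 14594
                      + 91975 + 1992022, the kswapd sync rows being zero)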
> 
> Overall, the indications here are that allocations were happening much
> faster, and this can be seen in a graph of the latency figures taken as
> the allocations were taking place:
> http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited              1333        204        169          4
> Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
> Direct full   congest     waited               756         92         69          2
> Direct number conditional waited                 0          0         26        186
> Direct time   conditional waited               0ms        0ms        0ms     2504ms
> Direct full   conditional waited                 0          0          0         25
> KSwapd number congest     waited                 4        395        227        282
> KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
> KSwapd full   congest     waited                 3        232         98        176
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> KSwapd full   conditional waited               318          0        312          9
> 
> 
> Overall, the time spent sleeping is reduced. kswapd is still hitting
> congestion_wait(), but that is because there are remaining callers where it
> wasn't clear in advance whether they should be changed to wait_iff_congested()
> or not. The sleep times are still reduced overall though - from roughly 79
> seconds to about 19.
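To make the behavioural difference concrete, below is a minimal userspace
sketch of the idea behind the two primitives. It is illustrative only: the
real congestion_wait()/wait_iff_congested() live in mm/backing-dev.c, take a
sync flag and a timeout (plus, for wait_iff_congested(), the zone being
reclaimed), and consult real BDI/zone congestion state rather than a plain
flag.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the real BDI/zone congestion state. */
static bool congested;

/* Old behaviour: sleep for the full timeout unconditionally. */
static long congestion_wait_sketch(long timeout_ms)
{
	usleep(timeout_ms * 1000);
	return 0;
}

/*
 * New behaviour: only sleep if something relevant is congested,
 * otherwise hand the unused budget straight back to the caller.
 */
static long wait_iff_congested_sketch(long timeout_ms)
{
	if (!congested)
		return timeout_ms;
	usleep(timeout_ms * 1000);
	return 0;
}

int main(void)
{
	congested = false;
	printf("uncongested: %ld ms of budget returned without sleeping\n",
	       wait_iff_congested_sketch(100));
	printf("congestion_wait: %ld ms returned (the full 100ms was slept)\n",
	       congestion_wait_sketch(100));
	return 0;
}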
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
> Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
> 
> With the full series, the time to complete the tests is reduced by 30%.
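Reading that off the elapsed-time row above: 3765.39s / 5733.48s ~ 0.66 for
the full series (3660.33s / 5733.48s ~ 0.64 for lowlumpy alone), i.e. roughly
a third less wall-clock time.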
> 
> PPC64 STRESS-HIGHALLOC
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
> Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
> Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
> At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
> 
> Success rates there are *way* up, particularly considering that the 16MB
> huge pages on PPC64 mean that it's always much harder to allocate them.
> 
> FTrace Reclaim Statistics: vmscan
>                stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims                                499        505        564        509
> Direct reclaim pages scanned                223478      41898      51818      45605
> Direct reclaim pages reclaimed              137730      21148      27161      23455
> Direct reclaim write file async I/O            399        136        162        136
> Direct reclaim write anon async I/O          46977       2865       4686       3998
> Direct reclaim write file sync I/O              29          0          1          3
> Direct reclaim write anon sync I/O           31023        159        237        239
> Wake kswapd requests                           420        351        360        326
> Kswapd wakeups                                 185        294        249        277
> Kswapd pages scanned                      15703488   16392500   17821724   17598737
> Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
> Kswapd reclaim write file async I/O         159938      18400      18717      13473
> Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
> Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
> 
> Total pages scanned                       15926966  16434398  17873542  17644342
> Total pages reclaimed                      5946196   2930006   3166547   3168890
> %age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
> %age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
> %age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
> Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
> Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
> 
> While the scanning rates are slightly up, the scanned/reclaimed and
> scanned/written figures are much improved. The time spent in direct reclaim
> and the time kswapd spends awake are massively reduced, mostly by the
> lowlumpy patches.
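A couple of the columns above make that concrete:

  Time stalled direct reclaim:  9665.35s -> 1707.81s with lowlumpy alone
  Time kswapd awake:            9401.21s -> 1367.86s with lowlumpy alone

with the waitcongest/waitwriteback patches changing those figures only slightly.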
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited               725        303        126          3
> Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
> Direct full   congest     waited               487        190         52          3
> Direct number conditional waited                 0          0        200        301
> Direct time   conditional waited               0ms        0ms        0ms     1904ms
> Direct full   conditional waited                 0          0          0         19
> KSwapd number congest     waited                 0          2         23          4
> KSwapd time   congest     waited               0ms      200ms      420ms      404ms
> KSwapd full   congest     waited                 0          2          2          4
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> 
> 
> Not as dramatic a story here, but the time spent asleep is reduced and we
> can still see that wait_iff_congested() will sleep when necessary.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
> Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
> 
> The time to complete this test goes way down. With the full series, we are
> allocating over twice the number of huge pages in 30% of the time and there
> is a corresponding impact on the allocation latency graph, available at:
> 
> http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
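As a rough check of that claim against the tables above:

  Pass 1 success:  43% vs 17%             ~ 2.5x as many huge pages allocated
  elapsed time:    3229.85s / 10842.07s   ~ 30% of the original run time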
> 
> I think this series is ready for much wider testing. The lowlumpy patches in
> particular should be relatively uncontroversial. While their largest impact
> can be seen in the high-order stress tests, they would also have an impact
> if SLUB were configured (these tests are based on slab), and stalls in lumpy
> reclaim could be partially responsible for some desktop stalling reports.
> 
> The congestion_wait avoidance stuff was controversial in v1 because the
> heuristic used to avoid the wait was a bit shaky. I'm expecting that this
> version is more predictable.
> 
>   .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++-
>   include/linux/backing-dev.h                        |    2 +-
>   include/linux/mmzone.h                             |    8 +
>   include/trace/events/vmscan.h                      |   44 ++++-
>   include/trace/events/writeback.h                   |   35 +++
>   mm/backing-dev.c                                   |   66 ++++++-
>   mm/page_alloc.c                                    |    4 +-
>   mm/vmscan.c                                        |  226 ++++++++++++++------
>   8 files changed, 341 insertions(+), 83 deletions(-)
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
@ 2010-10-14 15:28   ` Christian Ehrhardt
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Ehrhardt @ 2010-10-14 15:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

Seeing the patches Mel sent a few weeks ago, I realized that this series might be at least partially related to my reports in 1Q 2010, so I ran my test case on a few kernels to provide you with some more backing data.

Results are always the average of three iozone runs, as iozone is known to be somewhat noisy - especially when affected by the issue I am trying to show here.
As discussed in detail in older threads, the setup uses 16 disks and scales the number of concurrent iozone processes.
Processes are evenly distributed so that there is always one process per disk.
In the past we reported a 40% to 80% degradation for the sequential read case based on 2.6.32, which can still be seen.
What we found was that page cache allocations with the GFP_COLD flag loop for a long time between try_to_free, get_page and reclaim, because reclaim makes some progress; due to that, GFP_COLD allocations can loop and retry.
In addition, my case had no writes at all, which forced congestion_wait() to wait the full timeout every time.

Kernel (git)                   4          8         16   deviation #16 case                           comment
linux-2.6.30              902694    1396073    1892624                 base                              base
linux-2.6.32              752008     990425     932938               -50.7%     impact as reported in 1Q 2010
linux-2.6.35               63532      71573      64083               -96.6%                    got even worse
linux-2.6.35.6            176485     174442     212102               -88.8%  fixes useful, but still far away
linux-2.6.36-rc4-trace    119683     188997     187012               -90.1%                         still bad 
linux-2.6.36-rc4-fix      884431    1114073    1470659               -22.3%            Mels fixes help a lot!
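For clarity, the "deviation #16 case" column is the 16-process throughput
relative to the 2.6.30 baseline, e.g.:

  linux-2.6.32:           (932938  - 1892624) / 1892624 ~ -50.7%
  linux-2.6.36-rc4-fix:   (1470659 - 1892624) / 1892624 ~ -22.3%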

So much for the test case I used when I reported the issue earlier this year.
The short summary is that the patch series from Mel helps a lot for my test case.

So I guess, Mel, you now want some traces of the last two cases, right?
Could you give me some minimal advice on what exactly you need and how you want it captured?

In addition it worked really well, so you can add both of these, however you like:
Reported-by: <ehrhardt@linux.vnet.ibm.com>
Tested-by: <ehrhardt@linux.vnet.ibm.com>

Note: it might be worth mentioning that the write case has improved a lot since 2.6.30.
That is not directly related to the read degradations, but the gains are up to 150% (write) and 272% (rewrite).
So not everything is bad :-)

Any further comments or questions?

Christian

On 09/15/2010 02:27 PM, Mel Gorman wrote:
> This is v2 of a series to reduce some of the latencies seen in page reclaim
> and to improve the efficiency a bit.  There are a number of changes in this
> revision. The first is to drop the patches avoiding writeback from direct
> reclaim again. Wu asked me to look at a large number of his patches and I felt
> it was best to do that independent of this series which should be relatively
> uncontroversial. The second big change is to wait_iff_congested(). There
> were a few complaints that the avoidance heuristic was way too fuzzy and
> so I tried following Andrew's suggestion to take note of the return value
> of bdi_write_congested() in may_write_to_queue() to identify when a zone
> is congested.
> 
> Changelog since V2
>    o Reshuffle patches to order from least to most controversial
>    o Drop the patches dealing with writeback avoidance. Wu is working
>      on some patches that potentially collide with this area so it
>      will be revisited later
>    o Use BDI congestion feedback in wait_iff_congested() instead of
>      making a determination based on number of pages currently being
>      written back
>    o Do not use lock_page in pageout path
>    o Rebase to 2.6.36-rc4
> 
> Changelog since V1
>    o Fix mis-named function in documentation
>    o Added reviewed and acked bys
> 
> There have been numerous reports of stalls that pointed at the problem being
> somewhere in the VM. There are multiple roots to the problems which means
> dealing with any of the root problems in isolation is tricky to justify on
> their own and they would still need integration testing. This patch series
> puts together two different patch sets which in combination should tackle
> some of the root causes of latency problems being reported.
> 
> Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
> most important results is being able to calculate the scanning/reclaim
> ratio as a measure of the amount of work being done by page reclaim.
> 
> Patch 2 accounts for time spent in congestion_wait.
> 
> Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
> this series. It has been noted that lumpy reclaim is far too aggressive and
> trashes the system somewhat. As SLUB uses high-order allocations, a large
> cost incurred by lumpy reclaim will be noticeable. It was also reported
> during transparent hugepage support testing that lumpy reclaim was trashing
> the system and these patches should mitigate that problem without disabling
> lumpy reclaim.
> 
> Patch 7 adds wait_iff_congested() and replaces some callers of congestion_wait().
> wait_iff_congested() only sleeps if there is a BDI that is currently congested.
> 
> Patch 8 notes that any BDI being congested is not necessarily a problem
> because there could be multiple BDIs of varying speeds and numberous zones. It
> attempts to track when a zone being reclaimed contains many pages backed
> by a congested BDI and if so, reclaimers wait on the congestion queue.
> 
> I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
> machine had 3G of RAM and the CPUs were
> 
> X86:    Intel P4 2-core
> X86-64: AMD Phenom 4-core
> PPC64:  PPC970MP
> 
> Each used a single disk and the onboard IO controller. Dirty ratio was left
> at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to
> keep this report short. Four kernels were tested each based on v2.6.36-rc4
> 
> traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
> lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
> waitcongest-v2r3:   Patches 1-7 to only wait on congestion
> waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
> 
> nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
> nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
> 
> The tests run were as follows
> 
> kernbench
> 	compile-based benchmark. Smoke test performance
> 
> sysbench
> 	OLTP read-only benchmark. Will be re-run in the future as read-write
> 
> micro-mapped-file-stream
> 	This is a micro-benchmark from Johannes Weiner that accesses a
> 	large sparse-file through mmap(). It was configured to run in only
> 	single-CPU mode but can be indicative of how well page reclaim
> 	identifies suitable pages.
> 
> stress-highalloc
> 	Tries to allocate huge pages under heavy load.
> 
> kernbench, iozone and sysbench did not report any performance regression
> on any machine. sysbench did pressure the system lightly and there was reclaim
> activity, but there were no differences of major interest between the kernels.
> 
> X86-64 micro-mapped-file-stream
> 
>                                        traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
> pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
> pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
> pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
> pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
> pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
> pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
> pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
> pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
> pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
> allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
> 
> These are based on the raw figures taken from /proc/vmstat. It's a rough
> measure of reclaim activity. Note that allocstall counts are higher because
> we are entering direct reclaim more often as a result of not sleeping in
> congestion. In itself, it's not necessarily a bad thing. It's easier to
> get a view of what happened from the vmscan tracepoint report.
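As a cross-check, the allocstall counter from /proc/vmstat above matches the
"Direct reclaims" row in the FTrace summary that follows:

  allocstall (/proc/vmstat):   443    273    513   1568
  Direct reclaims (ftrace):    443    273    513   1568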
> 
> FTrace Reclaim Statistics: vmscan
> 
>                                  traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
> Direct reclaims                                443        273        513       1568
> Direct reclaim pages scanned                305968     280402     600825     957933
> Direct reclaim pages reclaimed               43503      19005      30327     117191
> Direct reclaim write file async I/O              0          0          0          0
> Direct reclaim write anon async I/O              0          3          4         12
> Direct reclaim write file sync I/O               0          0          0          0
> Direct reclaim write anon sync I/O               0          0          0          0
> Wake kswapd requests                        187649     132338     191695     267701
> Kswapd wakeups                                   3          1          4          1
> Kswapd pages scanned                       4599269    4454162    4296815    3891906
> Kswapd pages reclaimed                     2295947    2428434    2399818    2319706
> Kswapd reclaim write file async I/O              1          0          1          1
> Kswapd reclaim write anon async I/O             59        187         41        222
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
> Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
> 
> Total pages scanned                        4905237   4734564   4897640   4849839
> Total pages reclaimed                      2339450   2447439   2430145   2436897
> %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
> %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
> Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
> 
> What is interesting here for nocongest in particular is that while direct
> reclaim scans more pages, the overall number of pages scanned remains the
> same and the ratio of pages scanned to pages reclaimed is more or less the
> same. In other words, while we are sleeping less, reclaim is not doing more
> work; and as direct reclaim and kswapd are awake for less time, they appear
> to be doing less work overall.
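For example, the scanned/reclaimed percentage in the summary above is simply
pages reclaimed over pages scanned: 2339450 / 4905237 ~ 47.69% for the
traceonly column.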
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                87        196         64          0
> Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
> Direct full   congest     waited                72        145         53          0
> Direct number conditional waited                 0          0        324       1315
> Direct time   conditional waited               0ms        0ms        0ms        0ms
> Direct full   conditional waited                 0          0          0          0
> KSwapd number congest     waited                20         10         15          7
> KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
> KSwapd full   congest     waited                10          4          6          2
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> 
> The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
> all asleep with the patches.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
> Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
> 
> Overall, the tests completed faster. It is interesting to note that backing off further
> when a zone is congested and not just a BDI was more efficient overall.
> 
> PPC64 micro-mapped-file-stream
> pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
> pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
> pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
> pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
> pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
> pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
> allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
> 
> Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
> 
> FTrace Reclaim Statistics: vmscan
> Direct reclaims                                977       2709       2098       5136
> Direct reclaim pages scanned                629825     963814    1063938    1711935
> Direct reclaim pages reclaimed               75550     242538     150904     387647
> Direct reclaim write file async I/O              0          0          0          2
> Direct reclaim write anon async I/O              0         10          0          4
> Direct reclaim write file sync I/O               0          0          0          0
> Direct reclaim write anon sync I/O               0          0          0          0
> Wake kswapd requests                        392119    1201712     571935     571921
> Kswapd wakeups                                   3          2          3          3
> Kswapd pages scanned                       4601307    4128076    3912317    3377165
> Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
> Kswapd reclaim write file async I/O             20          1          1          1
> Kswapd reclaim write anon async I/O             57        132         11        121
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
> Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
> 
> Total pages scanned                        5231132   5091890   4976255   5089100
> Total pages reclaimed                      2508073   2561335   2463577   2532263
> %age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
> %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
> Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
> 
> Again, a similar trend: the congestion_wait changes mean that direct reclaim
> scans more pages, but the overall number of pages scanned, while slightly
> reduced, is very similar. The ratio of scanning/reclaimed remains roughly
> similar. The downside is that kswapd and direct reclaim were awake longer
> and for a larger percentage of the overall workload. It's possible there
> were big differences in the amount of time spent reclaiming slab pages
> between the different kernels, which is plausible considering that the
> micro test runs after fsmark and sysbench.
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited               845       1312        104          0
> Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
> Direct full   congest     waited               745       1105         72          0
> Direct number conditional waited                 0          0       1322       2935
> Direct time   conditional waited               0ms        0ms       12ms      312ms
> Direct full   conditional waited                 0          0          0          3
> KSwapd number congest     waited                39        102         75         63
> KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
> KSwapd full   congest     waited                20         48         46         25
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> 
> The vanilla kernel spent 20 seconds asleep in direct reclaim and only 312ms
> asleep with the patches. The time kswapd spent waiting on congestion was
> also reduced by a large factor.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
> Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
> 
> With all patches applied, the completion times are very similar.
> 
> 
> X86-64 STRESS-HIGHALLOC
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
> Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
> Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
> At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
> 
> Success figures across the board are broadly similar.
> 
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
> Direct reclaims                               1045        944        886        887
> Direct reclaim pages scanned                135091     119604     109382     101019
> Direct reclaim pages reclaimed               88599      47535      47863      46671
> Direct reclaim write file async I/O            494        283        465        280
> Direct reclaim write anon async I/O          29357      13710      16656      13462
> Direct reclaim write file sync I/O             154          2          2          3
> Direct reclaim write anon sync I/O           14594        571        509        561
> Wake kswapd requests                          7491        933        872        892
> Kswapd wakeups                                 814        778        731        780
> Kswapd pages scanned                       7290822   15341158   11916436   13703442
> Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
> Kswapd reclaim write file async I/O          91975      32317      28022      29628
> Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
> Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
> 
> Total pages scanned                        7425913  15460762  12025818  13804461
> Total pages reclaimed                      3675935   3190031   3142255   3233822
> %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
> %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
> %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
> Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
> Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
> 
> Scanned/reclaimed ratios again look good with big improvements in
> efficiency. The scanned/written ratios also look much improved. With a
> better scanned/written ratio, there is an expectation that IO would be more
> efficient and indeed, the time spent in direct reclaim is much reduced by
> the full series and kswapd spends a little less time awake.
> 
> Overall, the indications here are that allocations were happening much
> faster, and this can be seen in a graph of the latency figures taken as
> the allocations were taking place:
> http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited              1333        204        169          4
> Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
> Direct full   congest     waited               756         92         69          2
> Direct number conditional waited                 0          0         26        186
> Direct time   conditional waited               0ms        0ms        0ms     2504ms
> Direct full   conditional waited                 0          0          0         25
> KSwapd number congest     waited                 4        395        227        282
> KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
> KSwapd full   congest     waited                 3        232         98        176
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> KSwapd full   conditional waited               318          0        312          9
> 
> 
> Overall, the time spent sleeping is reduced. kswapd is still hitting
> congestion_wait() but that is because there are callers remaining where it
> wasn't clear in advance if they should be changed to wait_iff_congested()
> or not. Overall the sleep times are reduced though - from 79-ish seconds to
> about 19.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
> Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
> 
> With the full series, the time to complete the tests is reduced by 30%.
> 
> PPC64 STRESS-HIGHALLOC
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
> Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
> Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
> At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
> 
> Success rates there are *way* up, particularly considering that the 16MB
> huge pages on PPC64 mean it is always much harder to allocate them.
> 
> FTrace Reclaim Statistics: vmscan
>                stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
>                  traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3  waitwriteback-v2r4
> Direct reclaims                                499        505        564        509
> Direct reclaim pages scanned                223478      41898      51818      45605
> Direct reclaim pages reclaimed              137730      21148      27161      23455
> Direct reclaim write file async I/O            399        136        162        136
> Direct reclaim write anon async I/O          46977       2865       4686       3998
> Direct reclaim write file sync I/O              29          0          1          3
> Direct reclaim write anon sync I/O           31023        159        237        239
> Wake kswapd requests                           420        351        360        326
> Kswapd wakeups                                 185        294        249        277
> Kswapd pages scanned                      15703488   16392500   17821724   17598737
> Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
> Kswapd reclaim write file async I/O         159938      18400      18717      13473
> Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
> Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
> 
> Total pages scanned                       15926966  16434398  17873542  17644342
> Total pages reclaimed                      5946196   2930006   3166547   3168890
> %age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
> %age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
> %age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
> Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
> Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
> 
> While the scanning rates are slightly up, the scanned/reclaimed and
> scanned/written figures are much improved. The time spent in direct reclaim
> and the time kswapd is awake are massively reduced, mostly by the lowlumpy patches.
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited               725        303        126          3
> Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
> Direct full   congest     waited               487        190         52          3
> Direct number conditional waited                 0          0        200        301
> Direct time   conditional waited               0ms        0ms        0ms     1904ms
> Direct full   conditional waited                 0          0          0         19
> KSwapd number congest     waited                 0          2         23          4
> KSwapd time   congest     waited               0ms      200ms      420ms      404ms
> KSwapd full   congest     waited                 0          2          2          4
> KSwapd number conditional waited                 0          0          0          0
> KSwapd time   conditional waited               0ms        0ms        0ms        0ms
> KSwapd full   conditional waited                 0          0          0          0
> 
> 
> Not as dramatic a story here but the time spent asleep is reduced and we can
> still see that wait_iff_congested is going to sleep when necessary.
> 
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
> Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
> 
> The time to complete this test goes way down. With the full series, we are
> allocating over twice the number of huge pages in 30% of the time and there
> is a corresponding impact on the allocation latency graph, available at:
> 
> http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
> 
> I think this series is ready for much wider testing. The lowlumpy patches in
> particular should be relatively uncontroversial. While their largest impact
> can be seen in the high order stress tests, they would also have an impact
> if SLUB was configured (these tests are based on slab) and stalls in lumpy
> reclaim could be partially responsible for some desktop stalling reports.
> 
> The congestion_wait avoidance stuff was controversial in v1 because the
> heuristic used to avoid the wait was a bit shaky. I'm expecting that this
> version is more predictable.
> 
>   .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++-
>   include/linux/backing-dev.h                        |    2 +-
>   include/linux/mmzone.h                             |    8 +
>   include/trace/events/vmscan.h                      |   44 ++++-
>   include/trace/events/writeback.h                   |   35 +++
>   mm/backing-dev.c                                   |   66 ++++++-
>   mm/page_alloc.c                                    |    4 +-
>   mm/vmscan.c                                        |  226 ++++++++++++++------
>   8 files changed, 341 insertions(+), 83 deletions(-)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-10-14 15:28   ` Christian Ehrhardt
@ 2010-10-18 13:55     ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-10-18 13:55 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Thu, Oct 14, 2010 at 05:28:33PM +0200, Christian Ehrhardt wrote:

> Seeing the patches Mel sent a few weeks ago I realized that this series
> might be at least partially related to my reports in 1Q 2010 - so I ran my
> testcase on a few kernels to provide you with some more backing data.

Thanks very much for revisiting this.

> Results are always the average of three iozone runs, as iozone is known to be somewhat noisy - especially when affected by the issue I am trying to show here.
> As discussed in detail in older threads, the setup uses 16 disks and scales the number of concurrent iozone processes.
> Processes are evenly distributed so that it is always one process per disk.
> In the past we reported 40% to 80% degradation for the sequential read case based on 2.6.32, which can still be seen.
> What we found was that page cache allocations with the GFP_COLD flag loop for a long time between try_to_free, get_page and reclaim; because freeing makes some progress, the GFP_COLD allocations keep looping and retrying.
> In addition my case had no writes at all, which forced congestion_wait to wait the full timeout every time.
> 
> Kernel (git)                   4          8         16   deviation #16 case                           comment
> linux-2.6.30              902694    1396073    1892624                 base                              base
> linux-2.6.32              752008     990425     932938               -50.7%     impact as reported in 1Q 2010
> linux-2.6.35               63532      71573      64083               -96.6%                    got even worse
> linux-2.6.35.6            176485     174442     212102               -88.8%  fixes useful, but still far away
> linux-2.6.36-rc4-trace    119683     188997     187012               -90.1%                         still bad 

FWIW, I wouldn't expect the trace kernel to help. It's only adding the
markers but not doing anything useful with them.

> linux-2.6.36-rc4-fix      884431    1114073    1470659               -22.3%            Mels fixes help a lot!
> 
> So much from the case that I used when I reported the issue earlier this year.
> The short summary is that the patch series from Mel helps a lot for my test case.
> 

This is good to hear. We're going in the right direction at least.

> So I guess Mel you now want some traces of the last two cases right?
> Could you give me some minimal advice what/how you would exactly need.
> 

Yes please. Please do something like the following before the test

mount -t debugfs none /sys/kernel/debug
echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_congestion_wait/enable
echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_wait_iff_congested/enable
cat /sys/kernel/debug/tracing/trace_pipe > trace.log &

rerun the test, gzip trace.log and drop it on some publicly accessible
webserver. I can rerun the analysis scripts and see if something odd
falls out.
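
For completeness, stopping the trace and generating a report locally, should
you want to eyeball it yourself, would look roughly like the following. The
post-processing invocation is a sketch from memory (the script lives in
Documentation/trace/postprocess/ in the patched tree and I'm assuming it
reads the raw log on stdin), so adjust as needed:

echo 0 > /sys/kernel/debug/tracing/events/vmscan/enable
echo 0 > /sys/kernel/debug/tracing/events/writeback/writeback_congestion_wait/enable
echo 0 > /sys/kernel/debug/tracing/events/writeback/writeback_wait_iff_congested/enable
kill %1     # stop the backgrounded cat of trace_pipe (adjust the job number if needed)
gzip -9 trace.log
zcat trace.log.gz | perl Documentation/trace/postprocess/trace-vmscan-postprocess.pl > vmscan-report.txt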

> In addition it worked really well, so you can add both, however you like.
> Reported-by: <ehrhardt@linux.vnet.ibm.com>
> Tested-by: <ehrhardt@linux.vnet.ibm.com>
> 
> Note: it might be worth mentioning that the write case has improved a lot since 2.6.30.
> This is not directly related to the read degradations, but the improvements are up to 150% (write) and 272% (rewrite).
> Therefore not everything is bad :-) 
> 

Every cloud has a silver lining I guess :)

> Any further comments or questions?
> 

The log might help me further in figuring out how and why we are losing
time. When/if the patches move from -mm to mainline, it'd also be worth
retesting as there is some churn in this area and we need to know whether
we are heading in the right direction or not. If all goes according to plan,
kernel 2.6.37-rc1 will be of interest. Thanks again.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-10-18 13:55     ` Mel Gorman
  (?)
@ 2010-10-22 12:29       ` Christian Ehrhardt
  -1 siblings, 0 replies; 59+ messages in thread
From: Christian Ehrhardt @ 2010-10-22 12:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro



On 10/18/2010 03:55 PM, Mel Gorman wrote:
> On Thu, Oct 14, 2010 at 05:28:33PM +0200, Christian Ehrhardt wrote:
[...]
>>
>> So much from the case that I used when I reported the issue earlier this year.
>> The short summary is that the patch series from Mel helps a lot for my test case.
>>
> 
> This is good to hear. We're going in the right direction at least.
> 
>> So I guess Mel you now want some traces of the last two cases right?
>> Could you give me some minimal advice what/how you would exactly need.
>>
> 
> Yes please. Please do something like the following before the test
> 
> mount -t debugfs none /sys/kernel/debug
> echo 1 > /sys/kernel/debug/tracing/events/vmscan/enable
> echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_congestion_wait/enable
> echo 1 > /sys/kernel/debug/tracing/events/writeback/writeback_wait_iff_congested/enable
> cat /sys/kernel/debug/tracing/trace_pipe > trace.log &
> 
> rerun the test, gzip trace.log and drop it on some publicly accessible
> webserver. I can rerun the analysis scripts and see if something odd
> falls out.
> 

I ran my sequential read load with a triple sync, an "echo 3 > drop_caches"
and some sleeps in advance. That way I hope you can see the ramp-up towards
the problem in the log, as everything we know from the past suggests that
it isn't a problem as long as there are free or easy-to-free pages around.
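
To be concrete, the preparation before each measured run was along these
lines; the sleep length here is illustrative rather than the exact value
I used:

sync; sync; sync
echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
sleep 30                             # let things settle before iozone starts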

The "writeback_wait_iff_congested" trace comes in with one of the
later patches so you can only find it in the log for the -fix kernel.
To be sure I activated all events of writeback (they don't seem to
add too much events - vmscan causes the majority).
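
One way to do that, assuming the usual debugfs mount point, is to enable
the whole writeback group instead of the two individual events:

echo 1 > /sys/kernel/debug/tracing/events/writeback/enable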

I only traced the 16 thread case; raw performance while taking the
logs was still roughly what it was without tracing (ftp access as
user "anonymous" - no password - should work):
                                 TP          Log-size     ftp-access
2.6.36-rc4-trace           179 mb/s             892mb     ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-trace.log.bz2
2.6.36-rc4-fix            1630 mb/s             229mb     ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-fix.log.bz2

You can find the bzipped full log files at:
2.6.36-rc4-trace          ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-trace.log.bz2
2.6.36-rc4-fix            ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-fix.log.bz2

I used the post-processing script that was patched within your
series; this should easily give everyone a good overview (the
differences are huge). But I don't know if my scripts are really
up-to-date, so it is up to you to decide whether the following is
really valid (I also found nothing about the *iff* stuff in the
script, so you might want the full log anyway):

## WITHOUT-FIXES 2.6.36-rc4-trace ##
Process             Direct     Wokeup      Pages    Pages     Pages    Pages     Time
details              Rclms     Kswapd    Scanned   Rclmed   Sync-IO ASync-IO  Stalled
iozone-28292         13654     459886     844139   453638         0       20  159.156      direct-0=13654        wakeup-0=459884 wakeup-1=2
iozone-28300         13071     436052     818191   434998         0        6  159.932      direct-0=13071        wakeup-0=436051 wakeup-1=1
iozone-28303         13813     464730     858740   459634         0        6  159.152      direct-0=13813        wakeup-0=464730
iozone-28295         12824     428748     826281   427246         0       25  159.488      direct-0=12824        wakeup-0=428748
iozone-28301         13482     452617     849624   448212         0       32  159.240      direct-0=13482        wakeup-0=452614 wakeup-1=3
iozone-28304         13131     443473     833093   437755         0       17  159.409      direct-0=13131        wakeup-0=443472 wakeup-1=1
iozone-28305         13628     458115     852889   453645         0        0  159.700      direct-0=13628        wakeup-0=458113 wakeup-1=2
iozone-28291         13625     460635     847770   453657         0        0  159.553      direct-0=13625        wakeup-0=460634 wakeup-1=1
iozone-28297         13103     439959     847125   436743         0       44  159.698      direct-0=13103        wakeup-0=439959
iozone-28302         11991     399591     797354   400234         0        0  160.685      direct-0=11991        wakeup-0=399590 wakeup-1=1
iozone-28296         13085     437466     821684   436628         0        7  159.446      direct-0=13085        wakeup-0=437466
iozone-28294         14028     471795     858038   466738         0        8  159.403      direct-0=14028        wakeup-0=471793 wakeup-1=2
iozone-28298         14216     477065     860224   473428         0        9  158.943      direct-0=14216        wakeup-0=477060 wakeup-1=5
iozone-28299         13354     449048     858721   445392         0        4  159.905      direct-0=13354        wakeup-0=449048
iozone-28293         13554     456445     855633   451410         0       31  159.418      direct-0=13554        wakeup-0=456441 wakeup-1=4
iozone-28290         14664     488925     893139   488442         0        5  158.800      direct-0=14664        wakeup-0=488921 wakeup-1=4
rpcbind-605             45        542       5009     1464         0        0    1.056      direct-0=45           wakeup-0=542
crond-774               11        138        636      414         0        0    0.203      direct-0=11           wakeup-0=138
kthreadd-2               2          2         64       64         0        0    0.000      direct-1=1 direct-2=1 wakeup-1=1 wakeup-2=1
cat-28278             1117       5046     220362    39158         0        0   67.623      direct-0=1117         wakeup-0=5046
sendmail-758           211       6665      33016     7353         0        0    9.436      direct-0=211          wakeup-0=6665
netcat-28279           145       1709      39559     5288         0        0   11.772      direct-0=145          wakeup-0=1709

Kswapd              Kswapd      Order      Pages      Pages    Pages    Pages     
Instance           Wakeups  Re-wakeup    Scanned     Rclmed  Sync-IO ASync-IO
kswapd0-40              31     267142    9687398  1017640         0     2173      wake-0=30 wake-2=1       rewake-0=267128 rewake-1=13 rewake-2=1

Summary
Direct reclaims:                        216754
Direct reclaim pages scanned:           13821291
Direct reclaim pages reclaimed:         7221541
Direct reclaim write file sync I/O:     0
Direct reclaim write anon sync I/O:     0
Direct reclaim write file async I/O:    0
Direct reclaim write anon async I/O:    214
Wake kswapd requests:                   7238652
Time stalled direct reclaim:            2642.02 seconds

Kswapd wakeups:                         31
Kswapd pages scanned:                   9687398
Kswapd pages reclaimed:                 1017640
Kswapd reclaim write file sync I/O:     0
Kswapd reclaim write anon sync I/O:     0
Kswapd reclaim write file async I/O:    0
Kswapd reclaim write anon async I/O:    2173
Time kswapd awake:                      170.15 seconds

## WITH-FIXES 2.6.36-rc4-fix ##
Process             Direct     Wokeup      Pages    Pages     Pages    Pages     Time
details              Rclms     Kswapd    Scanned   Rclmed   Sync-IO ASync-IO  Stalled
iozone-28116          2948      93766     277563    99026         0       41    2.622      direct-0=2948         wakeup-0=93766
iozone-28122          2852      90519     263432    95304         0       15    2.487      direct-0=2852         wakeup-0=90519
iozone-28126          3082     101045     276212   103204         0        7    2.191      direct-0=3082         wakeup-0=101045
iozone-28114          2875      92733     271584    96677         0        5    3.031      direct-0=2875         wakeup-0=92733
iozone-28118          2715      88316     255099    90875         0        2    2.247      direct-0=2715         wakeup-0=88316
iozone-28111          2967      95493     273437    98998         0        0    2.363      direct-0=2967         wakeup-0=95493
iozone-28123          3153     101812     255698   105400         0       25    2.865      direct-0=3153         wakeup-0=101812
iozone-28112          3062     100341     283059   102653         0        4    2.560      direct-0=3062         wakeup-0=100341
iozone-28115          2738      88916     255389    91634         0       14    3.106      direct-0=2738         wakeup-0=88916
iozone-28121          3201     103626     276337   107378         0        0    3.265      direct-0=3201         wakeup-0=103626
iozone-28119          3147     102094     307378   105165         0        0    3.159      direct-0=3147         wakeup-0=102094
iozone-28125          3032      98644     282571   101666         0       12    2.257      direct-0=3032         wakeup-0=98644
iozone-28124          3075     100182     292561   103107         0       12    2.419      direct-0=3075         wakeup-0=100182
iozone-28120          2809      90570     273207    94067         0        7    2.565      direct-0=2809         wakeup-0=90570
iozone-28117          2813      89807     252515    93916         0        0    2.884      direct-0=2813         wakeup-0=89807
iozone-28113          2711      87677     253710    90648         0       18    2.537      direct-0=2711         wakeup-0=87677
sendmail-758            13        442       1915      499         0        0    0.011      direct-0=13           wakeup-0=442
netcat-28100            44        331       4554     1549         0        0    0.507      direct-0=44           wakeup-0=331
cat-28099              141        513      35986     5085         0       39    0.702      direct-0=141          wakeup-0=513
bash-816                 1        173         32       32         0        0    0.000      direct-0=1            wakeup-0=173

Kswapd              Kswapd      Order      Pages      Pages    Pages    Pages
Instance           Wakeups  Re-wakeup    Scanned     Rclmed  Sync-IO ASync-IO
kswapd0-45               2     617968      33692     8905         0        3      wake-0=2       rewake-0=617968

Summary
Direct reclaims:                        47379
Direct reclaim pages scanned:           4392239
Direct reclaim pages reclaimed:         1586883
Direct reclaim write file sync I/O:     0
Direct reclaim write anon sync I/O:     0
Direct reclaim write file async I/O:    0
Direct reclaim write anon async I/O:    201
Wake kswapd requests:                   1527000
Time stalled direct reclaim:            43.78 seconds

Kswapd wakeups:                         2
Kswapd pages scanned:                   33692
Kswapd pages reclaimed:                 8905
Kswapd reclaim write file sync I/O:     0
Kswapd reclaim write anon sync I/O:     0
Kswapd reclaim write file async I/O:    0
Kswapd reclaim write anon async I/O:    3
Time kswapd awake:                      22.35 seconds
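
For what it is worth, the reclaimed/scanned ratio for direct reclaim can be
derived straight from the two summaries above; the numbers are just the ones
already listed:

echo "scale=3; 7221541 / 13821291" | bc    # 2.6.36-rc4-trace: ~.522
echo "scale=3; 1586883 / 4392239" | bc     # 2.6.36-rc4-fix:   ~.361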

[...]
>>
> 
> The log might help me further in figuring out how and why we are losing
> time. When/if the patches move from -mm to mainline, it'd also be worth
> retesting as there is some churn in this area and we need to know whether
> we are heading in the right direction or not. If all goes according to plan,
> kernel 2.6.37-rc1 will be of interest. Thanks again.
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
@ 2010-10-22 12:29       ` Christian Ehrhardt
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Ehrhardt @ 2010-10-22 12:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro



On 10/18/2010 03:55 PM, Mel Gorman wrote:
> On Thu, Oct 14, 2010 at 05:28:33PM +0200, Christian Ehrhardt wrote:
[...]
>>
>> So much from the case that I used when I reported the issue earlier this year.
>> The short summary is that the patch series from Mel helps a lot for my test case.
>>
> 
> This is good to hear. We're going in the right direction at least.
> 
>> So I guess Mel you now want some traces of the last two cases right?
>> Could you give me some minimal advice what/how you would exactly need.
>>
> 
> Yes please. Please do something like the following before the test
> 
> mount -t debugfs none /sys/kernel/debug
> echo 1>  /sys/kernel/debug/tracing/events/vmscan/enable
> echo 1>  /sys/kernel/debug/tracing/events/writeback/writeback_congestion_wait/enable
> echo 1>  /sys/kernel/debug/tracing/events/writeback/writeback_wait_iff_congested/enable
> cat /sys/kernel/debug/tracing/trace_pipe>  trace.log&
> 
> rerun the test, gzip trace.log and drop it on some publicly accessible
> webserver. I can rerun the analysis scripts and see if something odd
> falls out.
> 

I ran my sequential read load with triple sync, 3 > drop caches and
some sleeps in advance. Therefore I hope you can see/find some rampup
towards the problem in the log, as all we know from the past suggests
that it isn't a problem as long as there are free or easy-to-free
things around.

The "writeback_wait_iff_congested" trace comes in with one of the
later patches so you can only find it in the log for the -fix kernel.
To be sure I activated all events of writeback (they don't seem to
add too much events - vmscan causes the majority).

I only traced the 16 thread case and raw performance when taking the
logs was still roughly as it appeared without tracing (ftp access as
user "anonymous" - no pw - should ):
                                 TP          Log-size     ftp-access
2.6.36-rc4-trace           179 mb/s             892mb     ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-trace.log.bz2
2.6.36-rc4-fix            1630 mb/s             229mb     ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-fix.log.bz2

You can find the bzipped full log files at:
2.6.36-rc4-trace          ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-trace.log.bz2
2.6.36-rc4-fix            ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-fix.log.bz2

I used the post-processing script that was patched within your
series, this should easily give everyone a good overview (the
differences are huge). But I don't know if my scripts are really
up-to-date - so it is up to you to decide if the following is
really valid (I also found nothing about the *iff* stuff in the
script, so you might want the full log anyway):

## WITHOUT-FIXES 2.6.36-rc4-trace ##
Process             Direct     Wokeup      Pages    Pages     Pages    Pages     Time
details              Rclms     Kswapd    Scanned   Rclmed   Sync-IO ASync-IO  Stalled
iozone-28292         13654     459886     844139   453638         0       20  159.156      direct-0=13654        wakeup-0=459884 wakeup-1=2
iozone-28300         13071     436052     818191   434998         0        6  159.932      direct-0=13071        wakeup-0=436051 wakeup-1=1
iozone-28303         13813     464730     858740   459634         0        6  159.152      direct-0=13813        wakeup-0=464730
iozone-28295         12824     428748     826281   427246         0       25  159.488      direct-0=12824        wakeup-0=428748
iozone-28301         13482     452617     849624   448212         0       32  159.240      direct-0=13482        wakeup-0=452614 wakeup-1=3
iozone-28304         13131     443473     833093   437755         0       17  159.409      direct-0=13131        wakeup-0=443472 wakeup-1=1
iozone-28305         13628     458115     852889   453645         0        0  159.700      direct-0=13628        wakeup-0=458113 wakeup-1=2
iozone-28291         13625     460635     847770   453657         0        0  159.553      direct-0=13625        wakeup-0=460634 wakeup-1=1
iozone-28297         13103     439959     847125   436743         0       44  159.698      direct-0=13103        wakeup-0=439959
iozone-28302         11991     399591     797354   400234         0        0  160.685      direct-0=11991        wakeup-0=399590 wakeup-1=1
iozone-28296         13085     437466     821684   436628         0        7  159.446      direct-0=13085        wakeup-0=437466
iozone-28294         14028     471795     858038   466738         0        8  159.403      direct-0=14028        wakeup-0=471793 wakeup-1=2
iozone-28298         14216     477065     860224   473428         0        9  158.943      direct-0=14216        wakeup-0=477060 wakeup-1=5
iozone-28299         13354     449048     858721   445392         0        4  159.905      direct-0=13354        wakeup-0=449048
iozone-28293         13554     456445     855633   451410         0       31  159.418      direct-0=13554        wakeup-0=456441 wakeup-1=4
iozone-28290         14664     488925     893139   488442         0        5  158.800      direct-0=14664        wakeup-0=488921 wakeup-1=4
rpcbind-605             45        542       5009     1464         0        0    1.056      direct-0=45           wakeup-0=542
crond-774               11        138        636      414         0        0    0.203      direct-0=11           wakeup-0=138
kthreadd-2               2          2         64       64         0        0    0.000      direct-1=1 direct-2=1 wakeup-1=1 wakeup-2=1
cat-28278             1117       5046     220362    39158         0        0   67.623      direct-0=1117         wakeup-0=5046
sendmail-758           211       6665      33016     7353         0        0    9.436      direct-0=211          wakeup-0=6665
netcat-28279           145       1709      39559     5288         0        0   11.772      direct-0=145          wakeup-0=1709

Kswapd              Kswapd      Order      Pages      Pages    Pages    Pages     
Instance           Wakeups  Re-wakeup    Scanned     Rclmed  Sync-IO ASync-IO
kswapd0-40              31     267142    9687398  1017640         0     2173      wake-0=30 wake-2=1       rewake-0=267128 rewake-1=13 rewake-2=1

Summary
Direct reclaims:                        216754
Direct reclaim pages scanned:           13821291
Direct reclaim pages reclaimed:         7221541
Direct reclaim write file sync I/O:     0
Direct reclaim write anon sync I/O:     0
Direct reclaim write file async I/O:    0
Direct reclaim write anon async I/O:    214
Wake kswapd requests:                   7238652
Time stalled direct reclaim:            2642.02 seconds

Kswapd wakeups:                         31
Kswapd pages scanned:                   9687398
Kswapd pages reclaimed:                 1017640
Kswapd reclaim write file sync I/O:     0
Kswapd reclaim write anon sync I/O:     0
Kswapd reclaim write file async I/O:    0
Kswapd reclaim write anon async I/O:    2173
Time kswapd awake:                      170.15 seconds

## WITH-FIXES 2.6.36-rc4-fix ##
Process             Direct     Wokeup      Pages    Pages     Pages    Pages     Time
details              Rclms     Kswapd    Scanned   Rclmed   Sync-IO ASync-IO  Stalled
iozone-28116          2948      93766     277563    99026         0       41    2.622      direct-0=2948         wakeup-0=93766
iozone-28122          2852      90519     263432    95304         0       15    2.487      direct-0=2852         wakeup-0=90519
iozone-28126          3082     101045     276212   103204         0        7    2.191      direct-0=3082         wakeup-0=101045
iozone-28114          2875      92733     271584    96677         0        5    3.031      direct-0=2875         wakeup-0=92733
iozone-28118          2715      88316     255099    90875         0        2    2.247      direct-0=2715         wakeup-0=88316
iozone-28111          2967      95493     273437    98998         0        0    2.363      direct-0=2967         wakeup-0=95493
iozone-28123          3153     101812     255698   105400         0       25    2.865      direct-0=3153         wakeup-0=101812
iozone-28112          3062     100341     283059   102653         0        4    2.560      direct-0=3062         wakeup-0=100341
iozone-28115          2738      88916     255389    91634         0       14    3.106      direct-0=2738         wakeup-0=88916
iozone-28121          3201     103626     276337   107378         0        0    3.265      direct-0=3201         wakeup-0=103626
iozone-28119          3147     102094     307378   105165         0        0    3.159      direct-0=3147         wakeup-0=102094
iozone-28125          3032      98644     282571   101666         0       12    2.257      direct-0=3032         wakeup-0=98644
iozone-28124          3075     100182     292561   103107         0       12    2.419      direct-0=3075         wakeup-0=100182
iozone-28120          2809      90570     273207    94067         0        7    2.565      direct-0=2809         wakeup-0=90570
iozone-28117          2813      89807     252515    93916         0        0    2.884      direct-0=2813         wakeup-0=89807
iozone-28113          2711      87677     253710    90648         0       18    2.537      direct-0=2711         wakeup-0=87677
sendmail-758            13        442       1915      499         0        0    0.011      direct-0=13           wakeup-0=442
netcat-28100            44        331       4554     1549         0        0    0.507      direct-0=44           wakeup-0=331
cat-28099              141        513      35986     5085         0       39    0.702      direct-0=141          wakeup-0=513
bash-816                 1        173         32       32         0        0    0.000      direct-0=1            wakeup-0=173

Kswapd              Kswapd      Order      Pages      Pages    Pages    Pages
Instance           Wakeups  Re-wakeup    Scanned     Rclmed  Sync-IO ASync-IO
kswapd0-45               2     617968      33692     8905         0        3      wake-0=2       rewake-0=617968

Summary
Direct reclaims:                        47379
Direct reclaim pages scanned:           4392239
Direct reclaim pages reclaimed:         1586883
Direct reclaim write file sync I/O:     0
Direct reclaim write anon sync I/O:     0
Direct reclaim write file async I/O:    0
Direct reclaim write anon async I/O:    201
Wake kswapd requests:                   1527000
Time stalled direct reclaim:            43.78 seconds

Kswapd wakeups:                         2
Kswapd pages scanned:                   33692
Kswapd pages reclaimed:                 8905
Kswapd reclaim write file sync I/O:     0
Kswapd reclaim write anon sync I/O:     0
Kswapd reclaim write file async I/O:    0
Kswapd reclaim write anon async I/O:    3
Time kswapd awake:                      22.35 seconds

[...]
>>
> 
> The log might help me further in figuring out how and why we are losing
> time. When/if the patches move from -mm to mainline, it'd also be worth
> retesting as there is some churn in this area and we need to know whether
> we are heading in the right direction or not. If all goes according to plan,
> kernel 2.6.37-rc1 will be of interest. Thanks again.
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
@ 2010-10-22 12:29       ` Christian Ehrhardt
  0 siblings, 0 replies; 59+ messages in thread
From: Christian Ehrhardt @ 2010-10-22 12:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro



On 10/18/2010 03:55 PM, Mel Gorman wrote:
> On Thu, Oct 14, 2010 at 05:28:33PM +0200, Christian Ehrhardt wrote:
[...]
>>
>> So much from the case that I used when I reported the issue earlier this year.
>> The short summary is that the patch series from Mel helps a lot for my test case.
>>
> 
> This is good to hear. We're going in the right direction at least.
> 
>> So I guess Mel you now want some traces of the last two cases right?
>> Could you give me some minimal advice what/how you would exactly need.
>>
> 
> Yes please. Please do something like the following before the test
> 
> mount -t debugfs none /sys/kernel/debug
> echo 1>  /sys/kernel/debug/tracing/events/vmscan/enable
> echo 1>  /sys/kernel/debug/tracing/events/writeback/writeback_congestion_wait/enable
> echo 1>  /sys/kernel/debug/tracing/events/writeback/writeback_wait_iff_congested/enable
> cat /sys/kernel/debug/tracing/trace_pipe>  trace.log&
> 
> rerun the test, gzip trace.log and drop it on some publicly accessible
> webserver. I can rerun the analysis scripts and see if something odd
> falls out.
> 

I ran my sequential read load with triple sync, 3 > drop caches and
some sleeps in advance. Therefore I hope you can see/find some rampup
towards the problem in the log, as all we know from the past suggests
that it isn't a problem as long as there are free or easy-to-free
things around.

The "writeback_wait_iff_congested" trace comes in with one of the
later patches so you can only find it in the log for the -fix kernel.
To be sure I activated all events of writeback (they don't seem to
add too much events - vmscan causes the majority).

I only traced the 16 thread case and raw performance when taking the
logs was still roughly as it appeared without tracing (ftp access as
user "anonymous" - no pw - should ):
                                 TP          Log-size     ftp-access
2.6.36-rc4-trace           179 mb/s             892mb     ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-trace.log.bz2
2.6.36-rc4-fix            1630 mb/s             229mb     ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-fix.log.bz2

You can find the bzipped full log files at:
2.6.36-rc4-trace          ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-trace.log.bz2
2.6.36-rc4-fix            ftp://testcase.boulder.ibm.com/fromibm/linux/iozone-seq-16thr-2.6.36-fix.log.bz2

I used the post-processing script that was patched within your
series, this should easily give everyone a good overview (the
differences are huge). But I don't know if my scripts are really
up-to-date - so it is up to you to decide if the following is
really valid (I also found nothing about the *iff* stuff in the
script, so you might want the full log anyway):

## WITHOUT-FIXES 2.6.36-rc4-trace ##
Process             Direct     Wokeup      Pages    Pages     Pages    Pages     Time
details              Rclms     Kswapd    Scanned   Rclmed   Sync-IO ASync-IO  Stalled
iozone-28292         13654     459886     844139   453638         0       20  159.156      direct-0=13654        wakeup-0=459884 wakeup-1=2
iozone-28300         13071     436052     818191   434998         0        6  159.932      direct-0=13071        wakeup-0=436051 wakeup-1=1
iozone-28303         13813     464730     858740   459634         0        6  159.152      direct-0=13813        wakeup-0=464730
iozone-28295         12824     428748     826281   427246         0       25  159.488      direct-0=12824        wakeup-0=428748
iozone-28301         13482     452617     849624   448212         0       32  159.240      direct-0=13482        wakeup-0=452614 wakeup-1=3
iozone-28304         13131     443473     833093   437755         0       17  159.409      direct-0=13131        wakeup-0=443472 wakeup-1=1
iozone-28305         13628     458115     852889   453645         0        0  159.700      direct-0=13628        wakeup-0=458113 wakeup-1=2
iozone-28291         13625     460635     847770   453657         0        0  159.553      direct-0=13625        wakeup-0=460634 wakeup-1=1
iozone-28297         13103     439959     847125   436743         0       44  159.698      direct-0=13103        wakeup-0=439959
iozone-28302         11991     399591     797354   400234         0        0  160.685      direct-0=11991        wakeup-0=399590 wakeup-1=1
iozone-28296         13085     437466     821684   436628         0        7  159.446      direct-0=13085        wakeup-0=437466
iozone-28294         14028     471795     858038   466738         0        8  159.403      direct-0=14028        wakeup-0=471793 wakeup-1=2
iozone-28298         14216     477065     860224   473428         0        9  158.943      direct-0=14216        wakeup-0=477060 wakeup-1=5
iozone-28299         13354     449048     858721   445392         0        4  159.905      direct-0=13354        wakeup-0=449048
iozone-28293         13554     456445     855633   451410         0       31  159.418      direct-0=13554        wakeup-0=456441 wakeup-1=4
iozone-28290         14664     488925     893139   488442         0        5  158.800      direct-0=14664        wakeup-0=488921 wakeup-1=4
rpcbind-605             45        542       5009     1464         0        0    1.056      direct-0=45           wakeup-0=542
crond-774               11        138        636      414         0        0    0.203      direct-0=11           wakeup-0=138
kthreadd-2               2          2         64       64         0        0    0.000      direct-1=1 direct-2=1 wakeup-1=1 wakeup-2=1
cat-28278             1117       5046     220362    39158         0        0   67.623      direct-0=1117         wakeup-0=5046
sendmail-758           211       6665      33016     7353         0        0    9.436      direct-0=211          wakeup-0=6665
netcat-28279           145       1709      39559     5288         0        0   11.772      direct-0=145          wakeup-0=1709

Kswapd              Kswapd      Order      Pages      Pages    Pages    Pages     
Instance           Wakeups  Re-wakeup    Scanned     Rclmed  Sync-IO ASync-IO
kswapd0-40              31     267142    9687398  1017640         0     2173      wake-0=30 wake-2=1       rewake-0=267128 rewake-1=13 rewake-2=1

Summary
Direct reclaims:                        216754
Direct reclaim pages scanned:           13821291
Direct reclaim pages reclaimed:         7221541
Direct reclaim write file sync I/O:     0
Direct reclaim write anon sync I/O:     0
Direct reclaim write file async I/O:    0
Direct reclaim write anon async I/O:    214
Wake kswapd requests:                   7238652
Time stalled direct reclaim:            2642.02 seconds

Kswapd wakeups:                         31
Kswapd pages scanned:                   9687398
Kswapd pages reclaimed:                 1017640
Kswapd reclaim write file sync I/O:     0
Kswapd reclaim write anon sync I/O:     0
Kswapd reclaim write file async I/O:    0
Kswapd reclaim write anon async I/O:    2173
Time kswapd awake:                      170.15 seconds

## WITH-FIXES 2.6.36-rc4-fix ##
Process             Direct     Wokeup      Pages    Pages     Pages    Pages     Time
details              Rclms     Kswapd    Scanned   Rclmed   Sync-IO ASync-IO  Stalled
iozone-28116          2948      93766     277563    99026         0       41    2.622      direct-0=2948         wakeup-0=93766
iozone-28122          2852      90519     263432    95304         0       15    2.487      direct-0=2852         wakeup-0=90519
iozone-28126          3082     101045     276212   103204         0        7    2.191      direct-0=3082         wakeup-0=101045
iozone-28114          2875      92733     271584    96677         0        5    3.031      direct-0=2875         wakeup-0=92733
iozone-28118          2715      88316     255099    90875         0        2    2.247      direct-0=2715         wakeup-0=88316
iozone-28111          2967      95493     273437    98998         0        0    2.363      direct-0=2967         wakeup-0=95493
iozone-28123          3153     101812     255698   105400         0       25    2.865      direct-0=3153         wakeup-0=101812
iozone-28112          3062     100341     283059   102653         0        4    2.560      direct-0=3062         wakeup-0=100341
iozone-28115          2738      88916     255389    91634         0       14    3.106      direct-0=2738         wakeup-0=88916
iozone-28121          3201     103626     276337   107378         0        0    3.265      direct-0=3201         wakeup-0=103626
iozone-28119          3147     102094     307378   105165         0        0    3.159      direct-0=3147         wakeup-0=102094
iozone-28125          3032      98644     282571   101666         0       12    2.257      direct-0=3032         wakeup-0=98644
iozone-28124          3075     100182     292561   103107         0       12    2.419      direct-0=3075         wakeup-0=100182
iozone-28120          2809      90570     273207    94067         0        7    2.565      direct-0=2809         wakeup-0=90570
iozone-28117          2813      89807     252515    93916         0        0    2.884      direct-0=2813         wakeup-0=89807
iozone-28113          2711      87677     253710    90648         0       18    2.537      direct-0=2711         wakeup-0=87677
sendmail-758            13        442       1915      499         0        0    0.011      direct-0=13           wakeup-0=442
netcat-28100            44        331       4554     1549         0        0    0.507      direct-0=44           wakeup-0=331
cat-28099              141        513      35986     5085         0       39    0.702      direct-0=141          wakeup-0=513
bash-816                 1        173         32       32         0        0    0.000      direct-0=1            wakeup-0=173

Kswapd              Kswapd      Order      Pages      Pages    Pages    Pages
Instance           Wakeups  Re-wakeup    Scanned     Rclmed  Sync-IO ASync-IO
kswapd0-45               2     617968      33692     8905         0        3      wake-0=2       rewake-0=617968

Summary
Direct reclaims:                        47379
Direct reclaim pages scanned:           4392239
Direct reclaim pages reclaimed:         1586883
Direct reclaim write file sync I/O:     0
Direct reclaim write anon sync I/O:     0
Direct reclaim write file async I/O:    0
Direct reclaim write anon async I/O:    201
Wake kswapd requests:                   1527000
Time stalled direct reclaim:            43.78 seconds

Kswapd wakeups:                         2
Kswapd pages scanned:                   33692
Kswapd pages reclaimed:                 8905
Kswapd reclaim write file sync I/O:     0
Kswapd reclaim write anon sync I/O:     0
Kswapd reclaim write file async I/O:    0
Kswapd reclaim write anon async I/O:    3
Time kswapd awake:                      22.35 seconds
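
To make the two summaries above easier to compare, here is a small standalone sketch (an editorial aid, not part of the original report; the constants are copied from the vanilla and WITH-FIXES summaries above) that derives the direct reclaim scan/reclaim ratio and the stall-time reduction:

#include <stdio.h>

int main(void)
{
	/* { pages scanned, pages reclaimed, seconds stalled } for direct reclaim,
	 * copied from the 2.6.36-rc4 and 2.6.36-rc4-fix summaries above */
	const double vanilla_scanned = 13821291, vanilla_reclaimed = 7221541, vanilla_stall = 2642.02;
	const double fixed_scanned   =  4392239, fixed_reclaimed   = 1586883, fixed_stall   =   43.78;

	printf("vanilla: %.1f%% of scanned pages reclaimed, %.2fs stalled\n",
	       100.0 * vanilla_reclaimed / vanilla_scanned, vanilla_stall);
	printf("fixed:   %.1f%% of scanned pages reclaimed, %.2fs stalled\n",
	       100.0 * fixed_reclaimed / fixed_scanned, fixed_stall);
	printf("direct reclaim stall time reduced by %.1f%%\n",
	       100.0 * (1.0 - fixed_stall / vanilla_stall));
	return 0;
}

Compiled with any C compiler and run, it reports the direct reclaim stall time dropping from roughly 2642 seconds to roughly 44 seconds between the two kernels.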

[...]
>>
> 
> The log might help me further in figuring out how and why we are losing
> time. When/if the patches move from -mm to mainline, it'd also be worth
> retesting as there is some churn in this area and we need to know whether
> we are heading in the right direction or not. If all goes according to plan,
> kernel 2.6.37-rc1 will be of interest. Thanks again.
> 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-10-18 13:55     ` Mel Gorman
  (?)
@ 2010-11-03 10:50       ` Christian Ehrhardt
  -1 siblings, 0 replies; 59+ messages in thread
From: Christian Ehrhardt @ 2010-11-03 10:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro



On 10/18/2010 03:55 PM, Mel Gorman wrote:
> On Thu, Oct 14, 2010 at 05:28:33PM +0200, Christian Ehrhardt wrote:
> 
>> Seeing the patches Mel sent a few weeks ago I realized that this series
>> might be at least partially related to my reports in 1Q 2010 - so I ran my
>> testcase on a few kernels to provide you with some more backing data.
> 
> Thanks very much for revisiting this.
> 
>> Results are always the average of three iozone runs as it is known to be somewhat noisy - especially when affected by the issue I try to show here.
>> As discussed in detail in older threads the setup uses 16 disks and scales the number of concurrent iozone processes.
>> Processes are evenly distributed so that it always is one process per disk.
>> In the past we reported 40% to 80% degradation for the sequential read case based on 2.6.32 which can still be seen.
>> What we found was that page cache allocations with the GFP_COLD flag loop for a long time between try_to_free, get_page and reclaim: because freeing makes some progress each pass, the GFP_COLD allocations keep looping and retrying.
>> In addition my case had no writes at all, which forced congestion_wait to wait the full timeout all the time.
>>
>> Kernel (git)                   4          8         16   deviation #16 case                           comment
>> linux-2.6.30              902694    1396073    1892624                 base                              base
>> linux-2.6.32              752008     990425     932938               -50.7%     impact as reported in 1Q 2010
>> linux-2.6.35               63532      71573      64083               -96.6%                    got even worse
>> linux-2.6.35.6            176485     174442     212102               -88.8%  fixes useful, but still far away
>> linux-2.6.36-rc4-trace    119683     188997     187012               -90.1%                         still bad
>> linux-2.6.36-rc4-fix      884431    1114073    1470659               -22.3%            Mels fixes help a lot!
>>
[...]
> If all goes according to plan,
> kernel 2.6.37-rc1 will be of interest. Thanks again.
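
As an illustration of the waiting pattern described in the quoted text above, here is a minimal userspace sketch (an editorial aside, not kernel code; the function names are purely illustrative). With a pure-read workload nothing is congested, so an unconditional wait pays the full timeout on every reclaim pass, while a congestion-conditional wait returns immediately:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for "is any backing device congested right now?".  In the
 * pure-read workload described above nothing is being written back,
 * so this is always false. */
static bool any_bdi_congested(void)
{
	return false;
}

/* Old behaviour: sleep for the full timeout regardless of congestion. */
static void wait_unconditionally(unsigned int timeout_ms)
{
	usleep(timeout_ms * 1000);
}

/* Conditional behaviour: only sleep if something is actually congested,
 * otherwise let the allocator retry immediately. */
static void wait_if_congested(unsigned int timeout_ms)
{
	if (!any_bdi_congested())
		return;
	usleep(timeout_ms * 1000);
}

int main(void)
{
	for (int pass = 0; pass < 10; pass++)
		wait_unconditionally(100);	/* accumulates ~1s of stall */
	for (int pass = 0; pass < 10; pass++)
		wait_if_congested(100);		/* returns immediately each pass */
	puts("done");
	return 0;
}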

Here is a measurement with 2.6.37-rc1 as confirmation of the progress:
   linux-2.6.37-rc1          876588    1161876    1643430               -13.1%       even better than 2.6.36-fix

That means 2.6.37-rc1 really delivers what we hoped for.
And in the end it even turned out a little better than 2.6.36 + your fixes.
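
For reference, the deviation column above appears to be the 16-process result relative to the linux-2.6.30 baseline. A quick standalone cross-check (an editorial sketch, not from the original mails; throughput units as reported by iozone, and small differences from the quoted percentages can come from averaging and rounding of the three runs):

#include <stdio.h>

int main(void)
{
	const double base = 1892624;	/* linux-2.6.30, 16 processes */
	const struct { const char *kernel; double throughput; } rows[] = {
		{ "linux-2.6.32",          932938 },
		{ "linux-2.6.36-rc4-fix", 1470659 },
		{ "linux-2.6.37-rc1",     1643430 },
	};

	for (unsigned int i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%-22s %+6.1f%%\n", rows[i].kernel,
		       100.0 * (rows[i].throughput - base) / base);
	return 0;
}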

 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2
  2010-11-03 10:50       ` Christian Ehrhardt
@ 2010-11-10 14:37         ` Mel Gorman
  -1 siblings, 0 replies; 59+ messages in thread
From: Mel Gorman @ 2010-11-10 14:37 UTC (permalink / raw)
  To: Christian Ehrhardt
  Cc: Andrew Morton, linux-mm, linux-fsdevel, Linux Kernel List,
	Johannes Weiner, Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

On Wed, Nov 03, 2010 at 11:50:35AM +0100, Christian Ehrhardt wrote:
> 
> 
> On 10/18/2010 03:55 PM, Mel Gorman wrote:
> > On Thu, Oct 14, 2010 at 05:28:33PM +0200, Christian Ehrhardt wrote:
> > 
> >> Seeing the patches Mel sent a few weeks ago I realized that this series
> >> might be at least partially related to my reports in 1Q 2010 - so I ran my
> >> testcase on a few kernels to provide you with some more backing data.
> > 
> > Thanks very much for revisiting this.
> > 
> >> Results are always the average of three iozone runs as it is known to be somewhat noisy - especially when affected by the issue I try to show here.
> >> As discussed in detail in older threads the setup uses 16 disks and scales the number of concurrent iozone processes.
> >> Processes are evenly distributed so that it always is one process per disk.
> >> In the past we reported 40% to 80% degradation for the sequential read case based on 2.6.32 which can still be seen.
> >> What we found was that page cache allocations with the GFP_COLD flag loop for a long time between try_to_free, get_page and reclaim: because freeing makes some progress each pass, the GFP_COLD allocations keep looping and retrying.
> >> In addition my case had no writes at all, which forced congestion_wait to wait the full timeout all the time.
> >>
> >> Kernel (git)                   4          8         16   deviation #16 case                           comment
> >> linux-2.6.30              902694    1396073    1892624                 base                              base
> >> linux-2.6.32              752008     990425     932938               -50.7%     impact as reported in 1Q 2010
> >> linux-2.6.35               63532      71573      64083               -96.6%                    got even worse
> >> linux-2.6.35.6            176485     174442     212102               -88.8%  fixes useful, but still far away
> >> linux-2.6.36-rc4-trace    119683     188997     187012               -90.1%                         still bad
> >> linux-2.6.36-rc4-fix      884431    1114073    1470659               -22.3%            Mels fixes help a lot!
> >>
> [...]
> > If all goes according to plan,
> > kernel 2.6.37-rc1 will be of interest. Thanks again.
> 
> Here a measurement with 2.6.37-rc1 as confirmation of progress:
>    linux-2.6.37-rc1          876588    1161876    1643430               -13.1%       even better than 2.6.36-fix
> 

Ok, great. There were a few other changes related to reclaim and
writeback that I expected to help, but I was not certain. It's good to
have confirmation.

> That means 2.6.37-rc1 really shows what we hoped for.
> And it eventually even turned out a little bit better than 2.6.36 + your fixes.
> 

Good. I looked over your data and I see we are still losing time, but I
don't yet have new ideas on how to improve it further without falling into
the "special case" hole. I'll keep at it and hopefully we can get read
performance back to parity while still keeping the write improvements.

Thanks a lot for testing this.

-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2010-11-10 14:37 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-15 12:27 [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2 Mel Gorman
2010-09-15 12:27 ` Mel Gorman
2010-09-15 12:27 ` [PATCH 1/8] tracing, vmscan: Add trace events for LRU list shrinking Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-15 12:27 ` [PATCH 2/8] writeback: Account for time spent congestion_waited Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-15 12:27 ` [PATCH 3/8] vmscan: Synchronous lumpy reclaim should not call congestion_wait() Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-15 12:27 ` [PATCH 4/8] vmscan: Narrow the scenarios lumpy reclaim uses synchrounous reclaim Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-15 12:27 ` [PATCH 5/8] vmscan: Remove dead code in shrink_inactive_list() Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-15 12:27 ` [PATCH 6/8] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-15 12:27 ` [PATCH 7/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-16  7:59   ` Minchan Kim
2010-09-16  7:59     ` Minchan Kim
2010-09-16  8:23     ` Mel Gorman
2010-09-16  8:23       ` Mel Gorman
2010-09-15 12:27 ` [PATCH 8/8] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone Mel Gorman
2010-09-15 12:27   ` Mel Gorman
2010-09-16  8:13   ` Minchan Kim
2010-09-16  8:13     ` Minchan Kim
2010-09-16  9:18     ` Mel Gorman
2010-09-16  9:18       ` Mel Gorman
2010-09-16 14:11       ` Minchan Kim
2010-09-16 14:11         ` Minchan Kim
2010-09-16 15:18         ` Mel Gorman
2010-09-16 15:18           ` Mel Gorman
2010-09-16 22:28   ` Andrew Morton
2010-09-16 22:28     ` Andrew Morton
2010-09-20  9:52     ` Mel Gorman
2010-09-20  9:52       ` Mel Gorman
2010-09-21 21:44       ` Andrew Morton
2010-09-21 21:44         ` Andrew Morton
2010-09-21 22:10         ` Mel Gorman
2010-09-21 22:10           ` Mel Gorman
2010-09-21 22:24           ` Andrew Morton
2010-09-21 22:24             ` Andrew Morton
2010-09-20 13:05   ` [PATCH] writeback: Do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encounted in the current zone fix Mel Gorman
2010-09-20 13:05     ` Mel Gorman
2010-09-16 22:28 ` [PATCH 0/8] Reduce latencies and improve overall reclaim efficiency v2 Andrew Morton
2010-09-16 22:28   ` Andrew Morton
2010-09-17  7:52   ` Mel Gorman
2010-09-17  7:52     ` Mel Gorman
2010-10-14 15:28 ` Christian Ehrhardt
2010-10-14 15:28   ` Christian Ehrhardt
2010-10-14 15:28   ` Christian Ehrhardt
2010-10-18 13:55   ` Mel Gorman
2010-10-18 13:55     ` Mel Gorman
2010-10-22 12:29     ` Christian Ehrhardt
2010-10-22 12:29       ` Christian Ehrhardt
2010-10-22 12:29       ` Christian Ehrhardt
2010-11-03 10:50     ` Christian Ehrhardt
2010-11-03 10:50       ` Christian Ehrhardt
2010-11-03 10:50       ` Christian Ehrhardt
2010-11-10 14:37       ` Mel Gorman
2010-11-10 14:37         ` Mel Gorman
