* [PATCH 0/12] Avoid overflowing of stack during page reclaim V2
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

This is a merge of two series: the first reduces stack usage in page reclaim
and the second writes contiguous pages during reclaim and avoids writeback
in direct reclaimers.

Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to write back pages as they are not expected to overflow the stack
  o Drop the contiguous-write patch for the moment

There is a problem with the stack depth used by page reclaim. During direct
reclaim in particular, it is possible to overflow the stack if reclaim calls
into a filesystem's writepage function. This patch series aims to trace
writeback so it can be evaluated how many dirty pages are being written,
to reduce the stack usage of page reclaim in general and to stop direct
reclaim writing back pages and overflowing the stack.

The first 4 patches are a forward-port of trace points that are partly based
on trace points defined by Larry Woodman but never merged. They trace parts of
kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints
can be used to evaluate what is happening within reclaim and whether things
are getting better or worse. They do not have to be part of the final series
but might be useful during discussion and for later regression testing -
particularly around the percentage of time spent in reclaim.

The 6 patches after that reduce the stack footprint of page reclaim by moving
large allocations out of the main call path. Functionally they should be
equivalent, although there is a timing change in exactly when pages get freed.
This is aimed at giving filesystems as much stack as possible if kswapd is
to write back pages directly.
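
As an illustration of the kind of change involved, here is a sketch of the
idea only - not a hunk from this series, and the helper name is made up. A
pagevec that previously lived on the stack for the whole reclaim path is
instead confined to the small helper that actually frees the pages:

    static void free_page_list_sketch(struct list_head *free_pages)
    {
        struct pagevec pvec;    /* on the stack only while freeing */
        struct page *page, *next;

        pagevec_init(&pvec, 1);
        list_for_each_entry_safe(page, next, free_pages, lru) {
            list_del(&page->lru);
            if (!pagevec_add(&pvec, page)) {
                __pagevec_free(&pvec);
                pagevec_reinit(&pvec);
            }
        }
        if (pagevec_count(&pvec))
            __pagevec_free(&pvec);
    }

The deep call chain into the filesystem then never has the pagevec sitting
above it on the stack.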

Patch 11 moves dirty pages onto a temporary list as they are found and then
writes them all out with a helper function. This simplifies patch 12 and
also increases the chances that the IO requests can be merged.
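
Roughly, the intent is along these lines - a sketch only, with an invented
helper name, that skips the page locking, dirty-bit clearing and requeueing
the real code has to handle:

    static void write_dirty_list_sketch(struct list_head *dirty_pages)
    {
        struct writeback_control wbc = {
            .sync_mode   = WB_SYNC_NONE,    /* async IO */
            .nr_to_write = 1,
        };
        struct page *page, *next;

        list_for_each_entry_safe(page, next, dirty_pages, lru) {
            struct address_space *mapping = page_mapping(page);

            /*
             * Issue ->writepage() for each deferred dirty page.
             * Submitting neighbouring pages back-to-back gives
             * the elevator a better chance of merging requests.
             */
            if (mapping && mapping->a_ops->writepage)
                mapping->a_ops->writepage(page, &wbc);
        }
    }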

Patch 12 prevents direct reclaim from writing out pages at all; instead,
dirty pages are put back on the LRU. For lumpy reclaim, the caller will
briefly wait on dirty pages to be written out before trying to reclaim
them a second time. This increases the responsibility of kswapd somewhat
because it is now cleaning pages on behalf of direct reclaimers, but kswapd
seemed a better fit than the background flushers as it knows where the
pages needing cleaning are. As the IO is asynchronous, it should not cause
kswapd to stall (at least until the queue is congested), but the order in
which pages are reclaimed from the LRU is altered: dirty pages that would
have been reclaimed by direct reclaimers get another lap on the LRU. The
dirty pages could have been put on a dedicated list, but that increased
counter overhead and the number of lists, and it is unclear whether it is
necessary.
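
The policy change itself boils down to a fragment like the following in
shrink_page_list() - again a sketch rather than the actual diff, and the
brief lumpy-reclaim wait described above is omitted:

    /*
     * Sketch: only kswapd writes back pages from reclaim. A direct
     * reclaimer that finds a dirty page leaves it dirty and sends
     * it back to the LRU instead of calling ->writepage().
     */
    if (PageDirty(page) && !current_is_kswapd())
        goto keep_locked;    /* page goes back on the LRU */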

Apologies for the length of the rest of the mail. Measuring the impact of
this is not exactly straightforward.

I ran a number of tests with monitoring on X86, X86-64 and PPC64; I'll
cover the X86-64 results here. The machine is a 4-core AMD Phenom with 2G
of RAM, a single disk and the onboard IO controller. dirty_ratio was left
at 20 (tests with 40 are in progress). All of the tests were run on XFS.

Three kernels are compared.

traceonly-v2r5		is the first 4 patches of this series
stackreduce-v2r5	is the first 10 patches of this series
nodirect-v2r5		is all patches in the series

The results for each test are broken up into three parts. The first part
compares the results of the test itself. The second part is a report based
on the ftrace postprocessing script in patch 4 and reports on direct reclaim
and kswapd activity. The third part reports what percentage of time was
spent in direct reclaim and kswapd being awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and the percentage is the stalled time over that total.
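
Spelled out, that calculation is simply (my restatement of the above, with
the stall time converted from milliseconds to seconds):

    total  = User + Sys + Stall
    stall% = Stall / total * 100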

kernbench
=========

                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Elapsed min       98.16 ( 0.00%)    97.95 ( 0.21%)    98.25 (-0.09%)
Elapsed mean      98.29 ( 0.00%)    98.26 ( 0.03%)    98.40 (-0.11%)
Elapsed stddev     0.08 ( 0.00%)     0.20 (-165.90%)     0.12 (-59.87%)
Elapsed max       98.34 ( 0.00%)    98.51 (-0.17%)    98.58 (-0.24%)
User    min      311.03 ( 0.00%)   311.74 (-0.23%)   311.13 (-0.03%)
User    mean     311.28 ( 0.00%)   312.45 (-0.38%)   311.42 (-0.05%)
User    stddev     0.24 ( 0.00%)     0.51 (-114.08%)     0.38 (-59.06%)
User    max      311.58 ( 0.00%)   312.94 (-0.44%)   312.06 (-0.15%)
System  min       40.54 ( 0.00%)    39.65 ( 2.20%)    40.34 ( 0.49%)
System  mean      40.80 ( 0.00%)    40.01 ( 1.93%)    40.81 (-0.03%)
System  stddev     0.23 ( 0.00%)     0.34 (-47.57%)     0.29 (-25.47%)
System  max       41.04 ( 0.00%)    40.51 ( 1.29%)    41.11 (-0.17%)
CPU     min      357.00 ( 0.00%)   357.00 ( 0.00%)   357.00 ( 0.00%)
CPU     mean     357.75 ( 0.00%)   358.00 (-0.07%)   357.75 ( 0.00%)
CPU     stddev     0.43 ( 0.00%)     0.71 (-63.30%)     0.43 ( 0.00%)
CPU     max      358.00 ( 0.00%)   359.00 (-0.28%)   358.00 ( 0.00%)

FTrace Reclaim Statistics
                                     traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                  0          0          0 
Direct reclaim pages scanned                     0          0          0 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                             0          0          0 
Kswapd wakeups                                   0          0          0 
Kswapd pages scanned                             0          0          0 
Kswapd reclaim write async I/O                   0          0          0 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.00       0.00       0.00 
Time kswapd awake (ms)                        0.00       0.00       0.00 

User/Sys Time Running Test (seconds)       2144.58   2146.22    2144.8
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                788.42    794.85    793.66
Percentage Time kswapd Awake                 0.00%     0.00%     0.00%

kernbench is a straightforward kernel compile. The kernel is built 5 times
and an average taken. There was no interesting difference in terms of
performance. As the workload fits easily in memory, there was no page
reclaim activity.

IOZone
======
                       traceonly-v2r5       stackreduce-v2r5     nodirect-v2r5
write-64               395452 ( 0.00%)       397208 ( 0.44%)       397796 ( 0.59%)
write-128              463696 ( 0.00%)       460514 (-0.69%)       458940 (-1.04%)
write-256              504861 ( 0.00%)       506050 ( 0.23%)       502969 (-0.38%)
write-512              490875 ( 0.00%)       485767 (-1.05%)       494264 ( 0.69%)
write-1024             497574 ( 0.00%)       489689 (-1.61%)       505956 ( 1.66%)
write-2048             500993 ( 0.00%)       503076 ( 0.41%)       510097 ( 1.78%)
write-4096             504491 ( 0.00%)       502073 (-0.48%)       506993 ( 0.49%)
write-8192             488398 ( 0.00%)       228857 (-113.41%)       313871 (-55.60%)
write-16384            409006 ( 0.00%)       433783 ( 5.71%)       365696 (-11.84%)
write-32768            473136 ( 0.00%)       481153 ( 1.67%)       486373 ( 2.72%)
write-65536            474833 ( 0.00%)       477970 ( 0.66%)       481192 ( 1.32%)
write-131072           429557 ( 0.00%)       452604 ( 5.09%)       459840 ( 6.59%)
write-262144           397934 ( 0.00%)       401955 ( 1.00%)       397479 (-0.11%)
write-524288           222849 ( 0.00%)       230297 ( 3.23%)       209999 (-6.12%)
rewrite-64            1452528 ( 0.00%)      1492919 ( 2.71%)      1452528 ( 0.00%)
rewrite-128           1622919 ( 0.00%)      1618028 (-0.30%)      1663139 ( 2.42%)
rewrite-256           1694118 ( 0.00%)      1704877 ( 0.63%)      1639786 (-3.31%)
rewrite-512           1753325 ( 0.00%)      1740536 (-0.73%)      1730717 (-1.31%)
rewrite-1024          1741104 ( 0.00%)      1787480 ( 2.59%)      1759651 ( 1.05%)
rewrite-2048          1710867 ( 0.00%)      1747411 ( 2.09%)      1747411 ( 2.09%)
rewrite-4096          1583280 ( 0.00%)      1621536 ( 2.36%)      1599942 ( 1.04%)
rewrite-8192          1308005 ( 0.00%)      1338579 ( 2.28%)      1307358 (-0.05%)
rewrite-16384         1293742 ( 0.00%)      1314178 ( 1.56%)      1291602 (-0.17%)
rewrite-32768         1298360 ( 0.00%)      1314503 ( 1.23%)      1276758 (-1.69%)
rewrite-65536         1289212 ( 0.00%)      1316088 ( 2.04%)      1281351 (-0.61%)
rewrite-131072        1286117 ( 0.00%)      1309070 ( 1.75%)      1283007 (-0.24%)
rewrite-262144        1285902 ( 0.00%)      1305816 ( 1.53%)      1274121 (-0.92%)
rewrite-524288         220417 ( 0.00%)       223971 ( 1.59%)       226133 ( 2.53%)
read-64               3203069 ( 0.00%)      2467108 (-29.83%)      3541098 ( 9.55%)
read-128              3759450 ( 0.00%)      4267461 (11.90%)      4233807 (11.20%)
read-256              4264168 ( 0.00%)      4350555 ( 1.99%)      3935921 (-8.34%)
read-512              3437042 ( 0.00%)      3366987 (-2.08%)      3437042 ( 0.00%)
read-1024             3738636 ( 0.00%)      3821805 ( 2.18%)      3735385 (-0.09%)
read-2048             3938881 ( 0.00%)      3984558 ( 1.15%)      3967993 ( 0.73%)
read-4096             3631489 ( 0.00%)      3828122 ( 5.14%)      3781775 ( 3.97%)
read-8192             3175046 ( 0.00%)      3230268 ( 1.71%)      3247058 ( 2.22%)
read-16384            2923635 ( 0.00%)      2869911 (-1.87%)      2954684 ( 1.05%)
read-32768            2819040 ( 0.00%)      2839776 ( 0.73%)      2852152 ( 1.16%)
read-65536            2659324 ( 0.00%)      2827502 ( 5.95%)      2816464 ( 5.58%)
read-131072           2707652 ( 0.00%)      2727534 ( 0.73%)      2746406 ( 1.41%)
read-262144           2765929 ( 0.00%)      2782166 ( 0.58%)      2776125 ( 0.37%)
read-524288           2810803 ( 0.00%)      2822894 ( 0.43%)      2822394 ( 0.41%)
reread-64             5389653 ( 0.00%)      5860307 ( 8.03%)      5735102 ( 6.02%)
reread-128            5122535 ( 0.00%)      5325799 ( 3.82%)      5325799 ( 3.82%)
reread-256            3245838 ( 0.00%)      3285566 ( 1.21%)      3236056 (-0.30%)
reread-512            4340054 ( 0.00%)      4571003 ( 5.05%)      4742616 ( 8.49%)
reread-1024           4265934 ( 0.00%)      4356809 ( 2.09%)      4374559 ( 2.48%)
reread-2048           3915540 ( 0.00%)      4301837 ( 8.98%)      4338776 ( 9.75%)
reread-4096           3846119 ( 0.00%)      3984379 ( 3.47%)      3979764 ( 3.36%)
reread-8192           3257215 ( 0.00%)      3304518 ( 1.43%)      3325949 ( 2.07%)
reread-16384          2959519 ( 0.00%)      2892622 (-2.31%)      2995773 ( 1.21%)
reread-32768          2570835 ( 0.00%)      2607266 ( 1.40%)      2610783 ( 1.53%)
reread-65536          2731466 ( 0.00%)      2683809 (-1.78%)      2691957 (-1.47%)
reread-131072         2738144 ( 0.00%)      2763886 ( 0.93%)      2776056 ( 1.37%)
reread-262144         2781012 ( 0.00%)      2781786 ( 0.03%)      2784322 ( 0.12%)
reread-524288         2787049 ( 0.00%)      2779851 (-0.26%)      2787681 ( 0.02%)
randread-64           1204796 ( 0.00%)      1204796 ( 0.00%)      1143223 (-5.39%)
randread-128          4135958 ( 0.00%)      4012317 (-3.08%)      4135958 ( 0.00%)
randread-256          3454704 ( 0.00%)      3511189 ( 1.61%)      3511189 ( 1.61%)
randread-512          3437042 ( 0.00%)      3366987 (-2.08%)      3437042 ( 0.00%)
randread-1024         3301774 ( 0.00%)      3401130 ( 2.92%)      3403826 ( 3.00%)
randread-2048         3549844 ( 0.00%)      3391470 (-4.67%)      3477979 (-2.07%)
randread-4096         3214912 ( 0.00%)      3261295 ( 1.42%)      3258820 ( 1.35%)
randread-8192         2818958 ( 0.00%)      2836645 ( 0.62%)      2861450 ( 1.48%)
randread-16384        2571662 ( 0.00%)      2515924 (-2.22%)      2564465 (-0.28%)
randread-32768        2319848 ( 0.00%)      2331892 ( 0.52%)      2333594 ( 0.59%)
randread-65536        2288193 ( 0.00%)      2297103 ( 0.39%)      2301123 ( 0.56%)
randread-131072       2270669 ( 0.00%)      2275707 ( 0.22%)      2279150 ( 0.37%)
randread-262144       2258949 ( 0.00%)      2264700 ( 0.25%)      2259975 ( 0.05%)
randread-524288       2250529 ( 0.00%)      2240365 (-0.45%)      2242837 (-0.34%)
randwrite-64           942521 ( 0.00%)       939223 (-0.35%)       939223 (-0.35%)
randwrite-128          962469 ( 0.00%)       971174 ( 0.90%)       969421 ( 0.72%)
randwrite-256          980760 ( 0.00%)       980760 ( 0.00%)       966633 (-1.46%)
randwrite-512         1190529 ( 0.00%)      1138158 (-4.60%)      1040545 (-14.41%)
randwrite-1024        1361836 ( 0.00%)      1361836 ( 0.00%)      1367037 ( 0.38%)
randwrite-2048        1325646 ( 0.00%)      1364390 ( 2.84%)      1361794 ( 2.65%)
randwrite-4096        1360371 ( 0.00%)      1372653 ( 0.89%)      1363935 ( 0.26%)
randwrite-8192        1291680 ( 0.00%)      1305272 ( 1.04%)      1285206 (-0.50%)
randwrite-16384       1253666 ( 0.00%)      1255865 ( 0.18%)      1231889 (-1.77%)
randwrite-32768       1239139 ( 0.00%)      1250641 ( 0.92%)      1224239 (-1.22%)
randwrite-65536       1223186 ( 0.00%)      1228115 ( 0.40%)      1203094 (-1.67%)
randwrite-131072      1207002 ( 0.00%)      1215733 ( 0.72%)      1198229 (-0.73%)
randwrite-262144      1184954 ( 0.00%)      1201542 ( 1.38%)      1179145 (-0.49%)
randwrite-524288        96156 ( 0.00%)        96502 ( 0.36%)        95942 (-0.22%)
bkwdread-64           2923952 ( 0.00%)      3022727 ( 3.27%)      2772930 (-5.45%)
bkwdread-128          3785961 ( 0.00%)      3657016 (-3.53%)      3657016 (-3.53%)
bkwdread-256          3017775 ( 0.00%)      3052087 ( 1.12%)      3159869 ( 4.50%)
bkwdread-512          2875558 ( 0.00%)      2845081 (-1.07%)      2841317 (-1.21%)
bkwdread-1024         3083680 ( 0.00%)      3181915 ( 3.09%)      3140041 ( 1.79%)
bkwdread-2048         3266376 ( 0.00%)      3282603 ( 0.49%)      3281349 ( 0.46%)
bkwdread-4096         3207709 ( 0.00%)      3287506 ( 2.43%)      3248345 ( 1.25%)
bkwdread-8192         2777710 ( 0.00%)      2792156 ( 0.52%)      2777036 (-0.02%)
bkwdread-16384        2565614 ( 0.00%)      2570412 ( 0.19%)      2541795 (-0.94%)
bkwdread-32768        2472332 ( 0.00%)      2495631 ( 0.93%)      2284260 (-8.23%)
bkwdread-65536        2435202 ( 0.00%)      2391036 (-1.85%)      2361477 (-3.12%)
bkwdread-131072       2417850 ( 0.00%)      2453436 ( 1.45%)      2417903 ( 0.00%)
bkwdread-262144       2468467 ( 0.00%)      2491433 ( 0.92%)      2446649 (-0.89%)
bkwdread-524288       2513411 ( 0.00%)      2534789 ( 0.84%)      2486486 (-1.08%)

FTrace Reclaim Statistics
                                  traceonly-v2r5     stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                  0          0          0 
Direct reclaim pages scanned                     0          0          0 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                             0          0          0 
Kswapd wakeups                                   0          0          0 
Kswapd pages scanned                             0          0          0 
Kswapd reclaim write async I/O                   0          0          0 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.00       0.00       0.00 
Time kswapd awake (ms)                        0.00       0.00       0.00 

User/Sys Time Running Test (seconds)         14.54     14.43     14.51
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                106.39    104.49    107.15
Percentage Time kswapd Awake                 0.00%     0.00%     0.00%

No big surprises in terms of performance. I know there are gains and losses
but I've always had trouble getting very stable figures out of IOZone so
I find it hard to draw conclusions from them. I should revisit what I'm
doing there to see what's wrong.

In terms of reclaim, nothing interesting happened.

SysBench
========
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
           1 11025.01 ( 0.00%) 10249.52 (-7.57%) 10430.57 (-5.70%)
           2  3844.63 ( 0.00%)  4988.95 (22.94%)  4038.95 ( 4.81%)
           3  3210.23 ( 0.00%)  2918.52 (-9.99%)  3113.38 (-3.11%)
           4  1958.91 ( 0.00%)  1987.69 ( 1.45%)  1808.37 (-8.32%)
           5  2864.92 ( 0.00%)  3126.13 ( 8.36%)  2355.70 (-21.62%)
           6  4831.63 ( 0.00%)  3815.67 (-26.63%)  4164.09 (-16.03%)
           7  3788.37 ( 0.00%)  3140.39 (-20.63%)  3471.36 (-9.13%)
           8  2293.61 ( 0.00%)  1636.87 (-40.12%)  1754.25 (-30.75%)
FTrace Reclaim Statistics
                                     traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                               9843      13398      51651 
Direct reclaim pages scanned                871367    1008709    3080593 
Direct reclaim write async I/O               24883      30699          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       7070819    6961672   11268341 
Kswapd wakeups                                1578       1500        943 
Kswapd pages scanned                      22016558   21779455   17393431 
Kswapd reclaim write async I/O             1161346    1101641    1717759 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)             26.11      45.04       2.97 
Time kswapd awake (ms)                     5105.06    5135.93    6086.32 

User/Sys Time Running Test (seconds)        734.52    712.39     703.9
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               9710.02   9589.20   9334.45
Percentage Time kswapd Awake                 0.06%     0.00%     0.00%

Unlike other sysbench results I post, this is the result of a read/write
test. As the machine is under-provisioned for this type of test, the
figures are very unstable. For example, for each of the thread counts from
1-8, the test is run a minimum of 5 times. If the estimated mean is not
within 1%, it is run up to a maximum of 10 times trying to get a stable
average. None of these averages are stable, with variances of up to 15%.
Part of the problem is that larger thread counts push the test into swap
as the memory is insufficient. I could tune for this, but it was reclaim
that was important.

To illustrate, here is a graph of total IO compared with swap IO as
reported by vmstat. The update frequency was 2 seconds. The IO shown in
the graph is close to the maximum capacity of the disk for the entire
duration of the test, with swap kicking in every so often, so this was
heavily IO bound.

http://www.csn.ul.ie/~mel/postings/nodirect-20100614/totalio-swapio-comparison.ps

The writing back of dirty pages by direct reclaim did happen, but it was a
negligible portion of the overall IO. For example, with the trace-only
kernel, a total of 97MB, or 2% of the pages scanned by direct reclaim, was
written back, all of it as async IO. The time stalled in direct reclaim was
negligible, although, as you'd expect, it is reduced further by not writing
back at all.

What is interesting is that kswapd wake requests increased when direct
reclaim stopped writing back pages. My theory is that the processes are
making forward progress, meaning they need more memory faster and are not
calling congestion_wait() as they would have before. kswapd is awake longer
as a result of direct reclaim not writing back pages.

Between 5% and 10% of the pages scanned by kswapd needed to be written back
on these three kernels. As the disk was maxed out all of the time, I'm
having trouble deciding whether this is "too much IO" or not, but I'm
leaning towards "no". I'd expect the flusher threads were also having a
hard time getting IO bandwidth.

Simple Writeback Test
=====================
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                               1923       2394       2683 
Direct reclaim pages scanned                246400     252256     282896 
Direct reclaim write async I/O                   0          0          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       1496401    1648245    1709870 
Kswapd wakeups                                1140       1118       1113 
Kswapd pages scanned                      10999585   10982748   10851473 
Kswapd reclaim write async I/O                   0          0       1398 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)              0.01       0.01       0.01 
Time kswapd awake (ms)                      293.54     293.68     285.51 

User/Sys Time Running Test (seconds)        105.17    102.81    104.34
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)                638.75    640.63    639.42
Percentage Time kswapd Awake                 0.04%     0.00%     0.00%

This test starts with 4 threads, doubling the number of threads on each
iteration up to 64. Each iteration writes 4*RAM worth of files to disk
using dd to dirty memory, with conv=fsync to get some sort of stability in
the results.

Direct reclaim writeback was not a problem for this test even though a
number of pages were scanned, so there is no reason not to disable it.
kswapd wrote back a small number of pages in the "nodirect" kernel, but it
is likely a timing issue.

Stress HighAlloc
================
              stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Pass 1          76.00 ( 0.00%)    77.00 ( 1.00%)    70.00 (-6.00%)
Pass 2          78.00 ( 0.00%)    78.00 ( 0.00%)    73.00 (-5.00%)
At Rest         80.00 ( 0.00%)    79.00 (-1.00%)    78.00 (-2.00%)

FTrace Reclaim Statistics
              stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                               1245       1247       1369 
Direct reclaim pages scanned                180262     177032     164337 
Direct reclaim write async I/O               35211      30075          0 
Direct reclaim write sync I/O                17994      13127          0 
Wake kswapd requests                          4211       4868       4386 
Kswapd wakeups                                 842        808        739 
Kswapd pages scanned                      41757111   33360435    9946823 
Kswapd reclaim write async I/O             4872154    3154195     791840 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)           9064.33    7249.29    2868.22 
Time kswapd awake (ms)                     6250.77    4612.99    1937.64 

User/Sys Time Running Test (seconds)       2822.01   2812.45    2629.6
Percentage Time Spent Direct Reclaim         0.10%     0.00%     0.00%
Total Elapsed Time (seconds)              11365.05   9682.00   5210.19
Percentage Time kswapd Awake                 0.02%     0.00%     0.00%

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated as huge
pages under load (Pass 1), on a second attempt under load (Pass 2) and when
the kernel compiles have finished and the system is quiet (At Rest).

The success figures were comparable with or without direct reclaim
writeback. I know PPC64's success rates are hit a lot worse than this, but
I think that could be improved by other means than what we have today, so
I'm less worried about it.

Unlike the other tests, synchronous direct writeback is a factor in this
test because of lumpy reclaim, and it increases the stall time of a lumpy
reclaimer by quite a margin. Compare the "Time stalled direct reclaim"
between the traceonly and nodirect kernels - the nodirect kernel is stalled
less than a third of the time. Interestingly, the time kswapd spends awake
is significantly reduced as well and, overall, the test completes a lot
faster.

Whether this is because of improved IO patterns or just because lumpy
reclaim stalling on synchronous IO is a bad idea, I'm not sure, but either
way the patch looks like a good idea from the perspective of this test.

Based on this series of tests at least, there appear to be good reasons
for preventing direct reclaim from writing back pages, and it does not
appear we are currently spending a lot of our time in writeback. It remains
to be seen whether that is still true when dirty_ratio is higher (e.g. 40)
or the amount of available memory differs (e.g. 256MB), but the trace
points and post-processing script can be used to help figure it out.

Comments?

KOSAKI Motohiro (2):
  vmscan: kill prev_priority completely
  vmscan: simplify shrink_inactive_list()

Mel Gorman (11):
  tracing, vmscan: Add trace events for kswapd wakeup, sleeping and
    direct reclaim
  tracing, vmscan: Add trace events for LRU page isolation
  tracing, vmscan: Add trace event when a page is written
  tracing, vmscan: Add a postprocessing script for reclaim-related
    ftrace events
  vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
  vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  vmscan: Setup pagevec as late as possible in shrink_page_list()
  vmscan: Update isolated page counters outside of main path in
    shrink_inactive_list()
  vmscan: Write out dirty pages in batch
  vmscan: Do not writeback pages in direct reclaim
  fix: script formatting

 .../trace/postprocess/trace-vmscan-postprocess.pl  |  623 ++++++++++++++++++++
 include/linux/mmzone.h                             |   15 -
 include/trace/events/gfpflags.h                    |   37 ++
 include/trace/events/kmem.h                        |   38 +--
 include/trace/events/vmscan.h                      |  184 ++++++
 mm/page_alloc.c                                    |    2 -
 mm/vmscan.c                                        |  622 ++++++++++++--------
 mm/vmstat.c                                        |    2 -
 8 files changed, 1223 insertions(+), 300 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h


* [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

This patch adds two trace events for kswapd waking up and going to sleep,
for the purposes of tracking kswapd activity, and two trace events for
direct reclaim beginning and ending. The information can be used to work
out how much time a process or the system is spending on the reclamation
of pages and, in the case of direct reclaim, how many pages were reclaimed
for that process. High-frequency triggering of these events could point to
memory pressure problems.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/gfpflags.h |   37 +++++++++++++
 include/trace/events/kmem.h     |   38 +-------------
 include/trace/events/vmscan.h   |  115 +++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                     |   24 +++++++--
 4 files changed, 173 insertions(+), 41 deletions(-)
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h

diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ *  GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",				\
+	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
+	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
+	{(unsigned long)GFP_USER,		"GFP_USER"},		\
+	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
+	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
+	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
+	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
+	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
+	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
+	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
+	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
+	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
+	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
+	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
+	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
+	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
+	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
+	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
+	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
+	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
+	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
+	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
+	) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@
 
 #include <linux/types.h>
 #include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- *  GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags)						\
-	(flags) ? __print_flags(flags, "|",				\
-	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"}, \
-	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
-	{(unsigned long)GFP_USER,		"GFP_USER"},		\
-	{(unsigned long)GFP_TEMPORARY,		"GFP_TEMPORARY"},	\
-	{(unsigned long)GFP_KERNEL,		"GFP_KERNEL"},		\
-	{(unsigned long)GFP_NOFS,		"GFP_NOFS"},		\
-	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
-	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
-	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
-	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
-	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
-	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
-	{(unsigned long)__GFP_REPEAT,		"GFP_REPEAT"},		\
-	{(unsigned long)__GFP_NOFAIL,		"GFP_NOFAIL"},		\
-	{(unsigned long)__GFP_NORETRY,		"GFP_NORETRY"},		\
-	{(unsigned long)__GFP_COMP,		"GFP_COMP"},		\
-	{(unsigned long)__GFP_ZERO,		"GFP_ZERO"},		\
-	{(unsigned long)__GFP_NOMEMALLOC,	"GFP_NOMEMALLOC"},	\
-	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
-	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
-	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"}		\
-	) : "GFP_NOWAIT"
+#include "gfpflags.h"
 
 DECLARE_EVENT_CLASS(kmem_alloc,
 
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f76521f
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,115 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+	TP_PROTO(int nid),
+
+	TP_ARGS(nid),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+	),
+
+	TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+	TP_PROTO(int nid, int order),
+
+	TP_ARGS(nid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,	nid	)
+		__field(	int,	order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid	= nid;
+		__entry->order	= order;
+	),
+
+	TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+	TP_PROTO(int nid, int zid, int order),
+
+	TP_ARGS(nid, zid, order),
+
+	TP_STRUCT__entry(
+		__field(	int,		nid	)
+		__field(	int,		zid	)
+		__field(	int,		order	)
+	),
+
+	TP_fast_assign(
+		__entry->nid		= nid;
+		__entry->zid		= zid;
+		__entry->order		= order;
+	),
+
+	TP_printk("nid=%d zid=%d order=%d",
+		__entry->nid,
+		__entry->zid,
+		__entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+	TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+	TP_ARGS(order, may_writepage, gfp_flags),
+
+	TP_STRUCT__entry(
+		__field(	int,	order		)
+		__field(	int,	may_writepage	)
+		__field(	gfp_t,	gfp_flags	)
+	),
+
+	TP_fast_assign(
+		__entry->order		= order;
+		__entry->may_writepage	= may_writepage;
+		__entry->gfp_flags	= gfp_flags;
+	),
+
+	TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+		__entry->order,
+		__entry->may_writepage,
+		show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+	TP_PROTO(unsigned long nr_reclaimed),
+
+	TP_ARGS(nr_reclaimed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned long,	nr_reclaimed	)
+	),
+
+	TP_fast_assign(
+		__entry->nr_reclaimed	= nr_reclaimed;
+	),
+
+	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..6bfb579 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@
 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -1886,6 +1889,7 @@ out:
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
+	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
 		.may_writepage = !laptop_mode,
@@ -1898,7 +1902,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.nodemask = nodemask,
 	};
 
-	return do_try_to_free_pages(zonelist, &sc);
+	trace_mm_vmscan_direct_reclaim_begin(order,
+				sc.may_writepage,
+				gfp_mask);
+
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+	return nr_reclaimed;
 }
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2297,9 +2309,10 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 					schedule();
-				else {
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2319,8 +2332,10 @@ static int kswapd(void *p)
 		 * We can speed up thawing tasks if we don't call balance_pgdat
 		 * after returning from the refrigerator
 		 */
-		if (!ret)
+		if (!ret) {
+			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
 			balance_pgdat(pgdat, order);
+		}
 	}
 	return 0;
 }
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
 		return;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
+	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
 	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

This patch adds an event for when pages are isolated en masse from the
LRU lists. This event augments the information available on LRU traffic
and can be used to evaluate lumpy reclaim.
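
As an aside (not part of this patch), the contig_* fields reported by the
event make it straightforward to judge how often lumpy reclaim manages to
isolate the neighbouring pages it asks for. A minimal Perl sketch that
aggregates them from trace_pipe:

#!/usr/bin/perl
# Sketch only: aggregate the lumpy-reclaim counters from mm_vmscan_lru_isolate.
use strict;

my ($taken, $dirty, $failed) = (0, 0, 0);

while (<STDIN>) {
	next unless /mm_vmscan_lru_isolate:.*contig_taken=([0-9]+) contig_dirty=([0-9]+) contig_failed=([0-9]+)/;
	$taken  += $1;
	$dirty  += $2;
	$failed += $3;
}

printf "lumpy taken=%d (dirty=%d) failed=%d\n", $taken, $dirty, $failed;

The same counters feed the postprocessing script's HIGH_NR_CONTIG_DIRTY
statistic in patch 4.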

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |   46 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |   14 ++++++++++++
 2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f76521f..a331454 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end,
 	TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+	TP_PROTO(int order,
+		unsigned long nr_requested,
+		unsigned long nr_scanned,
+		unsigned long nr_taken,
+		unsigned long nr_lumpy_taken,
+		unsigned long nr_lumpy_dirty,
+		unsigned long nr_lumpy_failed,
+		int isolate_mode),
+
+	TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+	TP_STRUCT__entry(
+		__field(int, order)
+		__field(unsigned long, nr_requested)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_taken)
+		__field(unsigned long, nr_lumpy_taken)
+		__field(unsigned long, nr_lumpy_dirty)
+		__field(unsigned long, nr_lumpy_failed)
+		__field(int, isolate_mode)
+	),
+
+	TP_fast_assign(
+		__entry->order = order;
+		__entry->nr_requested = nr_requested;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_taken = nr_taken;
+		__entry->nr_lumpy_taken = nr_lumpy_taken;
+		__entry->nr_lumpy_dirty = nr_lumpy_dirty;
+		__entry->nr_lumpy_failed = nr_lumpy_failed;
+		__entry->isolate_mode = isolate_mode;
+	),
+
+	TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+		__entry->isolate_mode,
+		__entry->order,
+		__entry->nr_requested,
+		__entry->nr_scanned,
+		__entry->nr_taken,
+		__entry->nr_lumpy_taken,
+		__entry->nr_lumpy_dirty,
+		__entry->nr_lumpy_failed)
+);
+		
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6bfb579..25bf05a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -917,6 +917,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		unsigned long *scanned, int order, int mode, int file)
 {
 	unsigned long nr_taken = 0;
+	unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
 	unsigned long scan;
 
 	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -994,12 +995,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
 				nr_taken++;
+				nr_lumpy_taken++;
+				if (PageDirty(cursor_page))
+					nr_lumpy_dirty++;
 				scan++;
+			} else {
+				if (mode == ISOLATE_BOTH &&
+						page_count(cursor_page))
+					nr_lumpy_failed++;
 			}
 		}
 	}
 
 	*scanned = scan;
+
+	trace_mm_vmscan_lru_isolate(order, 
+			nr_to_scan, scan,
+			nr_taken,
+			nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+			mode);
 	return nr_taken;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 03/12] tracing, vmscan: Add trace event when a page is written
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

This patch adds a trace event for when page reclaim queues a page for IO and
records whether it is synchronous or asynchronous. Excessive synchronous
IO for a process can result in noticeable stalls during direct reclaim.
Excessive IO from page reclaim may indicate that the system is seriously
under-provisioned for the number of dirty pages that exist.
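
As an illustration (not part of this patch), the sync_io field makes it easy
to see which tasks are having reclaim issue synchronous writes. A minimal
Perl sketch counting sync versus async reclaim writeback per task from
trace_pipe, assuming the usual trace_pipe line layout:

#!/usr/bin/perl
# Sketch only: count synchronous vs asynchronous reclaim writeback per task.
use strict;

my (%sync, %async);

while (<STDIN>) {
	next unless /^\s*(\S+)\s+\[[0-9]*\].*mm_vmscan_writepage:.*sync_io=([0-9]+)/;
	$2 ? $sync{$1}++ : $async{$1}++;
}

my %tasks = (%sync, %async);
for my $task (sort keys %tasks) {
	printf "%-25s sync=%-8d async=%-8d\n",
		$task, $sync{$task} || 0, $async{$task} || 0;
}

The postprocessing script in patch 4 reports the same totals in its
Sync-IO and ASync-IO columns.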

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |   23 +++++++++++++++++++++++
 mm/vmscan.c                   |    2 ++
 2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a331454..b26daa9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
 		__entry->nr_lumpy_dirty,
 		__entry->nr_lumpy_failed)
 );
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+	TP_PROTO(struct page *page,
+		int sync_io),
+
+	TP_ARGS(page, sync_io),
+
+	TP_STRUCT__entry(
+		__field(struct page *, page)
+		__field(int, sync_io)
+	),
+
+	TP_fast_assign(
+		__entry->page = page;
+		__entry->sync_io = sync_io;
+	),
+
+	TP_printk("page=%p pfn=%lu sync_io=%d",
+		__entry->page,
+		page_to_pfn(__entry->page),
+		__entry->sync_io)
+);
 		
 #endif /* _TRACE_VMSCAN_H */
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25bf05a..58527c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			/* synchronous write or broken a_ops? */
 			ClearPageReclaim(page);
 		}
+		trace_mm_vmscan_writepage(page,
+			sync_writeback == PAGEOUT_IO_SYNC);
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

This patch adds a simple post-processing script for the reclaim-related
trace events. It can be used to give an indication of how much traffic
there is on the LRU lists and how severe the latencies caused by reclaim
are. Example output looks like the following:

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
	9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
	9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942   9-7063.33900000004 9-129960.482 9-2062.27500000002 \
	9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
	9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
	9-5522.37700000009 9-21630.429 \
	9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
	9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
	9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
	9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
	9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
	9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
	9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
	9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
	9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
	9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
	9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
	9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
	9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
	9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
	9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
	9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
	9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
	9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
	9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
	9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
	9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
	9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
	9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
	9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
	9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
	9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
	9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
	9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
	9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
	9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
	9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
	9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
	9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
	9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
	9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
	9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
	9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
	9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
	9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
	9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
	9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
	9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
	9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
	9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
	9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
	9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
	9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
	9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
	9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
	9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
	9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
	9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
	9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
	9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
	9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
	9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
	9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
	9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
	9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
	9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
	9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
	9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
	9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
	9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process                  Direct     Wokeup      Pages      Pages    Pages
details                   Rclms     Kswapd    Scanned    Sync-IO ASync-IO
cc1-30800                     0          1          0          0        0      wakeup-0=1
cc1-24260                     0          1          0          0        0      wakeup-0=1
cc1-24152                     0         12          0          0        0      wakeup-0=12
cc1-8139                      0          1          0          0        0      wakeup-0=1
cc1-4390                      0          1          0          0        0      wakeup-0=1
cc1-4648                      0          7          0          0        0      wakeup-0=7
cc1-4552                      0          3          0          0        0      wakeup-0=3
dd-4550                       0         31          0          0        0      wakeup-0=31
date-4898                     0          1          0          0        0      wakeup-0=1
cc1-6549                      0          7          0          0        0      wakeup-0=7
as-22202                      0         17          0          0        0      wakeup-0=17
cc1-6495                      0          9          0          0        0      wakeup-0=9
cc1-8299                      0          1          0          0        0      wakeup-0=1
cc1-6009                      0          1          0          0        0      wakeup-0=1
cc1-2574                      0          2          0          0        0      wakeup-0=2
cc1-30568                     0          1          0          0        0      wakeup-0=1
cc1-2679                      0          6          0          0        0      wakeup-0=6
sh-13747                      0         12          0          0        0      wakeup-0=12
cc1-22193                     0         18          0          0        0      wakeup-0=18
cc1-30725                     0          2          0          0        0      wakeup-0=2
as-4392                       0          2          0          0        0      wakeup-0=2
cc1-28180                     0         14          0          0        0      wakeup-0=14
cc1-13697                     0          2          0          0        0      wakeup-0=2
cc1-22207                     0          8          0          0        0      wakeup-0=8
cc1-15270                     0        179          0          0        0      wakeup-0=179
cc1-22011                     0         82          0          0        0      wakeup-0=82
cp-14682                      0          1          0          0        0      wakeup-0=1
as-11926                      0          2          0          0        0      wakeup-0=2
cc1-6016                      0          5          0          0        0      wakeup-0=5
make-18554                    0         13          0          0        0      wakeup-0=13
cc1-8292                      0         12          0          0        0      wakeup-0=12
make-24381                    0          1          0          0        0      wakeup-1=1
date-18681                    0         33          0          0        0      wakeup-0=33
cc1-32276                     0          1          0          0        0      wakeup-0=1
timestamp-outpu-2809          0        253          0          0        0      wakeup-0=240 wakeup-1=13
date-18624                    0          7          0          0        0      wakeup-0=7
cc1-30960                     0          9          0          0        0      wakeup-0=9
cc1-4014                      0          1          0          0        0      wakeup-0=1
cc1-30706                     0         22          0          0        0      wakeup-0=22
uname-3942                    4          1        306          0       17      direct-9=4       wakeup-9=1
cc1-28207                     0          1          0          0        0      wakeup-0=1
cc1-30563                     0          9          0          0        0      wakeup-0=9
cc1-22214                     0         10          0          0        0      wakeup-0=10
cc1-28221                     0         11          0          0        0      wakeup-0=11
cc1-28123                     0          6          0          0        0      wakeup-0=6
kswapd0-311                   0          7     357302          0    34233      wakeup-0=7
cc1-5988                      0          7          0          0        0      wakeup-0=7
as-30734                      0        161          0          0        0      wakeup-0=161
cc1-22004                     0         45          0          0        0      wakeup-0=45
date-4590                     0          4          0          0        0      wakeup-0=4
cc1-15279                     0        213          0          0        0      wakeup-0=213
date-30735                    0          1          0          0        0      wakeup-0=1
cc1-30583                     0          4          0          0        0      wakeup-0=4
cc1-32324                     0          2          0          0        0      wakeup-0=2
cc1-23933                     0          3          0          0        0      wakeup-0=3
cc1-22001                     0         36          0          0        0      wakeup-0=36
bench-stresshig-3942        287        287      80186       6295    12196      direct-9=287       wakeup-9=287
cc1-28170                     0          7          0          0        0      wakeup-0=7
date-7932                     0         92          0          0        0      wakeup-0=92
cc1-22222                     0          6          0          0        0      wakeup-0=6
cc1-32334                     0         16          0          0        0      wakeup-0=16
cc1-2690                      0          6          0          0        0      wakeup-0=6
cc1-30733                     0          9          0          0        0      wakeup-0=9
cc1-32298                     0          2          0          0        0      wakeup-0=2
cc1-13743                     0         18          0          0        0      wakeup-0=18
cc1-22186                     0          4          0          0        0      wakeup-0=4
cc1-28214                     0         11          0          0        0      wakeup-0=11
cc1-13735                     0          1          0          0        0      wakeup-0=1
updatedb-8173                 0         18          0          0        0      wakeup-0=18
cc1-13750                     0          3          0          0        0      wakeup-0=3
cat-2808                      0          2          0          0        0      wakeup-0=2
cc1-15277                     0        169          0          0        0      wakeup-0=169
date-18317                    0          1          0          0        0      wakeup-0=1
cc1-15274                     0        197          0          0        0      wakeup-0=197
cc1-30732                     0          1          0          0        0      wakeup-0=1

Kswapd                   Kswapd      Order      Pages      Pages    Pages
Instance                Wakeups  Re-wakeup    Scanned    Sync-IO ASync-IO
kswapd0-311                  91         24     357302          0    34233      wake-0=31 wake-1=1 wake-9=59       rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims:     		291
Direct reclaim pages scanned:	437794
Direct reclaim write sync I/O:	6295
Direct reclaim write async I/O:	46446
Wake kswapd requests:		2152
Time stalled direct reclaim: 	519.163009000002 ms

Kswapd wakeups:			91
Kswapd pages scanned:		357302
Kswapd reclaim write sync I/O:	0
Kswapd reclaim write async I/O:	34233
Time kswapd awake:		5282.749757 ms
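
Usage note (not part of the patch): the script reads trace_pipe, so the
vmscan events must be enabled first. Assuming debugfs is mounted at
/sys/kernel/debug, a small helper along these lines can switch them on
before trace_pipe is piped into the script:

#!/usr/bin/perl
# Sketch only: enable the vmscan tracepoints before collecting trace_pipe.
use strict;

my $tracing = "/sys/kernel/debug/tracing";
open(my $enable, '>', "$tracing/events/vmscan/enable")
	or die "Cannot enable vmscan events (is debugfs mounted?): $!";
print $enable "1\n";
close($enable);

# Then: ./trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe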

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 1 files changed, 654 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to math stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+			print "         $regex_lru_isolate\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Find the longest process name for column formatting
+	foreach $process_pid (keys %stats) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		$wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		$rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
@ 2010-06-14 11:17   ` Mel Gorman
  0 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

This patch adds a simple post-processing script for the reclaim-related
trace events. It gives an indication of how much traffic there is on
the LRU lists and how severe the reclaim-related latencies are. Example
output is shown below; a small helper for digesting the latency report
further is sketched after it.

Reclaim latencies expressed as order-latency_in_ms
uname-3942             9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311            0-662.097999999998 0-2.79700000002049 \
	0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
	0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
	0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
	0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
	0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
	9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
	9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
	9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
	9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
	9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
	9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
	9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
	9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
	9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
	9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
	9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
	0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
	0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
	1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
	9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
	9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
	9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
	9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942   9-7063.33900000004 9-129960.482 9-2062.27500000002 \
	9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
	9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
	9-5522.37700000009 9-21630.429 \
	9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
	9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
	9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
	9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
	9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
	9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
	9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
	9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
	9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
	9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
	9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
	9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
	9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
	9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
	9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
	9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
	9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
	9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
	9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
	9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
	9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
	9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
	9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
	9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
	9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
	9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
	9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
	9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
	9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
	9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
	9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
	9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
	9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
	9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
	9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
	9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
	9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
	9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
	9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
	9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
	9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
	9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
	9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
	9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
	9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
	9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
	9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
	9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
	9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
	9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
	9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
	9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
	9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
	9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
	9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
	9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
	9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
	9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
	9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
	9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
	9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
	9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
	9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
	9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process                  Direct     Wokeup      Pages      Pages    Pages
details                   Rclms     Kswapd    Scanned    Sync-IO ASync-IO
cc1-30800                     0          1          0          0        0      wakeup-0=1
cc1-24260                     0          1          0          0        0      wakeup-0=1
cc1-24152                     0         12          0          0        0      wakeup-0=12
cc1-8139                      0          1          0          0        0      wakeup-0=1
cc1-4390                      0          1          0          0        0      wakeup-0=1
cc1-4648                      0          7          0          0        0      wakeup-0=7
cc1-4552                      0          3          0          0        0      wakeup-0=3
dd-4550                       0         31          0          0        0      wakeup-0=31
date-4898                     0          1          0          0        0      wakeup-0=1
cc1-6549                      0          7          0          0        0      wakeup-0=7
as-22202                      0         17          0          0        0      wakeup-0=17
cc1-6495                      0          9          0          0        0      wakeup-0=9
cc1-8299                      0          1          0          0        0      wakeup-0=1
cc1-6009                      0          1          0          0        0      wakeup-0=1
cc1-2574                      0          2          0          0        0      wakeup-0=2
cc1-30568                     0          1          0          0        0      wakeup-0=1
cc1-2679                      0          6          0          0        0      wakeup-0=6
sh-13747                      0         12          0          0        0      wakeup-0=12
cc1-22193                     0         18          0          0        0      wakeup-0=18
cc1-30725                     0          2          0          0        0      wakeup-0=2
as-4392                       0          2          0          0        0      wakeup-0=2
cc1-28180                     0         14          0          0        0      wakeup-0=14
cc1-13697                     0          2          0          0        0      wakeup-0=2
cc1-22207                     0          8          0          0        0      wakeup-0=8
cc1-15270                     0        179          0          0        0      wakeup-0=179
cc1-22011                     0         82          0          0        0      wakeup-0=82
cp-14682                      0          1          0          0        0      wakeup-0=1
as-11926                      0          2          0          0        0      wakeup-0=2
cc1-6016                      0          5          0          0        0      wakeup-0=5
make-18554                    0         13          0          0        0      wakeup-0=13
cc1-8292                      0         12          0          0        0      wakeup-0=12
make-24381                    0          1          0          0        0      wakeup-1=1
date-18681                    0         33          0          0        0      wakeup-0=33
cc1-32276                     0          1          0          0        0      wakeup-0=1
timestamp-outpu-2809          0        253          0          0        0      wakeup-0=240 wakeup-1=13
date-18624                    0          7          0          0        0      wakeup-0=7
cc1-30960                     0          9          0          0        0      wakeup-0=9
cc1-4014                      0          1          0          0        0      wakeup-0=1
cc1-30706                     0         22          0          0        0      wakeup-0=22
uname-3942                    4          1        306          0       17      direct-9=4       wakeup-9=1
cc1-28207                     0          1          0          0        0      wakeup-0=1
cc1-30563                     0          9          0          0        0      wakeup-0=9
cc1-22214                     0         10          0          0        0      wakeup-0=10
cc1-28221                     0         11          0          0        0      wakeup-0=11
cc1-28123                     0          6          0          0        0      wakeup-0=6
kswapd0-311                   0          7     357302          0    34233      wakeup-0=7
cc1-5988                      0          7          0          0        0      wakeup-0=7
as-30734                      0        161          0          0        0      wakeup-0=161
cc1-22004                     0         45          0          0        0      wakeup-0=45
date-4590                     0          4          0          0        0      wakeup-0=4
cc1-15279                     0        213          0          0        0      wakeup-0=213
date-30735                    0          1          0          0        0      wakeup-0=1
cc1-30583                     0          4          0          0        0      wakeup-0=4
cc1-32324                     0          2          0          0        0      wakeup-0=2
cc1-23933                     0          3          0          0        0      wakeup-0=3
cc1-22001                     0         36          0          0        0      wakeup-0=36
bench-stresshig-3942        287        287      80186       6295    12196      direct-9=287       wakeup-9=287
cc1-28170                     0          7          0          0        0      wakeup-0=7
date-7932                     0         92          0          0        0      wakeup-0=92
cc1-22222                     0          6          0          0        0      wakeup-0=6
cc1-32334                     0         16          0          0        0      wakeup-0=16
cc1-2690                      0          6          0          0        0      wakeup-0=6
cc1-30733                     0          9          0          0        0      wakeup-0=9
cc1-32298                     0          2          0          0        0      wakeup-0=2
cc1-13743                     0         18          0          0        0      wakeup-0=18
cc1-22186                     0          4          0          0        0      wakeup-0=4
cc1-28214                     0         11          0          0        0      wakeup-0=11
cc1-13735                     0          1          0          0        0      wakeup-0=1
updatedb-8173                 0         18          0          0        0      wakeup-0=18
cc1-13750                     0          3          0          0        0      wakeup-0=3
cat-2808                      0          2          0          0        0      wakeup-0=2
cc1-15277                     0        169          0          0        0      wakeup-0=169
date-18317                    0          1          0          0        0      wakeup-0=1
cc1-15274                     0        197          0          0        0      wakeup-0=197
cc1-30732                     0          1          0          0        0      wakeup-0=1

Kswapd                   Kswapd      Order      Pages      Pages    Pages
Instance                Wakeups  Re-wakeup    Scanned    Sync-IO ASync-IO
kswapd0-311                  91         24     357302          0    34233      wake-0=31 wake-1=1 wake-9=59       rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims:     		291
Direct reclaim pages scanned:	437794
Direct reclaim write sync I/O:	6295
Direct reclaim write async I/O:	46446
Wake kswapd requests:		2152
Time stalled direct reclaim: 	519.163009000002 ms

Kswapd wakeups:			91
Kswapd pages scanned:		357302
Kswapd reclaim write sync I/O:	0
Kswapd reclaim write async I/O:	34233
Time kswapd awake:		5282.749757 ms
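
The latency section of the report is just a stream of order-latency pairs,
so it can be post-processed further with very little effort. As a rough
illustration only (this helper is not part of the patch and simply assumes
the report text above is piped to it on stdin), a throwaway Perl snippet
that picks out the worst stall seen per allocation order might look like:

  #!/usr/bin/perl
  # Hypothetical helper: reduce the "order-latency_in_ms" pairs printed
  # in the latency section of the report to the worst stall per order.
  use strict;
  use warnings;

  my %worst;
  while (my $line = <STDIN>) {
  	# Pairs look like "9-200.179000000004"; pull them out of each line
  	while ($line =~ /\b([0-9]+)-([0-9]+\.[0-9]+)/g) {
  		my ($order, $latency) = ($1, $2);
  		if (!defined $worst{$order} || $latency > $worst{$order}) {
  			$worst{$order} = $latency;
  		}
  	}
  }

  foreach my $order (sort { $a <=> $b } keys %worst) {
  	printf "order-%-2d worst stall %10.3f ms\n", $order, $worst{$order};
  }

The output is one line per order with the single worst stall, which can be
easier to eyeball than the full stream of pairs above.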

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |  654 ++++++++++++++++++++
 1 files changed, 654 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+#   --read-procstat	If the trace lacks process info, get it from /proc
+#   --ignore-pid	Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <mel@csn.ul.ie>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN	=> 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END	=> 2;
+use constant MM_VMSCAN_KSWAPD_WAKE		=> 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP		=> 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE	=> 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE	=> 6;
+use constant MM_VMSCAN_LRU_ISOLATE		=> 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC		=> 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC		=> 9;
+use constant EVENT_UNKNOWN			=> 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER 	=> 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER	=> 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER	=> 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN 		=> 15;
+use constant STATE_DIRECT_ORDER 		=> 16;
+use constant STATE_KSWAPD_BEGIN			=> 17;
+use constant STATE_KSWAPD_ORDER			=> 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY	=> 19;
+use constant HIGH_KSWAPD_LATENCY		=> 20;
+use constant HIGH_KSWAPD_REWAKEUP		=> 21;
+use constant HIGH_NR_SCANNED			=> 22;
+use constant HIGH_NR_TAKEN			=> 23;
+use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_CONTIG_DIRTY		=> 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+	my $current_time = time;
+	if ($current_time - 2 > $sigint_received) {
+		print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+		$sigint_report = 1;
+	} else {
+		if (!$sigint_exit) {
+			print "Second SIGINT received quickly, exiting\n";
+		}
+		$sigint_exit++;
+	}
+
+	if ($sigint_exit > 3) {
+		print "Many SIGINTs received, exiting now without report\n";
+		exit;
+	}
+
+	$sigint_received = $current_time;
+	$sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+	'ignore-pid'	 =>	\$opt_ignorepid,
+	'read-procstat'	 =>	\$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+#                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+	my $event = shift;
+	my $default = shift;
+	my $regex;
+
+	# Read the event format or use the default
+	if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+		print("WARNING: Event $event format string not found\n");
+		return $default;
+	} else {
+		my $line;
+		while (!eof(FORMAT)) {
+			$line = <FORMAT>;
+			$line =~ s/, REC->.*//;
+			if ($line =~ /^print fmt:\s"(.*)".*/) {
+				$regex = $1;
+				$regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+				$regex =~ s/%p/\([0-9a-f]*\)/g;
+				$regex =~ s/%d/\([-0-9]*\)/g;
+				$regex =~ s/%ld/\([-0-9]*\)/g;
+				$regex =~ s/%lu/\([0-9]*\)/g;
+			}
+		}
+	}
+
+	# Can't handle the print_flags stuff but in the context of this
+	# script, it really doesn't matter
+	$regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+	# Verify fields are in the right order
+	my $tuple;
+	foreach $tuple (split /\s/, $regex) {
+		my ($key, $value) = split(/=/, $tuple);
+		my $expected = shift;
+		if ($key ne $expected) {
+			print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+			$regex =~ s/$key=\((.*)\)/$key=$1/;
+		}
+	}
+
+	if (defined shift) {
+		die("Fewer fields than expected in format");
+	}
+
+	return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_begin",
+			$regex_direct_begin_default,
+			"order", "may_writepage",
+			"gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+			"vmscan/mm_vmscan_direct_reclaim_end",
+			$regex_direct_end_default,
+			"nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_wake",
+			$regex_kswapd_wake_default,
+			"nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+			"vmscan/mm_vmscan_kswapd_sleep",
+			$regex_kswapd_sleep_default,
+			"nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+			"vmscan/mm_vmscan_wakeup_kswapd",
+			$regex_wakeup_kswapd_default,
+			"nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_isolate",
+			$regex_lru_isolate_default,
+			"isolate_mode", "order",
+			"nr_requested", "nr_scanned", "nr_taken",
+			"contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_inactive",
+			$regex_lru_shrink_inactive_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+			"vmscan/mm_vmscan_lru_shrink_active",
+			$regex_lru_shrink_active_default,
+			"nid", "zid",
+			"lru",
+			"nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+			"vmscan/mm_vmscan_writepage",
+			$regex_writepage_default,
+			"page", "pfn", "sync_io");
+
+sub read_statline($) {
+	my $pid = $_[0];
+	my $statline;
+
+	if (open(STAT, "/proc/$pid/stat")) {
+		$statline = <STAT>;
+		close(STAT);
+	}
+
+	if ($statline eq '') {
+		$statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+	}
+
+	return $statline;
+}
+
+sub guess_process_pid($$) {
+	my $pid = $_[0];
+	my $statline = $_[1];
+
+	if ($pid == 0) {
+		return "swapper-0";
+	}
+
+	if ($statline !~ /$regex_statname/o) {
+		die("Failed to match stat line for process name :: $statline");
+	}
+	return "$1-$pid";
+}
+
+# Convert a sec.usec timestamp into milliseconds
+sub timestamp_to_ms($) {
+	my $timestamp = $_[0];
+
+	my ($sec, $usec) = split (/\./, $timestamp);
+	return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+	my $traceevent;
+	my $process_pid;
+	my $cpus;
+	my $timestamp;
+	my $tracepoint;
+	my $details;
+	my $statline;
+
+	# Read each line of the event log
+EVENT_PROCESS:
+	while ($traceevent = <STDIN>) {
+		if ($traceevent =~ /$regex_traceevent/o) {
+			$process_pid = $1;
+			$timestamp = $3;
+			$tracepoint = $4;
+
+			$process_pid =~ /(.*)-([0-9]*)$/;
+			my $process = $1;
+			my $pid = $2;
+
+			if ($process eq "") {
+				$process = $last_procmap{$pid};
+				$process_pid = "$process-$pid";
+			}
+			$last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($opt_read_procstat && $process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+		} else {
+			next;
+		}
+
+		# Perl Switch() sucks majorly
+		if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+			$timestamp = timestamp_to_ms($timestamp);
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+			$details = $5;
+			if ($details !~ /$regex_direct_begin/o) {
+				print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+				print "         $details\n";
+				print "         $regex_direct_begin\n";
+				next;
+			}
+			my $order = $1;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+			$perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+		} elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+			# Record how long direct reclaim took this time
+			if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+				my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+				$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+			$details = $5;
+			if ($details !~ /$regex_kswapd_wake/o) {
+				print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+				print "         $details\n";
+				print "         $regex_kswapd_wake\n";
+				next;
+			}
+
+			my $order = $2;
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+			if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+				$timestamp = timestamp_to_ms($timestamp);
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+				$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+				$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+			} else {
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+				$perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+			}
+		} elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+			# Count the event itself
+			my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+			$perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+			# Record how long kswapd was awake
+			$timestamp = timestamp_to_ms($timestamp);
+			my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+			my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+			$perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+			$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+		} elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+			$details = $5;
+			if ($details !~ /$regex_wakeup_kswapd/o) {
+				print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+				print "         $details\n";
+				print "         $regex_wakeup_kswapd\n";
+				next;
+			}
+			my $order = $3;
+			$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+		} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+			$details = $5;
+			if ($details !~ /$regex_lru_isolate/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+				print "         $details\n";
+			print "         $regex_lru_isolate\n";
+				next;
+			}
+			my $nr_scanned = $4;
+			my $nr_contig_dirty = $7;
+			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_writepage") {
+			$details = $5;
+			if ($details !~ /$regex_writepage/o) {
+				print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+				print "         $details\n";
+				print "         $regex_writepage\n";
+				next;
+			}
+
+			my $sync_io = $3;
+			if ($sync_io) {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+			} else {
+				$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+			}
+		} else {
+			$perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+		}
+
+		if ($sigint_pending) {
+			last EVENT_PROCESS;
+		}
+	}
+}
+
+sub dump_stats {
+	my $hashref = shift;
+	my %stats = %$hashref;
+
+	# Dump per-process stats
+	my $process_pid;
+	my $max_strlen = 0;
+
+	# Find the longest process name for column formatting
+	foreach $process_pid (keys %stats) {
+		my $len = length($process_pid);
+		if ($len > $max_strlen) {
+			$max_strlen = $len;
+		}
+	}
+	$max_strlen += 2;
+
+	# Work out latencies
+	printf("\n") if !$opt_ignorepid;
+	printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+				!$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+			next;
+		}
+
+		printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+		my $index = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+			defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+			if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { 
+				printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+				$total_direct_latency += $latency;
+			} else {
+				printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+				my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+				$total_kswapd_latency += $latency;
+			}
+			$index++;
+		}
+		print "\n" if !$opt_ignorepid;
+	}
+
+	# Print out process activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			next;
+		}
+
+		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		my $index = 0;
+		my $this_reclaim_delay = 0;
+		while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+			 my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+			$this_reclaim_delay += $latency;
+			$index++;
+		}
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+			$this_reclaim_delay / 1000);
+
+		if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+				if ($count != 0) {
+					print "direct-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+				if ($count != 0) {
+					print "wakeup-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+			print "      ";
+			my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+			if ($count != 0) {
+				print "contig-dirty=$count ";
+			}
+		}
+
+		print "\n";
+	}
+
+	# Print out kswapd activity
+	printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	foreach $process_pid (keys %stats) {
+
+		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			next;
+		}
+
+		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+			$process_pid,
+			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+		if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+				if ($count != 0) {
+					print "wake-$order=$count ";
+				}
+			}
+		}
+		if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+			print "      ";
+			for (my $order = 0; $order < 20; $order++) {
+				my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+				if ($count != 0) {
+					print "rewake-$order=$count ";
+				}
+			}
+		}
+		printf("\n");
+	}
+
+	# Print out summaries
+	$total_direct_latency /= 1000;
+	$total_kswapd_latency /= 1000;
+	print "\nSummary\n";
+	print "Direct reclaims:     		$total_direct_reclaim\n";
+	print "Direct reclaim pages scanned:	$total_direct_nr_scanned\n";
+	print "Direct reclaim write sync I/O:	$total_direct_writepage_sync\n";
+	print "Direct reclaim write async I/O:	$total_direct_writepage_async\n";
+	print "Wake kswapd requests:		$total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: 	%-1.2f ms\n", $total_direct_latency;
+	print "\n";
+	print "Kswapd wakeups:			$total_kswapd_wake\n";
+	print "Kswapd pages scanned:		$total_kswapd_nr_scanned\n";
+	print "Kswapd reclaim write sync I/O:	$total_kswapd_writepage_sync\n";
+	print "Kswapd reclaim write async I/O:	$total_kswapd_writepage_async\n";
+	printf "Time kswapd awake:		%-1.2f ms\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+	my $process_pid;
+	my $process;
+	undef %perprocess;
+
+	foreach $process_pid (keys %perprocesspid) {
+		$process = $process_pid;
+		$process =~ s/-([0-9])*$//;
+		if ($process eq '') {
+			$process = "NO_PROCESS_NAME";
+		}
+
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		for (my $order = 0; $order < 20; $order++) {
+			$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+			$perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+		}
+
+		# Aggregate direct reclaim latencies
+		my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+		my $rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		$wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		$rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+	}
+}
+
+sub report() {
+	if (!$opt_ignorepid) {
+		dump_stats(\%perprocesspid);
+	} else {
+		aggregate_perprocesspid();
+		dump_stats(\%perprocess);
+	}
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+	my $sigint_processed;
+	do {
+		$sigint_processed = 0;
+		process_events();
+
+		# Handle pending signals if any
+		if ($sigint_pending) {
+			my $current_time = time;
+
+			if ($sigint_exit) {
+				print "Received exit signal\n";
+				$sigint_pending = 0;
+			}
+			if ($sigint_report) {
+				if ($current_time >= $sigint_received + 2) {
+					report();
+					$sigint_report = 0;
+					$sigint_pending = 0;
+					$sigint_processed = 1;
+				}
+			}
+		}
+	} while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

zone->prev_priority has been unused since 2.6.28, so it can be removed
safely. Removing it also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and that it would be useful, but four (or more)
attempts have not produced good performance numbers, so I am giving up
on that approach.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |   15 ------------
 mm/page_alloc.c        |    2 -
 mm/vmscan.c            |   57 ------------------------------------------------
 mm/vmstat.c            |    2 -
 4 files changed, 0 insertions(+), 76 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 
-		zone->prev_priority = DEF_PRIORITY;
-
 		zone_pcp_init(zone);
 		for_each_lru(l) {
 			INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 58527c4..29e1ecd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,20 +1286,6 @@ done:
 }
 
 /*
- * We are about to scan this zone at a certain priority level.  If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone.  This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
-	if (priority < zone->prev_priority)
-		zone->prev_priority = priority;
-}
-
-/*
  * This moves pages from the active list to the inactive list.
  *
  * We move them the other way if the page is referenced by one or more
@@ -1762,17 +1748,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 		if (scanning_global_lru(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-			note_zone_scanning_priority(zone, priority);
-
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-		} else {
-			/*
-			 * Ignore cpuset limitation here. We just want to reduce
-			 * # of used pages by us regardless of memory shortage.
-			 */
-			mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
-							priority);
 		}
 
 		shrink_zone(priority, zone, sc);
@@ -1878,17 +1855,6 @@ out:
 	if (priority < 0)
 		priority = 0;
 
-	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			zone->prev_priority = priority;
-		}
-	} else
-		mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
 	delayacct_freepages_end();
 	put_mems_allowed();
 
@@ -2054,22 +2020,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
 		.order = order,
 		.mem_cgroup = NULL,
 	};
-	/*
-	 * temp_priority is used to remember the scanning priority at which
-	 * this zone was successfully refilled to
-	 * free_pages == high_wmark_pages(zone).
-	 */
-	int temp_priority[MAX_NR_ZONES];
-
 loop_again:
 	total_scanned = 0;
 	sc.nr_reclaimed = 0;
 	sc.may_writepage = !laptop_mode;
 	count_vm_event(PAGEOUTRUN);
 
-	for (i = 0; i < pgdat->nr_zones; i++)
-		temp_priority[i] = DEF_PRIORITY;
-
 	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
 		int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
 		unsigned long lru_pages = 0;
@@ -2137,9 +2093,7 @@ loop_again:
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;
 
-			temp_priority[i] = priority;
 			sc.nr_scanned = 0;
-			note_zone_scanning_priority(zone, priority);
 
 			nid = pgdat->node_id;
 			zid = zone_idx(zone);
@@ -2212,16 +2166,6 @@ loop_again:
 			break;
 	}
 out:
-	/*
-	 * Note within each zone the priority level at which this zone was
-	 * brought into a happy state.  So that the next thread which scans this
-	 * zone will start out at that priority level.
-	 */
-	for (i = 0; i < pgdat->nr_zones; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		zone->prev_priority = temp_priority[i];
-	}
 	if (!all_zones_ok) {
 		cond_resched();
 
@@ -2641,7 +2585,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 		 */
 		priority = ZONE_RECLAIM_PRIORITY;
 		do {
-			note_zone_scanning_priority(zone, priority);
 			shrink_zone(priority, zone, &sc);
 			priority--;
 		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 	}
 	seq_printf(m,
 		   "\n  all_unreclaimable: %u"
-		   "\n  prev_priority:     %i"
 		   "\n  start_pfn:         %lu"
 		   "\n  inactive_ratio:    %u",
 		   zone->all_unreclaimable,
-		   zone->prev_priority,
 		   zone->zone_start_pfn,
 		   zone->inactive_ratio);
 	seq_putc(m, '\n');
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 06/12] vmscan: simplify shrink_inactive_list()
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

Now, max_scan of shrink_inactive_list() is never larger than
SWAP_CLUSTER_MAX, so the loop that scans pages in batches can be
removed. This also helps reduce stack usage.

Details:
 - remove the "while (nr_scanned < max_scan)" loop
 - remove nr_freed (nr_reclaimed is now used directly)
 - remove nr_scan (nr_scanned is now used directly)
 - rename max_scan to nr_to_scan
 - pass nr_to_scan into isolate_pages() directly instead of
   using SWAP_CLUSTER_MAX

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |  211 ++++++++++++++++++++++++++++-------------------------------
 1 files changed, 100 insertions(+), 111 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 29e1ecd..6a42ebc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1132,15 +1132,21 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
+	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1155,129 +1161,112 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
-		unsigned long nr_anon;
-		unsigned long nr_file;
 
-		if (scanning_global_lru(sc)) {
-			nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, 0, file);
-			zone->pages_scanned += nr_scan;
-			if (current_is_kswapd())
-				__count_zone_vm_events(PGSCAN_KSWAPD, zone,
-						       nr_scan);
-			else
-				__count_zone_vm_events(PGSCAN_DIRECT, zone,
-						       nr_scan);
-		} else {
-			nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
-							&page_list, &nr_scan,
-							sc->order, mode,
-							zone, sc->mem_cgroup,
-							0, file);
-			/*
-			 * mem_cgroup_isolate_pages() keeps track of
-			 * scanned pages on its own.
-			 */
-		}
+	if (scanning_global_lru(sc)) {
+		nr_taken = isolate_pages_global(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, 0, file);
+		zone->pages_scanned += nr_scanned;
+		if (current_is_kswapd())
+			__count_zone_vm_events(PGSCAN_KSWAPD, zone,
+					       nr_scanned);
+		else
+			__count_zone_vm_events(PGSCAN_DIRECT, zone,
+					       nr_scanned);
+	} else {
+		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+			&page_list, &nr_scanned, sc->order,
+			sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+			zone, sc->mem_cgroup,
+			0, file);
+		/*
+		 * mem_cgroup_isolate_pages() keeps track of
+		 * scanned pages on its own.
+		 */
+	}
 
-		if (nr_taken == 0)
-			goto done;
+	if (nr_taken == 0)
+		goto done;
 
-		nr_active = clear_active_flags(&page_list, count);
-		__count_vm_events(PGDEACTIVATE, nr_active);
+	nr_active = clear_active_flags(&page_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
 
-		__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-						-count[LRU_ACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-						-count[LRU_INACTIVE_FILE]);
-		__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-						-count[LRU_ACTIVE_ANON]);
-		__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-						-count[LRU_INACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+					-count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+					-count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+					-count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+					-count[LRU_INACTIVE_ANON]);
 
-		nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-		nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
 
-		reclaim_stat->recent_scanned[0] += nr_anon;
-		reclaim_stat->recent_scanned[1] += nr_file;
+	reclaim_stat->recent_scanned[0] += nr_anon;
+	reclaim_stat->recent_scanned[1] += nr_file;
 
-		spin_unlock_irq(&zone->lru_lock);
+	spin_unlock_irq(&zone->lru_lock);
 
-		nr_scanned += nr_scan;
-		nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+	/*
+	 * If we are direct reclaiming for contiguous pages and we do
+	 * not reclaim everything in the list, try again and wait
+	 * for IO to complete. This will stall high-order allocations
+	 * but that should be acceptable to the caller
+	 */
+	if (nr_reclaimed < nr_taken && !current_is_kswapd() && sc->lumpy_reclaim_mode) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-		 * If we are direct reclaiming for contiguous pages and we do
-		 * not reclaim everything in the list, try again and wait
-		 * for IO to complete. This will stall high-order allocations
-		 * but that should be acceptable to the caller
+		 * The attempt at page out may have made some
+		 * of the pages active, mark them inactive again.
 		 */
-		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim_mode) {
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-			/*
-			 * The attempt at page out may have made some
-			 * of the pages active, mark them inactive again.
-			 */
-			nr_active = clear_active_flags(&page_list, count);
-			count_vm_events(PGDEACTIVATE, nr_active);
-
-			nr_freed += shrink_page_list(&page_list, sc,
-							PAGEOUT_IO_SYNC);
-		}
+		nr_active = clear_active_flags(&page_list, count);
+		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += nr_freed;
+		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+	}
 
-		local_irq_disable();
-		if (current_is_kswapd())
-			__count_vm_events(KSWAPD_STEAL, nr_freed);
-		__count_zone_vm_events(PGSTEAL, zone, nr_freed);
+	local_irq_disable();
+	if (current_is_kswapd())
+		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-		spin_lock(&zone->lru_lock);
-		/*
-		 * Put back any unfreeable pages.
-		 */
-		while (!list_empty(&page_list)) {
-			int lru;
-			page = lru_to_page(&page_list);
-			VM_BUG_ON(PageLRU(page));
-			list_del(&page->lru);
-			if (unlikely(!page_evictable(page, NULL))) {
-				spin_unlock_irq(&zone->lru_lock);
-				putback_lru_page(page);
-				spin_lock_irq(&zone->lru_lock);
-				continue;
-			}
-			SetPageLRU(page);
-			lru = page_lru(page);
-			add_page_to_lru_list(zone, page, lru);
-			if (is_active_lru(lru)) {
-				int file = is_file_lru(lru);
-				reclaim_stat->recent_rotated[file]++;
-			}
-			if (!pagevec_add(&pvec, page)) {
-				spin_unlock_irq(&zone->lru_lock);
-				__pagevec_release(&pvec);
-				spin_lock_irq(&zone->lru_lock);
-			}
+	spin_lock(&zone->lru_lock);
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	while (!list_empty(&page_list)) {
+		int lru;
+		page = lru_to_page(&page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
 		}
-		__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-		__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-  	} while (nr_scanned < max_scan);
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
 
 done:
 	spin_unlock_irq(&zone->lru_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 07/12] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

Remove a temporary variable that is only used once and does not help
clarify the code.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a42ebc..246e084 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1721,13 +1721,12 @@ static void shrink_zone(int priority, struct zone *zone,
 static bool shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	struct zoneref *z;
 	struct zone *zone;
 	bool all_unreclaimable = true;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
-					sc->nodemask) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1773,7 +1772,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	struct zoneref *z;
 	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long writeback_threshold;
 
 	get_mems_allowed();
@@ -1785,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scanning_global_lru(sc)) {
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

shrink_inactive_list() sets up a pagevec to release unfreeable pages,
which consumes a significant amount of stack. This patch splits
shrink_inactive_list() so that the stack usage is moved out of the main
path and the path that calls writepage() no longer carries an unused
pagevec on the stack.
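
For anyone unfamiliar with the trick, a minimal user-space sketch of the
same idea is below (illustrative only, not kernel code and not part of
the patch): moving a large local into a separate noinline helper keeps
it off the caller's stack frame, so the buffer only occupies stack for
the duration of the helper call rather than during the deep call chain.

	#include <stdio.h>

	static void deep_call_chain(void)
	{
		/* stands in for ->writepage() and everything below it */
		printf("deep call chain runs with a small caller frame\n");
	}

	/* the helper owns the large local; noinline keeps it out of main() */
	static __attribute__((noinline)) void drain_buffer(void)
	{
		char buf[512];	/* the "pagevec": live only inside this call */

		snprintf(buf, sizeof(buf), "released batched pages");
		printf("%s\n", buf);
	}

	int main(void)
	{
		deep_call_chain();	/* hot path: no 512-byte buffer on the stack */
		drain_buffer();		/* cold path: the buffer exists only now */
		return 0;
	}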

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   95 +++++++++++++++++++++++++++++++++-------------------------
 1 files changed, 54 insertions(+), 41 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 246e084..34c5c87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,19 +1129,65 @@ static int too_many_isolated(struct zone *zone, int file,
 }
 
 /*
+ * TODO: Try merging with migrations version of putback_lru_pages
+ */
+static noinline_for_stack void putback_lru_pages(struct zone *zone,
+				struct zone_reclaim_stat *reclaim_stat,
+				unsigned long nr_anon, unsigned long nr_file,
+ 				struct list_head *page_list)
+{
+	struct page *page;
+	struct pagevec pvec;
+
+	pagevec_init(&pvec, 1);
+
+	/*
+	 * Put back any unfreeable pages.
+	 */
+	spin_lock(&zone->lru_lock);
+	while (!list_empty(page_list)) {
+		int lru;
+		page = lru_to_page(page_list);
+		VM_BUG_ON(PageLRU(page));
+		list_del(&page->lru);
+		if (unlikely(!page_evictable(page, NULL))) {
+			spin_unlock_irq(&zone->lru_lock);
+			putback_lru_page(page);
+			spin_lock_irq(&zone->lru_lock);
+			continue;
+		}
+		SetPageLRU(page);
+		lru = page_lru(page);
+		add_page_to_lru_list(zone, page, lru);
+		if (is_active_lru(lru)) {
+			int file = is_file_lru(lru);
+			reclaim_stat->recent_rotated[file]++;
+		}
+		if (!pagevec_add(&pvec, page)) {
+			spin_unlock_irq(&zone->lru_lock);
+			__pagevec_release(&pvec);
+			spin_lock_irq(&zone->lru_lock);
+		}
+	}
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+
+	spin_unlock_irq(&zone->lru_lock);
+	pagevec_release(&pvec);
+}
+
+/*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
+static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int priority, int file)
 {
 	LIST_HEAD(page_list);
-	struct pagevec pvec;
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	struct page *page;
 	unsigned long nr_taken;
 	unsigned long nr_active;
 	unsigned int count[NR_LRU_LISTS] = { 0, };
@@ -1157,8 +1203,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	}
 
 
-	pagevec_init(&pvec, 1);
-
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1186,8 +1230,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		 */
 	}
 
-	if (nr_taken == 0)
-		goto done;
+	if (nr_taken == 0) {
+		spin_unlock_irq(&zone->lru_lock);
+		return 0;
+	}
 
 	nr_active = clear_active_flags(&page_list, count);
 	__count_vm_events(PGDEACTIVATE, nr_active);
@@ -1237,40 +1283,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	spin_lock(&zone->lru_lock);
-	/*
-	 * Put back any unfreeable pages.
-	 */
-	while (!list_empty(&page_list)) {
-		int lru;
-		page = lru_to_page(&page_list);
-		VM_BUG_ON(PageLRU(page));
-		list_del(&page->lru);
-		if (unlikely(!page_evictable(page, NULL))) {
-			spin_unlock_irq(&zone->lru_lock);
-			putback_lru_page(page);
-			spin_lock_irq(&zone->lru_lock);
-			continue;
-		}
-		SetPageLRU(page);
-		lru = page_lru(page);
-		add_page_to_lru_list(zone, page, lru);
-		if (is_active_lru(lru)) {
-			int file = is_file_lru(lru);
-			reclaim_stat->recent_rotated[file]++;
-		}
-		if (!pagevec_add(&pvec, page)) {
-			spin_unlock_irq(&zone->lru_lock);
-			__pagevec_release(&pvec);
-			spin_lock_irq(&zone->lru_lock);
-		}
-	}
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-done:
-	spin_unlock_irq(&zone->lru_lock);
-	pagevec_release(&pvec);
+	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

shrink_page_list() sets up a pagevec to release pages as they are freed,
which consumes a significant amount of stack. This patch instead adds
pages to be freed to a linked list which is then freed en masse at the
end. This avoids using stack in the main path that potentially calls
writepage().
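
As an illustration of the pattern (a user-space sketch only, not part of
the patch): freeable items are chained onto a list head as they are
encountered, with no large buffer held on the stack in the hot loop, and
the whole list is released in one pass afterwards.

	#include <stdio.h>
	#include <stdlib.h>

	struct node {
		struct node *next;
		int id;
	};

	/* release everything that was queued, in one pass at the end */
	static void free_all(struct node *head)
	{
		while (head) {
			struct node *next = head->next;

			printf("freeing node %d\n", head->id);
			free(head);
			head = next;
		}
	}

	int main(void)
	{
		struct node *free_list = NULL;
		int i;

		/* hot path: only a pointer on the stack, no fixed-size array */
		for (i = 0; i < 5; i++) {
			struct node *n = malloc(sizeof(*n));

			if (!n)
				break;
			n->id = i;
			n->next = free_list;
			free_list = n;
		}

		free_all(free_list);
		return 0;
	}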

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++++--------
 1 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34c5c87..165c2f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,6 +620,25 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+	struct pagevec freed_pvec;
+	struct page *page, *tmp;
+
+	pagevec_init(&freed_pvec, 1);
+
+	list_for_each_entry_safe(page, tmp, free_pages, lru) {
+		list_del(&page->lru);
+		if (!pagevec_add(&freed_pvec, page)) {
+			__pagevec_free(&freed_pvec);
+			pagevec_reinit(&freed_pvec);
+		}
+	}
+
+	if (pagevec_count(&freed_pvec))
+		__pagevec_free(&freed_pvec);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -628,13 +647,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
+	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -809,10 +827,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+
+		/*
+		 * Is there need to periodically free_page_list? It would
+		 * appear not as the counts should be low
+		 */
+		list_add(&page->lru, &free_pages);
 		continue;
 
 cull_mlocked:
@@ -835,9 +855,10 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
+
+	free_page_list(&free_pages);
+
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 10/12] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

When shrink_inactive_list() isolates pages, it updates a number of
counters using temporary variables to gather them. These temporaries
consume stack in the main path that calls ->writepage(). This patch
moves the accounting updates outside of the main path to reduce stack
usage.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   63 +++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 165c2f5..019f0af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1073,7 +1073,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1153,12 +1154,13 @@ static int too_many_isolated(struct zone *zone, int file,
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void putback_lru_pages(struct zone *zone,
-				struct zone_reclaim_stat *reclaim_stat,
+				struct scan_control *sc,
 				unsigned long nr_anon, unsigned long nr_file,
  				struct list_head *page_list)
 {
 	struct page *page;
 	struct pagevec pvec;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	pagevec_init(&pvec, 1);
 
@@ -1197,6 +1199,37 @@ static noinline_for_stack void putback_lru_pages(struct zone *zone,
 	pagevec_release(&pvec);
 }
 
+static noinline_for_stack void update_isolated_counts(struct zone *zone, 
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
+
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
+
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
+
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1208,10 +1241,8 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	unsigned long nr_taken;
 	unsigned long nr_active;
-	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
@@ -1256,25 +1287,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 		return 0;
 	}
 
-	nr_active = clear_active_flags(&page_list, count);
-	__count_vm_events(PGDEACTIVATE, nr_active);
-
-	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-					-count[LRU_ACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-					-count[LRU_INACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-					-count[LRU_ACTIVE_ANON]);
-	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-					-count[LRU_INACTIVE_ANON]);
-
-	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
-
-	reclaim_stat->recent_scanned[0] += nr_anon;
-	reclaim_stat->recent_scanned[1] += nr_file;
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
 	spin_unlock_irq(&zone->lru_lock);
 
@@ -1293,7 +1306,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 		 * The attempt at page out may have made some
 		 * of the pages active, mark them inactive again.
 		 */
-		nr_active = clear_active_flags(&page_list, count);
+		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
 		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
@@ -1304,7 +1317,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 10/12] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()
@ 2010-06-14 11:17   ` Mel Gorman
  0 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

When shrink_inactive_list() isolates pages, it updates a number of
counters using temporary variables to gather them. These consume stack
and it's in the main path that calls ->writepage(). This patch moves the
accounting updates outside of the main path to reduce stack usage.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   63 +++++++++++++++++++++++++++++++++++-----------------------
 1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 165c2f5..019f0af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1073,7 +1073,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
 			ClearPageActive(page);
 			nr_active++;
 		}
-		count[lru]++;
+		if (count)
+			count[lru]++;
 	}
 
 	return nr_active;
@@ -1153,12 +1154,13 @@ static int too_many_isolated(struct zone *zone, int file,
  * TODO: Try merging with migrations version of putback_lru_pages
  */
 static noinline_for_stack void putback_lru_pages(struct zone *zone,
-				struct zone_reclaim_stat *reclaim_stat,
+				struct scan_control *sc,
 				unsigned long nr_anon, unsigned long nr_file,
  				struct list_head *page_list)
 {
 	struct page *page;
 	struct pagevec pvec;
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 
 	pagevec_init(&pvec, 1);
 
@@ -1197,6 +1199,37 @@ static noinline_for_stack void putback_lru_pages(struct zone *zone,
 	pagevec_release(&pvec);
 }
 
+static noinline_for_stack void update_isolated_counts(struct zone *zone, 
+					struct scan_control *sc,
+					unsigned long *nr_anon,
+					unsigned long *nr_file,
+					struct list_head *isolated_list)
+{
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+	nr_active = clear_active_flags(isolated_list, count);
+	__count_vm_events(PGDEACTIVATE, nr_active);
+
+	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
+			      -count[LRU_ACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
+			      -count[LRU_INACTIVE_FILE]);
+	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
+			      -count[LRU_ACTIVE_ANON]);
+	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
+			      -count[LRU_INACTIVE_ANON]);
+
+	*nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	*nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
+
+	reclaim_stat->recent_scanned[0] += *nr_anon;
+	reclaim_stat->recent_scanned[1] += *nr_file;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1208,10 +1241,8 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 	LIST_HEAD(page_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
-	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	unsigned long nr_taken;
 	unsigned long nr_active;
-	unsigned int count[NR_LRU_LISTS] = { 0, };
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
@@ -1256,25 +1287,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 		return 0;
 	}
 
-	nr_active = clear_active_flags(&page_list, count);
-	__count_vm_events(PGDEACTIVATE, nr_active);
-
-	__mod_zone_page_state(zone, NR_ACTIVE_FILE,
-					-count[LRU_ACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_INACTIVE_FILE,
-					-count[LRU_INACTIVE_FILE]);
-	__mod_zone_page_state(zone, NR_ACTIVE_ANON,
-					-count[LRU_ACTIVE_ANON]);
-	__mod_zone_page_state(zone, NR_INACTIVE_ANON,
-					-count[LRU_INACTIVE_ANON]);
-
-	nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
-	nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
-	__mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
-	__mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
-
-	reclaim_stat->recent_scanned[0] += nr_anon;
-	reclaim_stat->recent_scanned[1] += nr_file;
+	update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);
 
 	spin_unlock_irq(&zone->lru_lock);
 
@@ -1293,7 +1306,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 		 * The attempt at page out may have made some
 		 * of the pages active, mark them inactive again.
 		 */
-		nr_active = clear_active_flags(&page_list, count);
+		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
 		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
@@ -1304,7 +1317,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
-	putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
+	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 	return nr_reclaimed;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

Page reclaim cleans individual pages using a_ops->writepage() because, from
the VM's perspective, pages in a particular zone must be freed soon, the
target page is considered the oldest and the VM does not want to wait while
background flushers clean other pages. From a filesystem perspective this is
extremely inefficient as it generates a very seeky IO pattern, leading to the
perverse situation where it can take longer to clean all dirty pages than it
would have otherwise.

This patch queues all dirty pages at once to maximise the chances that
the write requests get merged efficiently. It also makes the next patch,
which avoids writeout from direct reclaim, more straightforward.
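
To make the diff easier to follow, here is a simplified sketch of the flow it
introduces. This is illustrative only: the function name is invented and the
locking, statistics and page-reference handling of the real shrink_page_list()
are left out; clean_page_list() is the helper added below.

/*
 * Illustrative sketch of the two-pass scheme: dirty pages are deferred to
 * a private list on the first pass, written out as one batch and then fed
 * through the normal reclaim path a second time.
 */
static void batch_dirty_pages_sketch(struct list_head *page_list,
				     struct scan_control *sc)
{
	LIST_HEAD(dirty_pages);
	struct page *page, *next;

	/* Pass 1: defer dirty pages instead of calling ->writepage() inline */
	list_for_each_entry_safe(page, next, page_list, lru)
		if (PageDirty(page))
			list_move(&page->lru, &dirty_pages);

	/* Queue IO for the whole batch so adjacent requests can be merged */
	clean_page_list(&dirty_pages, sc);

	/* Pass 2: splice the (hopefully now clean) pages back and rescan */
	list_splice(&dirty_pages, page_list);
}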

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |  175 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 131 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 019f0af..4856a2a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,55 @@ typedef enum {
 	PAGE_CLEAN,
 } pageout_t;
 
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+						enum pageout_io sync_writeback)
+{
+	int res;
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = SWAP_CLUSTER_MAX,
+		.range_start = 0,
+		.range_end = LLONG_MAX,
+		.nonblocking = 1,
+		.for_reclaim = 1,
+	};
+
+	if (!clear_page_dirty_for_io(page))
+		return PAGE_CLEAN;
+
+	SetPageReclaim(page);
+	res = mapping->a_ops->writepage(page, &wbc);
+	/*
+	 * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
+	 * the page lock has been dropped by ->writepage, that mapping could
+	 * be anything
+	 */
+	if (res < 0)
+		handle_write_error(mapping, page, res);
+	if (res == AOP_WRITEPAGE_ACTIVATE) {
+		ClearPageReclaim(page);
+		return PAGE_ACTIVATE;
+	}
+
+	/*
+	 * Wait on writeback if requested to. This happens when
+	 * direct reclaiming a large contiguous area and the
+	 * first attempt to free a range of pages fails.
+	 */
+	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		wait_on_page_writeback(page);
+
+	if (!PageWriteback(page)) {
+		/* synchronous write or broken a_ops? */
+		ClearPageReclaim(page);
+	}
+	trace_mm_vmscan_writepage(page,
+		sync_writeback == PAGEOUT_IO_SYNC);
+	inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+	return PAGE_SUCCESS;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -367,45 +416,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	if (!may_write_to_queue(mapping->backing_dev_info))
 		return PAGE_KEEP;
 
-	if (clear_page_dirty_for_io(page)) {
-		int res;
-		struct writeback_control wbc = {
-			.sync_mode = WB_SYNC_NONE,
-			.nr_to_write = SWAP_CLUSTER_MAX,
-			.range_start = 0,
-			.range_end = LLONG_MAX,
-			.nonblocking = 1,
-			.for_reclaim = 1,
-		};
-
-		SetPageReclaim(page);
-		res = mapping->a_ops->writepage(page, &wbc);
-		if (res < 0)
-			handle_write_error(mapping, page, res);
-		if (res == AOP_WRITEPAGE_ACTIVATE) {
-			ClearPageReclaim(page);
-			return PAGE_ACTIVATE;
-		}
-
-		/*
-		 * Wait on writeback if requested to. This happens when
-		 * direct reclaiming a large contiguous area and the
-		 * first attempt to free a range of pages fails.
-		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
-			wait_on_page_writeback(page);
-
-		if (!PageWriteback(page)) {
-			/* synchronous write or broken a_ops? */
-			ClearPageReclaim(page);
-		}
-		trace_mm_vmscan_writepage(page,
-			sync_writeback == PAGEOUT_IO_SYNC);
-		inc_zone_page_state(page, NR_VMSCAN_WRITE);
-		return PAGE_SUCCESS;
-	}
-
-	return PAGE_CLEAN;
+	return write_reclaim_page(page, mapping, sync_writeback);
 }
 
 /*
@@ -640,19 +651,75 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 }
 
 /*
+ * Clean a list of pages. It is expected that all the pages on page_list have been
+ * locked as part of isolation from the LRU.
+ *
+ * XXX: Is there a problem with holding multiple page locks like this?
+ */
+static noinline_for_stack void clean_page_list(struct list_head *page_list,
+				struct scan_control *sc)
+{
+	LIST_HEAD(ret_pages);
+	struct page *page;
+
+	if (!sc->may_writepage)
+		return;
+
+	/* Write the pages out to disk in ranges where possible */
+	while (!list_empty(page_list)) {
+		struct address_space *mapping;
+		bool may_enter_fs;
+
+		page = lru_to_page(page_list);
+		list_del(&page->lru);
+		list_add(&page->lru, &ret_pages);
+
+		mapping = page_mapping(page);
+		if (!mapping || !may_write_to_queue(mapping->backing_dev_info)) {
+			unlock_page(page);
+			continue;
+		}
+
+		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
+			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+		if (!may_enter_fs) {
+			unlock_page(page);
+			continue;
+		}
+
+		/* Write single page */
+		switch (write_reclaim_page(page, mapping, PAGEOUT_IO_ASYNC)) {
+		case PAGE_KEEP:
+		case PAGE_ACTIVATE:
+		case PAGE_CLEAN:
+			unlock_page(page);
+			break;
+		case PAGE_SUCCESS:
+			break;
+		}
+	}
+	list_splice(&ret_pages, page_list);
+}
+
+/*
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
 					enum pageout_io sync_writeback)
 {
-	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
-	int pgactivate = 0;
+	LIST_HEAD(putback_pages);
+	LIST_HEAD(dirty_pages);
+	struct list_head *ret_list = page_list;
+	int pgactivate;
+	bool cleaned = false;
 	unsigned long nr_reclaimed = 0;
 
+	pgactivate = 0;
 	cond_resched();
 
+restart_dirty:
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -741,7 +808,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			}
 		}
 
-		if (PageDirty(page)) {
+		if (PageDirty(page))  {
+			/*
+			 * On the first pass, dirty pages are put on a separate
+			 * list. IO is then queued based on ranges of pages for
+			 * each unique mapping in the list
+			 */
+			if (!cleaned) {
+				/* Keep locked for clean_page_list */
+				list_add(&page->lru, &dirty_pages);
+				goto keep_dirty;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -852,13 +930,22 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
-		list_add(&page->lru, &ret_pages);
+		list_add(&page->lru, &putback_pages);
+keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
+	if (!cleaned && !list_empty(&dirty_pages)) {
+		clean_page_list(&dirty_pages, sc);
+		page_list = &dirty_pages;
+		cleaned = true;
+		goto restart_dirty;
+	}
+	BUG_ON(!list_empty(&dirty_pages));
+
 	free_page_list(&free_pages);
 
-	list_splice(&ret_pages, page_list);
+	list_splice(&putback_pages, ret_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 11:17   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-14 11:17 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-mm
  Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki,
	Andrew Morton, Mel Gorman

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, it is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. Stack overflows have already
been observed on XFS, but the problem is not XFS-specific.

This patch prevents direct reclaim writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists for either background writing by the BDI threads or kswapd. If
dirty pages are encountered during direct lumpy reclaim, the process kicks
the background flusher threads before trying again.

Memory control groups do not have a kswapd-like thread, nor do their pages
get direct-reclaimed from the page allocator. Instead, memory control group
pages are reclaimed when the quota is being exceeded or the group is being
shrunk. As the entry points into page reclaim are not expected to be deep
call chains, memcg is still allowed to write back dirty pages.
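
Put another way, the writeback policy reduces to a single predicate. The
helper below is the one added by this patch, reproduced in isolation for
clarity:

/* kswapd and memcg reclaim may enter ->writepage(); direct reclaim may not */
static inline bool reclaim_can_writeback(struct scan_control *sc)
{
	return current_is_kswapd() || sc->mem_cgroup != NULL;
}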

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   76 ++++++++++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 57 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4856a2a..574e816 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
 	return PAGE_SUCCESS;
 }
 
+/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+static inline bool reclaim_can_writeback(struct scan_control *sc)
+{
+	return current_is_kswapd() || sc->mem_cgroup != NULL;
+}
+
 /*
  * pageout is called by shrink_page_list() for each dirty page.
  * Calls ->writepage().
@@ -701,6 +707,9 @@ static noinline_for_stack void clean_page_list(struct list_head *page_list,
 	list_splice(&ret_pages, page_list);
 }
 
+/* Direct lumpy reclaim waits up to a second for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 10
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -711,15 +720,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 	LIST_HEAD(free_pages);
 	LIST_HEAD(putback_pages);
 	LIST_HEAD(dirty_pages);
-	struct list_head *ret_list = page_list;
 	int pgactivate;
-	bool cleaned = false;
+	int cleaned = 0;
+	unsigned long nr_dirty;
 	unsigned long nr_reclaimed = 0;
 
 	pgactivate = 0;
 	cond_resched();
 
 restart_dirty:
+	nr_dirty = 0;
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -811,12 +821,17 @@ restart_dirty:
 		if (PageDirty(page))  {
 			/*
 			 * On the first pass, dirty pages are put on a separate
-			 * list. IO is then queued based on ranges of pages for
-			 * each unique mapping in the list
+			 * list. If kswapd, IO is then queued based on ranges of
+			 * pages for each unique mapping in the list. Direct
+			 * reclaimers put the dirty pages back on the list for
+			 * cleaning by kswapd
 			 */
-			if (!cleaned) {
-				/* Keep locked for clean_page_list */
+			if (cleaned < MAX_SWAP_CLEAN_WAIT) {
+				/* Keep locked for kswapd to call clean_page_list */
+				if (!reclaim_can_writeback(sc))
+					unlock_page(page);
 				list_add(&page->lru, &dirty_pages);
+				nr_dirty++;
 				goto keep_dirty;
 			}
 
@@ -935,17 +950,41 @@ keep_dirty:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
-	if (!cleaned && !list_empty(&dirty_pages)) {
-		clean_page_list(&dirty_pages, sc);
-		page_list = &dirty_pages;
-		cleaned = true;
-		goto restart_dirty;
+	if (cleaned < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
+		/*
+		 * Only kswapd cleans pages. Direct reclaimers entering the filesystem
+		 * potentially splices two expensive call-chains and busts the stack
+		 * so instead they go to sleep to give background cleaning a chance
+		 */
+		list_splice(&dirty_pages, page_list);
+		INIT_LIST_HEAD(&dirty_pages);
+		if (reclaim_can_writeback(sc)) {
+			cleaned = MAX_SWAP_CLEAN_WAIT;
+			clean_page_list(page_list, sc);
+			goto restart_dirty;
+		} else {
+			cleaned++;
+			/*
+			 * If lumpy reclaiming, kick the background flusher and wait
+			 * for the pages to be cleaned
+			 *
+		 	 * XXX: kswapd won't find these isolated pages but the
+		 	 * 	background flusher does not prioritise pages. It'd
+		 	 * 	be nice to prioritise a list of pages somehow
+		 	 */
+			if (sync_writeback == PAGEOUT_IO_SYNC) {
+				wakeup_flusher_threads(nr_dirty);
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				goto restart_dirty;
+			}
+		}
 	}
-	BUG_ON(!list_empty(&dirty_pages));
 
 	free_page_list(&free_pages);
 
-	list_splice(&putback_pages, ret_list);
+	list_splice(&dirty_pages, page_list);
+	list_splice(&putback_pages, page_list);
+
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1954,10 +1993,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
+		if (total_scanned > writeback_threshold)
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-			sc->may_writepage = 1;
-		}
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1995,7 +2032,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.may_unmap = 1,
 		.may_swap = 1,
@@ -2024,7 +2061,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						struct zone *zone, int nid)
 {
 	struct scan_control sc = {
-		.may_writepage = !laptop_mode,
+		.may_writepage = 0,
 		.may_unmap = 1,
 		.may_swap = !noswap,
 		.swappiness = swappiness,
@@ -2676,7 +2713,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	struct reclaim_state reclaim_state;
 	int priority;
 	struct scan_control sc = {
-		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+		.may_writepage = (current_is_kswapd() &&
+					(zone_reclaim_mode & RECLAIM_WRITE)),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
 		.may_swap = 1,
 		.nr_to_reclaim = max_t(unsigned long, nr_pages,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 198+ messages in thread

* Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-14 15:10   ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-14 15:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Jun 14, 2010 at 12:17:41PM +0100, Mel Gorman wrote:
> This is a merging of two series - the first of which reduces stack usage
> in page reclaim and the second which writes contiguous pages during reclaim
> and avoids writeback in direct reclaimers.

This stuff looks good to me from the filesystem POV.

You might want to throw in a follow-on patch to remove the PF_MEMALLOC
checks from the various ->writepage methods.
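
For reference, those checks follow a common pattern in which a ->writepage()
implementation refuses writeback when called from reclaim. The sketch below is
a hypothetical illustration of the kind of guard such a follow-on patch would
remove; the function names are invented and no specific filesystem is being
quoted.

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * Guard against being called from page reclaim: keep the page
	 * dirty and let the flusher threads deal with it later. With
	 * direct reclaim no longer calling ->writepage(), checks like
	 * this become candidates for removal.
	 */
	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* example_get_block is a made-up get_block_t for illustration */
	return block_write_full_page(page, example_get_block, wbc);
}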


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 15:45     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 15:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds two trace events for kswapd waking up and going asleep for
> the purposes of tracking kswapd activity and two trace events for direct
> reclaim beginning and ending. The information can be used to work out how
> much time a process or the system is spending on the reclamation of pages
> and in the case of direct reclaim, how many pages were reclaimed for that
> process.  High frequency triggering of these events could point to memory
> pressure problems.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 16:47     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 16:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds an event for when pages are isolated en-masse from the
> LRU lists. This event augments the information available on LRU traffic
> and can be used to evaluate lumpy reclaim.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 03/12] tracing, vmscan: Add trace event when a page is written
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 16:48     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 16:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds a trace event for when page reclaim queues a page for IO and
> records whether it is synchronous or asynchronous. Excessive synchronous
> IO for a process can result in noticeable stalls during direct reclaim.
> Excessive IO from page reclaim may indicate that the system is seriously
> under provisioned for the amount of dirty pages that exist.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 17:55     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 17:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds a simple post-processing script for the reclaim-related
> trace events.  It can be used to give an indication of how much traffic
> there is on the LRU lists and how severe latencies due to reclaim are.
> Example output looks like the following

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 18:04     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 18:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> From: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>
> Since 2.6.28 zone->prev_priority is unused. Then it can be removed
> safely. It reduce stack usage slightly.
>
> Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
> can be integrate again, it's useful. but four (or more) times trying
> haven't got good performance number. Thus I give up such approach.
>
> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Johannes Weiner<hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 06/12] vmscan: simplify shrink_inactive_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 18:06     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 18:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> From: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>
> Now, max_scan of shrink_inactive_list() is always passed less than
> SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
> This patch also help stack diet.
>
> detail
>   - remove "while (nr_scanned<  max_scan)" loop
>   - remove nr_freed (now, we use nr_reclaimed directly)
>   - remove nr_scan (now, we use nr_scanned directly)
>   - rename max_scan to nr_to_scan
>   - pass nr_to_scan into isolate_pages() directly instead
>     using SWAP_CLUSTER_MAX
>
> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Johannes Weiner<hannes@cmpxchg.org>

Other than the weird whitespace below,

Reviewed-by: Rik van Riel <riel@redhat.com>

> +	/*
> +	 * If we are direct reclaiming for contiguous pages and we do
> +	 * not reclaim everything in the list, try again and wait
> +	 * for IO to complete. This will stall high-order allocations
> +	 * but that should be acceptable to the caller
> +	 */
> +	if (nr_reclaimed<  nr_taken&&  !current_is_kswapd()&&  sc->lumpy_reclaim_mode) {
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 07/12] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 18:14     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 18:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> Remove temporary variable that is only used once and does not help
> clarify code.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 18:59     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 18:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
> uses significant amounts of stack doing this. This patch splits
> shrink_inactive_list() to take the stack usage out of the main path so
> that callers to writepage() do not contain an unused pagevec on the
> stack.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> Reviewed-by: Johannes Weiner<hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 19:24     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 19:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> shrink_page_list() sets up a pagevec to release pages as according as they
> are free. It uses significant amounts of stack on the pagevec. This
> patch adds pages to be freed via pagevec to a linked list which is then
> freed en-masse at the end. This avoids using stack in the main path that
> potentially calls writepage().
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 10/12] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 19:42     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 19:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> When shrink_inactive_list() isolates pages, it updates a number of
> counters using temporary variables to gather them. These consume stack
> and it's in the main path that calls ->writepage(). This patch moves the
> accounting updates outside of the main path to reduce stack usage.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> Reviewed-by: Johannes Weiner<hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 21:01     ` Larry Woodman
  -1 siblings, 0 replies; 198+ messages in thread
From: Larry Woodman @ 2010-06-14 21:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds two trace events for kswapd waking up and going asleep for
> the purposes of tracking kswapd activity and two trace events for direct
> reclaim beginning and ending. The information can be used to work out how
> much time a process or the system is spending on the reclamation of pages
> and in the case of direct reclaim, how many pages were reclaimed for that
> process.  High frequency triggering of these events could point to memory
> pressure problems.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Larry Woodman <lwoodman@redhat.com>



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 21:02     ` Larry Woodman
  -1 siblings, 0 replies; 198+ messages in thread
From: Larry Woodman @ 2010-06-14 21:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds an event for when pages are isolated en masse from the
> LRU lists. This event augments the information available on LRU traffic
> and can be used to evaluate lumpy reclaim.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Larry Woodman <lwoodman@redhat.com>



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 03/12] tracing, vmscan: Add trace event when a page is written
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 21:02     ` Larry Woodman
  -1 siblings, 0 replies; 198+ messages in thread
From: Larry Woodman @ 2010-06-14 21:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds a trace event for when page reclaim queues a page for IO and
> records whether it is synchronous or asynchronous. Excessive synchronous
> IO for a process can result in noticeable stalls during direct reclaim.
> Excessive IO from page reclaim may indicate that the system is seriously
> under-provisioned for the number of dirty pages that exist.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Larry Woodman <lwoodman@redhat.com>



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 21:03     ` Larry Woodman
  -1 siblings, 0 replies; 198+ messages in thread
From: Larry Woodman @ 2010-06-14 21:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds a simple post-processing script for the reclaim-related
> trace events.  It can be used to give an indication of how much traffic
> there is on the LRU lists and how severe latencies due to reclaim are.

Acked-by: Larry Woodman <lwoodman@redhat.com>



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 21:13     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 21:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> Page reclaim cleans individual pages using a_ops->writepage() because from
> the VM perspective, it is known that pages in a particular zone must be freed
> soon, it considers the target page to be the oldest and it does not want
> to wait while background flushers clean other pages. From a filesystem
> perspective this is extremely inefficient as it generates a very seeky
> IO pattern leading to the perverse situation where it can take longer to
> clean all dirty pages than it would have otherwise.

Reclaiming clean pages should be fast enough that this should
make little, if any, difference.

> This patch queues all dirty pages at once to maximise the chances that
> the write requests get merged efficiently. It also makes the next patch
> that avoids writeout from direct reclaim more straight-forward.

However, this is a convincing argument :)

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 21:55     ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-14 21:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/14/2010 07:17 AM, Mel Gorman wrote:

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4856a2a..574e816 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>   	return PAGE_SUCCESS;
>   }
>
> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
> +static inline bool reclaim_can_writeback(struct scan_control *sc)
> +{
> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
> +}
> +

I'm not entirely convinced on this bit, but am willing to
be convinced by the data.
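
For context, a rough sketch of how such a check might gate writeback in
shrink_page_list() (illustrative only; the surrounding flow is assumed,
not quoted from the patch):

        if (PageDirty(page) && !reclaim_can_writeback(sc)) {
                /*
                 * Leave the page dirty; it gets another lap on the LRU and
                 * kswapd (or the flusher threads) will clean it instead.
                 */
                goto keep_locked;
        }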

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-14 23:11     ` Dave Chinner
  -1 siblings, 0 replies; 198+ messages in thread
From: Dave Chinner @ 2010-06-14 23:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Chris Mason, Nick Piggin,
	Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> Page reclaim cleans individual pages using a_ops->writepage() because from
> the VM perspective, it is known that pages in a particular zone must be freed
> soon, it considers the target page to be the oldest and it does not want
> to wait while background flushers clean other pages. From a filesystem
> perspective this is extremely inefficient as it generates a very seeky
> IO pattern leading to the perverse situation where it can take longer to
> clean all dirty pages than it would have otherwise.
> 
> This patch queues all dirty pages at once to maximise the chances that
> the write requests get merged efficiently. It also makes the next patch
> that avoids writeout from direct reclaim more straight-forward.

Seeing as you have a list of pages for IO, perhaps they could be sorted
before issuing ->writepage on them.

That is, while this patch issues all the IO in one hit, it doesn't
change the order in which the IO is issued - it is still issued in
LRU order. Given that they are issued in a short period of time now,
rather than across a longer scan period, it is likely that it will
not be any faster as:

	a) IO will not be started as soon, and
	b) the IO scheduler still only has a small re-ordering
	   window and will choke just as much on random IO patterns.

However, there is a list_sort() function that could be used to sort
the list; sorting the list of pages by mapping and page->index
within the mapping would result in all the pages on each mapping
being sent down in ascending offset order at once - exactly how the
filesystems want IO to be sent to them.  Perhaps this is a simple
improvement that can be made to this code that will make a big
difference to worst case performance.
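
A minimal sketch of that suggestion (not from the posted patches; it
assumes the dirty pages are gathered on a temporary list linked through
page->lru, and uses the 2.6.3x list_sort() signature):

#include <linux/list_sort.h>

static int dirty_page_cmp(void *priv, struct list_head *a,
                          struct list_head *b)
{
        struct page *pa = list_entry(a, struct page, lru);
        struct page *pb = list_entry(b, struct page, lru);

        /* group by mapping first, then ascending file offset */
        if (pa->mapping != pb->mapping)
                return pa->mapping < pb->mapping ? -1 : 1;
        if (pa->index == pb->index)
                return 0;
        return pa->index < pb->index ? -1 : 1;
}

/* before the batched writeout of the (hypothetical) dirty_pages list: */
        list_sort(NULL, &dirty_pages, dirty_page_cmp);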

FWIW, I did this for delayed metadata buffer writeback in XFS
recently (i.e. sort the queue of (potentially tens of thousands of)
buffers in ascending block order before dispatch) and that showed a
10-15% reduction in seeks on simple kernel compile workloads. This
shows that if we optimise IO patterns at higher layers where the
sort window is much, much larger than in the IO scheduler, then
overall system performance improves....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 23:11     ` Dave Chinner
@ 2010-06-14 23:21       ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-14 23:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Tue, 15 Jun 2010 09:11:44 +1000
Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> > Page reclaim cleans individual pages using a_ops->writepage() because from
> > the VM perspective, it is known that pages in a particular zone must be freed
> > soon, it considers the target page to be the oldest and it does not want
> > to wait while background flushers clean other pages. From a filesystem
> > perspective this is extremely inefficient as it generates a very seeky
> > IO pattern leading to the perverse situation where it can take longer to
> > clean all dirty pages than it would have otherwise.
> > 
> > This patch queues all dirty pages at once to maximise the chances that
> > the write requests get merged efficiently. It also makes the next patch
> > that avoids writeout from direct reclaim more straight-forward.
> 
> Seeing as you have a list of pages for IO, perhaps they could be sorted
> before issuing ->writepage on them.
> 
> That is, while this patch issues all the IO in one hit, it doesn't
> change the order in which the IO is issued - it is still issued in
> LRU order. Given that they are issued in a short period of time now,
> rather than across a longer scan period, it is likely that it will
> not be any faster as:
> 
> 	a) IO will not be started as soon, and
> 	b) the IO scheduler still only has a small re-ordering
> 	   window and will choke just as much on random IO patterns.
> 
> However, there is a list_sort() function that could be used to sort
> the list; sorting the list of pages by mapping and page->index
> within the mapping would result in all the pages on each mapping
> being sent down in ascending offset order at once - exactly how the
> filesystems want IO to be sent to them.  Perhaps this is a simple
> improvement that can be made to this code that will make a big
> difference to worst case performance.
> 
> FWIW, I did this for delayed metadata buffer writeback in XFS
> recently (i.e. sort the queue of (potentially tens of thousands of)
> buffers in ascending block order before dispatch) and that showed a
> 10-15% reduction in seeks on simple kernel compile workloads. This
> shows that if we optimise IO patterns at higher layers where the
> sort window is much, much larger than in the IO scheduler, then
> overall system performance improves....

Yup.

But then, this all really should be done at the block layer so other
io-submitting-paths can benefit from it.

IOW, maybe "the sort queue is the submission queue" wasn't a good idea.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2
  2010-06-14 11:17 ` Mel Gorman
@ 2010-06-15  0:08   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-15  0:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Andrew Morton

On Mon, 14 Jun 2010 12:17:41 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> SysBench
> ========
>                 traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
>            1 11025.01 ( 0.00%) 10249.52 (-7.57%) 10430.57 (-5.70%)
>            2  3844.63 ( 0.00%)  4988.95 (22.94%)  4038.95 ( 4.81%)
>            3  3210.23 ( 0.00%)  2918.52 (-9.99%)  3113.38 (-3.11%)
>            4  1958.91 ( 0.00%)  1987.69 ( 1.45%)  1808.37 (-8.32%)
>            5  2864.92 ( 0.00%)  3126.13 ( 8.36%)  2355.70 (-21.62%)
>            6  4831.63 ( 0.00%)  3815.67 (-26.63%)  4164.09 (-16.03%)
>            7  3788.37 ( 0.00%)  3140.39 (-20.63%)  3471.36 (-9.13%)
>            8  2293.61 ( 0.00%)  1636.87 (-40.12%)  1754.25 (-30.75%)
> FTrace Reclaim Statistics
>                                      traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
> Direct reclaims                               9843      13398      51651 
> Direct reclaim pages scanned                871367    1008709    3080593 
> Direct reclaim write async I/O               24883      30699          0 
> Direct reclaim write sync I/O                    0          0          0 

Hmm, page scans and reclaims jump up but...


> User/Sys Time Running Test (seconds)        734.52    712.39     703.9
> Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
> Total Elapsed Time (seconds)               9710.02   9589.20   9334.45
> Percentage Time kswapd Awake                 0.06%     0.00%     0.00%
> 

Execution time is reduced. Does this show that removing the "I/O noise" of
direct reclaim makes the system happy? Or does writeback in direct reclaim
cost us too much?

It seems I'll have to consider avoiding direct reclaim in memcg later.

BTW, I think we'll have to add a wait-for-pages-to-be-cleaned trick in
direct reclaim later if we want to avoid too much scanning.


Thank you for the interesting test.

Regards,
-Kame


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 23:21       ` Andrew Morton
@ 2010-06-15  0:39         ` Dave Chinner
  -1 siblings, 0 replies; 198+ messages in thread
From: Dave Chinner @ 2010-06-15  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 09:11:44 +1000
> Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> > > Page reclaim cleans individual pages using a_ops->writepage() because from
> > > the VM perspective, it is known that pages in a particular zone must be freed
> > > soon, it considers the target page to be the oldest and it does not want
> > > to wait while background flushers clean other pages. From a filesystem
> > > perspective this is extremely inefficient as it generates a very seeky
> > > IO pattern leading to the perverse situation where it can take longer to
> > > clean all dirty pages than it would have otherwise.
> > > 
> > > This patch queues all dirty pages at once to maximise the chances that
> > > the write requests get merged efficiently. It also makes the next patch
> > > that avoids writeout from direct reclaim more straight-forward.
> > 
> > Seeing as you have a list of pages for IO, perhaps they could be sorted
> > before issuing ->writepage on them.
> > 
> > That is, while this patch issues all the IO in one hit, it doesn't
> > change the order in which the IO is issued - it is still issued in
> > LRU order. Given that they are issued in a short period of time now,
> > rather than across a longer scan period, it is likely that it will
> > not be any faster as:
> > 
> > 	a) IO will not be started as soon, and
> > 	b) the IO scheduler still only has a small re-ordering
> > 	   window and will choke just as much on random IO patterns.
> > 
> > However, there is a list_sort() function that could be used to sort
> > the list; sorting the list of pages by mapping and page->index
> > within the mapping would result in all the pages on each mapping
> > being sent down in ascending offset order at once - exactly how the
> > filesystems want IO to be sent to them.  Perhaps this is a simple
> > improvement that can be made to this code that will make a big
> > difference to worst case performance.
> > 
> > FWIW, I did this for delayed metadata buffer writeback in XFS
> > recently (i.e. sort the queue of (potentially tens of thousands of)
> > buffers in ascending block order before dispatch) and that showed a
> > 10-15% reduction in seeks on simple kernel compile workloads. This
> > shows that if we optimise IO patterns at higher layers where the
> > sort window is much, much larger than in the IO scheduler, then
> > overall system performance improves....
> 
> Yup.
> 
> But then, this all really should be done at the block layer so other
> io-submitting-paths can benefit from it.

That was what we did in the past with really, really deep IO
scheduler queues. That leads to IO latency and OOM problems because
we could lock gigabytes of memory away under IO and take minutes to
clean it.

Besides, there really isn't the right context in the block layer to
be able to queue and prioritise large amounts of IO without
significant penalties to some higher layer operation.

> IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

Perhaps, but IMO sorting should be done where the context allows it
to be done most efficiently. Sorting is most effective whenever a
significant queue of IO is formed, whether it be in the filesystem,
the VFS, the VM or the block layer because the IO stack is very much
a GIGO queue.

Simply put, there's nothing the lower layers can do to optimise bad
IO patterns from the higher layers because they have small sort
windows which are necessary to keep IO latency in check. Hence if
the higher layers feed the lower layers crap they simply don't have
the context or depth to perform the same level of optimisations we
can do easily higher up the stack.

IOWs, IMO anywhere there is a context with significant queue of IO,
that's where we should be doing a better job of sorting before that
IO is dispatched to the lower layers. This is still no guarantee of
better IO (e.g. if the filesystem fragments the file) but it does
give the lower layers a far better chance at optimal allocation and
scheduling of IO...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  0:39         ` Dave Chinner
@ 2010-06-15  1:16           ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-15  1:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On 06/14/2010 08:39 PM, Dave Chinner wrote:
> On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:

>> But then, this all really should be done at the block layer so other
>> io-submitting-paths can benefit from it.
>
> That was what we did in the past with really, really deep IO
> scheduler queues. That leads to IO latency and OOM problems because
> we could lock gigabytes of memory away under IO and take minutes to
> clean it.
>
> Besides, there really isn't the right context in the block layer to
> be able to queue and prioritise large amounts of IO without
> significant penalties to some higher layer operation.

Can we kick flushing for the whole inode at once from
vmscan.c?

That way we should:
1) ensure that the page we want is written to disk, and
2) flush out related pages at the same time, getting
    a decent IO pattern.

Chances are that if we want to evict one page from a
file, we'll also want to evict other pages from that
same file.  In fact, chances are a good number of them
will live nearby on the LRU list.

Does this make sense?

Would it be hard to add a "please flush this file"
way to call the filesystem flushing threads?
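
A minimal sketch of the intent, using the existing filemap_flush() as a
stand-in (a real version would hand the mapping over to the flusher
thread rather than issue the IO from the reclaim context):

static void flush_whole_file(struct page *page)
{
        struct address_space *mapping = page_mapping(page);

        /* async WB_SYNC_NONE writeback of the whole file, not one page */
        if (mapping)
                filemap_flush(mapping);
}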

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  0:39         ` Dave Chinner
@ 2010-06-15  1:39           ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-15  1:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> > On Tue, 15 Jun 2010 09:11:44 +1000
> > > 10-15% reduction in seeks on simple kernel compile workloads. This
> > > shows that if we optimise IO patterns at higher layers where the
> > > sort window is much, much larger than in the IO scheduler, then
> > > overall system performance improves....
> > 
> > Yup.
> > 
> > But then, this all really should be done at the block layer so other
> > io-submitting-paths can benefit from it.
> 
> That was what we did in the past with really, really deep IO
> scheduler queues. That leads to IO latency and OOM problems because
> we could lock gigabytes of memory away under IO and take minutes to
> clean it.
> 
> Besides, there really isn't the right context in the block layer to
> be able to queue and prioritise large amounts of IO without
> significant penalties to some higher layer operation.
> 
> > IOW, maybe "the sort queue is the submission queue" wasn't a good idea.
> 
> Perhaps, but IMO sorting should be done where the context allows it
> > to be done most efficiently. Sorting is most effective whenever a
> significant queue of IO is formed, whether it be in the filesystem,
> the VFS, the VM or the block layer because the IO stack is very much
> a GIGO queue.
> 
> Simply put, there's nothing the lower layers can do to optimise bad
> IO patterns from the higher layers because they have small sort
> windows which are necessary to keep IO latency in check. Hence if
> the higher layers feed the lower layers crap they simply don't have
> > the context or depth to perform the same level of optimisations we
> can do easily higher up the stack.
> 
> IOWs, IMO anywhere there is a context with significant queue of IO,
> that's where we should be doing a better job of sorting before that
> IO is dispatched to the lower layers. This is still no guarantee of
> better IO (e.g. if the filesystem fragments the file) but it does
> give the lower layers a far better chance at optimal allocation and
> scheduling of IO...

None of what you said had much to do with what I said.

What you've described are implementation problems in the current block
layer because it conflates "sorting" with "queueing".  I'm saying "fix
that".

And...  sorting at the block layer will always be superior to sorting
at the pagecache layer because the block layer sorts at the physical
block level and can handle not-well-laid-out files and can sort and merge
pages from different address_spaces.

Still, I suspect none of it will improve anything anyway.  Those pages
are still dirty, possibly-locked and need to go to disk.  It doesn't
matter from the MM POV whether they sit in some VM list or in the
request queue.

Possibly there may be some benefit to not putting so many of these
unreclaimable pages into the queue all at the same time.  But
that's a shortcoming in the block code: we should be able to shove
arbitrary numbers of dirty page (segments) into the queue and not gum
the system up.  Don't try to work around that in the VM.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  1:16           ` Rik van Riel
@ 2010-06-15  1:45             ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-15  1:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Chinner, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel <riel@redhat.com> wrote:

> Would it be hard to add a "please flush this file"
> way to call the filesystem flushing threads?

Passing the igrab()bed inode into the flusher threads would fix the
iput_final() problems, as long as the alloc_pages() caller never blocks
indefinitely waiting for the work which the flusher threads are doing. 

Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
caller holds VFS locks and is waiting for the flusher threads while all
the flusher threads are stuck under iput_final() waiting for those VFS
locks.

That's fixable by not using igrab()/iput().  You can use lock_page() to
pin the address_space.  Pass the address of the locked page across to
the flusher threads so they don't try to lock it a second time, or just
use trylocking on that writeback path or whatever.
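
A rough sketch of that trylocking idea (hypothetical helper, not from any
posted patch):

static int flusher_lock_page(struct page *page, struct page *already_locked)
{
        /*
         * The reclaimer passes in the page it already holds locked so the
         * flusher never takes it a second time; every other page is only
         * trylocked, so the flusher cannot block on reclaim's locks.
         */
        if (page == already_locked)
                return 1;
        return trylock_page(page);
}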


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  1:39           ` Andrew Morton
@ 2010-06-15  3:20             ` Dave Chinner
  -1 siblings, 0 replies; 198+ messages in thread
From: Dave Chinner @ 2010-06-15  3:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 09:11:44 +1000
> > > > 10-15% reduction in seeks on simple kernel compile workloads. This
> > > > shows that if we optimise IO patterns at higher layers where the
> > > > sort window is much, much larger than in the IO scheduler, then
> > > > overall system performance improves....
> > > 
> > > Yup.
> > > 
> > > But then, this all really should be done at the block layer so other
> > > io-submitting-paths can benefit from it.
> > 
> > That was what we did in the past with really, really deep IO
> > scheduler queues. That leads to IO latency and OOM problems because
> > we could lock gigabytes of memory away under IO and take minutes to
> > clean it.
> > 
> > Besides, there really isn't the right context in the block layer to
> > be able to queue and prioritise large amounts of IO without
> > significant penalties to some higher layer operation.
> > 
> > > IOW, maybe "the sort queue is the submission queue" wasn't a good idea.
> > 
> > Perhaps, but IMO sorting should be done where the context allows it
> > to be done most efficiently. Sorting is most effective whenever a
> > significant queue of IO is formed, whether it be in the filesystem,
> > the VFS, the VM or the block layer because the IO stack is very much
> > a GIGO queue.
> > 
> > Simply put, there's nothing the lower layers can do to optimise bad
> > IO patterns from the higher layers because they have small sort
> > windows which are necessary to keep IO latency in check. Hence if
> > the higher layers feed the lower layers crap they simply don't have
> > the context or depth to perform the same level of optimisations we
> > can do easily higher up the stack.
> > 
> > IOWs, IMO anywhere there is a context with significant queue of IO,
> > that's where we should be doing a better job of sorting before that
> > IO is dispatched to the lower layers. This is still no guarantee of
> > better IO (e.g. if the filesystem fragments the file) but it does
> > give the lower layers a far better chance at optimal allocation and
> > scheduling of IO...
> 
> None of what you said had much to do with what I said.
> 
> What you've described are implementation problems in the current block
> layer because it conflates "sorting" with "queueing".  I'm saying "fix
> that".

You can't sort until you've queued.

> And...  sorting at the block layer will always be superior to sorting
> at the pagecache layer because the block layer sorts at the physical
> block level and can handle not-well-laid-out files and can sort and merge
> pages from different address_spaces.

Yes, it can do that. And it still does that even if the higher
layers sort their I/O dispatch better.

Filesystems try very hard to allocate adjacent logical offsets in a
file in adjacent physical blocks on disk - that's the whole point of
extent-indexed filesystems. Hence with modern filesystems there is
generally a direct correlation between the page {mapping,index}
tuple and the physical location of the mapped block.

i.e. there is generally zero physical correlation between pages in
different mappings, but there is a high physical correlation
between the index of pages on the same mapping. Hence by sorting
where we have a {mapping,index} context, we push out IO that is
much more likely to be in contiguous physical chunks that the
current random page shootdown.

We optimise applications to use these sorts of correlations all the
time to improve IO patterns. Why can't we make the same sort of
optimisations to the IO that the VM issues?

> Still, I suspect none of it will improve anything anyway.  Those pages
> are still dirty, possibly-locked and need to go to disk.  It doesn't
> matter from the MM POV whether they sit in some VM list or in the
> request queue.

Oh, but it does.....

> Possibly there may be some benefit to not putting so many of these
> unreclaimable pages into the queue all at the the same time.  But
> that's a shortcoming in the block code: we should be able to shove
> arbitrary numbers of dirty page (segments) into the queue and not gum
> the system up.  Don't try to work around that in the VM.

I think you know perfectly well why the system gums up when we
increase block layer queue depth: it's the fact that the _VM_ relies
on block layer queue congestion to limit the amount of dirty memory
in the system.

We've got a feedback loop between the block layer and the VM that
only works if block device queues are kept shallow. Keeping the
number of dirty pages under control is a VM responsibility, but it
is putting limitations on the block layer to ensure that the VM
works correctly.  If you want the block layer to have deep queues,
then someone needs to fix the VM not to require knowledge of the
internal operation of the block layer for correct operation.
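
Concretely, that reliance is already visible in mm/vmscan.c - roughly
(a trimmed sketch from memory, not a verbatim quote):

static int may_write_to_queue(struct backing_dev_info *bdi)
{
	if (current->flags & PF_SWAPWRITE)
		return 1;	/* kswapd and zone reclaim may always write */
	if (!bdi_write_congested(bdi))
		return 1;	/* queue still shallow enough, go ahead */
	if (bdi == current->backing_dev_info)
		return 1;	/* we already issue IO to this device */
	return 0;		/* congested: skip the write for now */
}

while on the throttle side reclaim backs off with
congestion_wait(BLK_RW_ASYNC, HZ/10) in various places once it decides
the queues are congested.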

Adding a few lines of code to sort a list in the VM is far, far
easier than redesigning the write throttling code....
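
Something along these lines would do, assuming the batch of dirty pages
is held on a private list threaded through page->lru and that
list_sort() from lib/list_sort.c is usable here; the comparator name is
only illustrative:

#include <linux/list_sort.h>

static int cmp_page_mapping_index(void *priv, struct list_head *a,
				  struct list_head *b)
{
	struct page *pa = list_entry(a, struct page, lru);
	struct page *pb = list_entry(b, struct page, lru);

	/* group by mapping first, then by index within the mapping */
	if (pa->mapping != pb->mapping)
		return pa->mapping < pb->mapping ? -1 : 1;
	if (pa->index == pb->index)
		return 0;
	return pa->index < pb->index ? -1 : 1;
}

	/* before dispatching the batch: */
	list_sort(NULL, &dirty_page_list, cmp_page_mapping_index);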

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  1:45             ` Andrew Morton
@ 2010-06-15  4:08               ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-15  4:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On 06/14/2010 09:45 PM, Andrew Morton wrote:
> On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel<riel@redhat.com>  wrote:
>
>> Would it be hard to add a "please flush this file"
>> way to call the filesystem flushing threads?
>
> Passing the igrab()bed inode into the flusher threads would fix the
> iput_final() problems, as long as the alloc_pages() caller never blocks
> indefinitely waiting for the work which the flusher threads are doing.
>
> Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
> caller holds VFS locks and is waiting for the flusher threads while all
> the flusher threads are stuck under iput_final() waiting for those VFS
> locks.
>
> That's fixable by not using igrab()/iput().  You can use lock_page() to
> pin the address_space.  Pass the address of the locked page across to
> the flusher threads so they don't try to lock it a second time, or just
> use trylocking on that writeback path or whatever.

Any thread that does not have __GFP_FS set in its gfp_mask
cannot wait for the flusher to complete. This is regardless
of the mechanism used to kick the flusher.

Then again, those threads cannot call ->writepage today
either, so we should be fine keeping that behaviour.
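
As a sketch - both helpers below are purely hypothetical stand-ins for
whatever "please flush this file" interface we end up with - the gate
in the reclaim path would be roughly:

	if (!(sc->gfp_mask & __GFP_FS))
		kick_flusher(mapping);		/* async poke only, never wait */
	else
		wait_for_flusher(mapping);	/* may block: __GFP_FS callers only */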

Threads that do have __GFP_FS in their gfp_mask can wait
for the flusher in various ways.  Maybe the lock_page()
method can be simplified by having the flusher thread
unlock the page the moment it gets it, and then run the
normal flusher code?

The pageout code (in shrink_page_list) already unlocks
the page anyway before putting it back on the relevant
LRU list.  It would be easy enough to skip that unlock
and let the flusher thread take care of it.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  3:20             ` Dave Chinner
@ 2010-06-15  4:15               ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-15  4:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <david@fromorbit.com> wrote:

> On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > 
> > > 
> > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > that's where we should be doing a better job of sorting before that
> > > IO is dispatched to the lower layers. This is still no guarantee of
> > > better IO (e.g. if the filesystem fragments the file) but it does
> > > give the lower layers a far better chance at optimal allocation and
> > > scheduling of IO...
> > 
> > None of what you said had much to do with what I said.
> > 
> > What you've described are implementation problems in the current block
> > layer because it conflates "sorting" with "queueing".  I'm saying "fix
> > that".
> 
> You can't sort until you've queued.

Yes you can.  That's exactly what you're recommending!  Only you're
recommending doing it at the wrong level.  The fs-writeback radix-tree
walks do it at the wrong level too.  Sorting should be done within, or
in a layer above the block queues, not within the large number of
individual callers.

> > And...  sorting at the block layer will always be superior to sorting
> > at the pagecache layer because the block layer sorts at the physical
> > block level and can handle not-well-laid-out files and can sort and merge
> > pages from different address_spaces.
> 
> Yes it, can do that. And it still does that even if the higher
> layers sort their I/O dispatch better,
> 
> Filesystems try very hard to allocate adjacent logical offsets in a
> file in adjacent physical blocks on disk - that's the whole point of
> extent-indexed filesystems. Hence with modern filesystems there is
> generally a direct correlation between the page {mapping,index}
> tuple and the physical location of the mapped block.
> 
> i.e. there is generally zero physical correlation between pages in
> different mappings, but there is a high physical correlation
> between the index of pages on the same mapping.

Nope.  Large-number-of-small-files is a pretty common case.  If the fs
doesn't handle that well (ie: by placing them nearby on disk), it's
borked.

> Hence by sorting
> where we have a {mapping,index} context, we push out IO that is
> much more likely to be in contiguous physical chunks that the
> current random page shootdown.
> 
> We optimise applications to use these sorts of correlations all the
> time to improve IO patterns. Why can't we make the same sort of
> optmisations to the IO that the VM issues?

We can, but it shouldn't be specific to page reclaim.  Other places
submit IO too, and want the same treatment.

> > Still, I suspect none of it will improve anything anyway.  Those pages
> > are still dirty, possibly-locked and need to go to disk.  It doesn't
> > matter from the MM POV whether they sit in some VM list or in the
> > request queue.
> 
> Oh, but it does.....

The only difference is that pages which are in the queue (current
implementation thereof) can't be shot down by truncate.

> > Possibly there may be some benefit to not putting so many of these
> > unreclaimable pages into the queue all at the the same time.  But
> > that's a shortcoming in the block code: we should be able to shove
> > arbitrary numbers of dirty page (segments) into the queue and not gum
> > the system up.  Don't try to work around that in the VM.
> 
> I think you know perfectly well why the system gums up when we
> increase block layer queue depth: it's the fact that the _VM_ relies
> on block layer queue congestion to limit the amount of dirty memory
> in the system.

mm, a little bit still, I guess.  Mainly because dirty memory
management isn't zone aware, so even though we limit dirty memory
globally, a particular zone(set) can get excessively dirtied.

Most of this problem happens on the balance_dirty_pages() path, where we
already sort the pages in ascending logical order.

> We've got a feedback loop between the block layer and the VM that
> only works if block device queues are kept shallow. Keeping the
> number of dirty pages under control is a VM responsibility, but it
> is putting limitations on the block layer to ensure that the VM
> works correctly.  If you want the block layer to have deep queues,
> then someone needs to fix the VM not to require knowledge of the
> internal operation of the block layer for correct operation.
> 
> Adding a few lines of code to sort a list in the VM is far, far
> easier than redesigning the write throttling code....

It's a hack and a workaround.  And I suspect it won't make any
difference, especially given Mel's measurements of the number of dirty
pages he's seeing coming off the LRU.  Although those numbers may well
be due to the new quite-low dirty memory thresholds.  

It would be interesting to code up a little test patch though, see if
there's benefit to be had going down this path.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  4:08               ` Rik van Riel
@ 2010-06-15  4:37                 ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-15  4:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Chinner, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Tue, 15 Jun 2010 00:08:14 -0400 Rik van Riel <riel@redhat.com> wrote:

> On 06/14/2010 09:45 PM, Andrew Morton wrote:
> > On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel<riel@redhat.com>  wrote:
> >
> >> Would it be hard to add a "please flush this file"
> >> way to call the filesystem flushing threads?
> >
> > Passing the igrab()bed inode into the flusher threads would fix the
> > iput_final() problems, as long as the alloc_pages() caller never blocks
> > indefinitely waiting for the work which the flusher threads are doing.
> >
> > Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
> > caller holds VFS locks and is waiting for the flusher threads while all
> > the flusher threads are stuck under iput_final() waiting for those VFS
> > locks.
> >
> > That's fixable by not using igrab()/iput().  You can use lock_page() to
> > pin the address_space.  Pass the address of the locked page across to
> > the flusher threads so they don't try to lock it a second time, or just
> > use trylocking on that writeback path or whatever.
> 
> Any thread that does not have __GFP_FS set in its gfp_mask
> cannot wait for the flusher to complete. This is regardless
> of the mechanism used to kick the flusher.

mm...  kinda.  A bare order-zero __GFP_WAIT allocation can still wait
forever, afaict.

> Then again, those threads cannot call ->writepage today
> either, so we should be fine keeping that behaviour.

I'm not sure.  iput_final() can take a lot of locks, both VFS and
heaven knows what within the individual filesystems.  Is it the case
that all allocations which occur under all of those locks are always
!__GFP_FS?  Hard to say...

> Threads that do have __GFP_FS in their gfp_mask can wait
> for the flusher in various ways.  Maybe the lock_page()
> method can be simplified by having the flusher thread
> unlock the page the moment it gets it, and then run the
> normal flusher code?

Well, _something_ has to pin the address_space.  A single locked page
will do.

> The pageout code (in shrink_page_list) already unlocks
> the page anyway before putting it back on the relevant
> LRU list.  It would be easy enough to skip that unlock
> and let the flusher thread take care of it.

Once that page is unlocked, we can't touch *mapping - its inode can be
concurrently reclaimed.  Although I guess the technique in
handle_write_error() can be reused.
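
For reference, the technique in question is simply to re-take the page
lock and re-check the mapping before trusting it; the current
handle_write_error() in mm/vmscan.c is essentially:

	lock_page(page);
	if (page_mapping(page) == mapping)
		mapping_set_error(mapping, error);
	unlock_page(page);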


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  4:37                 ` Andrew Morton
@ 2010-06-15  5:12                   ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2010-06-15  5:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Dave Chinner, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Johannes Weiner,
	Christoph Hellwig, KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 09:37:32PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 00:08:14 -0400 Rik van Riel <riel@redhat.com> wrote:
> 
> > On 06/14/2010 09:45 PM, Andrew Morton wrote:
> > > On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel<riel@redhat.com>  wrote:
> > >
> > >> Would it be hard to add a "please flush this file"
> > >> way to call the filesystem flushing threads?
> > >
> > > Passing the igrab()bed inode into the flusher threads would fix the
> > > iput_final() problems, as long as the alloc_pages() caller never blocks
> > > indefinitely waiting for the work which the flusher threads are doing.
> > >
> > > Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
> > > caller holds VFS locks and is waiting for the flusher threads while all
> > > the flusher threads are stuck under iput_final() waiting for those VFS
> > > locks.
> > >
> > > That's fixable by not using igrab()/iput().  You can use lock_page() to
> > > pin the address_space.  Pass the address of the locked page across to
> > > the flusher threads so they don't try to lock it a second time, or just
> > > use trylocking on that writeback path or whatever.
> > 
> > Any thread that does not have __GFP_FS set in its gfp_mask
> > cannot wait for the flusher to complete. This is regardless
> > of the mechanism used to kick the flusher.
> 
> mm...  kinda.  A bare order-zero __GFP_WAIT allocation can still wait
> forever, afaict.
> 
> > Then again, those threads cannot call ->writepage today
> > either, so we should be fine keeping that behaviour.
> 
> I'm not sure.  iput_final() can take a lot of locks, both VFS and
> heaven knows what within the individual filesystems.  Is it the case
> that all allocations which occur under all of those locks is always
> !__GFP_FS?  Hard to say...

__GFP_FS is set with i_mutex held in places, and there is nothing to
prevent a filesystem from using that in iput_final paths, AFAIK.

 
> > Threads that do have __GFP_FS in their gfp_mask can wait
> > for the flusher in various ways.  Maybe the lock_page()
> > method can be simplified by having the flusher thread
> > unlock the page the moment it gets it, and then run the
> > normal flusher code?
> 
> Well, _something_ has to pin the address_space.  A single locked page
> will do.
> 
> > The pageout code (in shrink_page_list) already unlocks
> > the page anyway before putting it back on the relevant
> > LRU list.  It would be easy enough to skip that unlock
> > and let the flusher thread take care of it.
> 
> Once that page is unlocked, we can't touch *mapping - its inode can be
> concurrently reclaimed.  Although I guess the technique in
> handle_write_error() can be reused.

Nasty. That guy needs to be using lock_page_nosync().


^ permalink raw reply	[flat|nested] 198+ messages in thread

* [patch] mm: vmscan fix mapping use after free
  2010-06-15  5:12                   ` Nick Piggin
@ 2010-06-15  5:43                     ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2010-06-15  5:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Dave Chinner, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Johannes Weiner,
	Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 03:12:42PM +1000, Nick Piggin wrote:
> > Once that page is unlocked, we can't touch *mapping - its inode can be
> > concurrently reclaimed.  Although I guess the technique in
> > handle_write_error() can be reused.
> 
> Nasty. That guy needs to be using lock_page_nosync().
--

Need lock_page_nosync here because we have no reference to the mapping when
taking the page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -296,7 +296,7 @@ static int may_write_to_queue(struct bac
 static void handle_write_error(struct address_space *mapping,
 				struct page *page, int error)
 {
-	lock_page(page);
+	lock_page_nosync(page);
 	if (page_mapping(page) == mapping)
 		mapping_set_error(mapping, error);
 	unlock_page(page);

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  4:15               ` Andrew Morton
@ 2010-06-15  6:36                 ` Dave Chinner
  -1 siblings, 0 replies; 198+ messages in thread
From: Dave Chinner @ 2010-06-15  6:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 09:15:15PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > > 
> > > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > > that's where we should be doing a better job of sorting before that
> > > > IO is dispatched to the lower layers. This is still no guarantee of
> > > > better IO (e.g. if the filesystem fragments the file) but it does
> > > > give the lower layers a far better chance at optimal allocation and
> > > > scheduling of IO...
> > > 
> > > None of what you said had much to do with what I said.
> > > 
> > > What you've described are implementation problems in the current block
> > > layer because it conflates "sorting" with "queueing".  I'm saying "fix
> > > that".
> > 
> > You can't sort until you've queued.
> 
> Yes you can.  That's exactly what you're recommending!

Umm, I suggested sorting a queue of dirty pages that was built by
reclaim before dispatching them. How does that translate to
me recommending "sort before queuing"?

> Only you're
> recommending doing it at the wrong level.

If you feed a filesystem garbage IO, you'll get garbage performance
and there's nothing that a block layer sort queue can do to fix the
damage it does to both performance and filesystem fragmentation
levels. It's not just about IO issue - delayed allocation pretty
much requires writeback to be issuing well formed IOs to reap the
benefits it can provide....

> > > And...  sorting at the block layer will always be superior to sorting
> > > at the pagecache layer because the block layer sorts at the physical
> > > block level and can handle not-well-laid-out files and can sort and merge
> > > pages from different address_spaces.
> > 
> > Yes it, can do that. And it still does that even if the higher
> > layers sort their I/O dispatch better,
> > 
> > Filesystems try very hard to allocate adjacent logical offsets in a
> > file in adjacent physical blocks on disk - that's the whole point of
> > extent-indexed filesystems. Hence with modern filesystems there is
> > generally a direct correlation between the page {mapping,index}
> > tuple and the physical location of the mapped block.
> > 
> > i.e. there is generally zero physical correlation between pages in
> > different mappings, but there is a high physical correlation
> > between the index of pages on the same mapping.
> 
> Nope.  Large-number-of-small-files is a pretty common case.  If the fs
> doesn't handle that well (ie: by placing them nearby on disk), it's
> borked.

Filesystems already handle this case just fine as we see it from
writeback all the time. Untarring a kernel is a good example of
this...

I suggested sorting all the IO to be issued into per-mapping page
groups because:
	a) makes IO issued from reclaim look almost exactly the same
	   to the filesystem as if writeback is pushing out the IO.
	b) it looks to be a trivial addition to the new code.

To me that's a no-brainer.
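
With the batch sorted by {mapping,index} - e.g. using a comparator like
the one sketched earlier in the thread - the dispatch side stays
trivial, something like the following, where write_reclaim_page() is
just a stand-in for whatever helper pushes out a single dirty page:

	list_sort(NULL, &dirty_page_list, cmp_page_mapping_index);
	while (!list_empty(&dirty_page_list)) {
		struct page *page = list_entry(dirty_page_list.next,
					       struct page, lru);

		list_del(&page->lru);
		write_reclaim_page(page);	/* hypothetical helper */
	}

Pages of the same mapping then come off the list adjacent to each other
and in ascending index order, so the filesystem sees IO that looks very
much like ordinary per-inode writeback.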

> It would be interesting to code up a little test patch though, see if
> there's benefit to be had going down this path.

I doubt Mel's test cases will show anything - they simply didn't
show enough IO issued from reclaim to make any difference.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 06/12] vmscan: simplify shrink_inactive_list()
  2010-06-14 18:06     ` Rik van Riel
@ 2010-06-15 10:13       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 10:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Jun 14, 2010 at 02:06:13PM -0400, Rik van Riel wrote:
> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>> From: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>>
>> Now, max_scan of shrink_inactive_list() is always passed less than
>> SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
>> This patch also help stack diet.
>>
>> detail
>>   - remove "while (nr_scanned<  max_scan)" loop
>>   - remove nr_freed (now, we use nr_reclaimed directly)
>>   - remove nr_scan (now, we use nr_scanned directly)
>>   - rename max_scan to nr_to_scan
>>   - pass nr_to_scan into isolate_pages() directly instead
>>     using SWAP_CLUSTER_MAX
>>
>> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
>> Reviewed-by: Johannes Weiner<hannes@cmpxchg.org>
>
> Other than the weird whitespace below,
>

Not sure where this came from. It's not in my local patch file nor in
my working tree. Very odd.

> Reviewed-by: Rik van Riel <riel@redhat.com>
>

Thanks

>> +	/*
>> +	 * If we are direct reclaiming for contiguous pages and we do
>> +	 * not reclaim everything in the list, try again and wait
>> +	 * for IO to complete. This will stall high-order allocations
>> +	 * but that should be acceptable to the caller
>> +	 */
>> +	if (nr_reclaimed<  nr_taken&&  !current_is_kswapd()&&  sc->lumpy_reclaim_mode) {
>> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> -- 
> All rights reversed
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 21:13     ` Rik van Riel
@ 2010-06-15 10:18       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 10:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Jun 14, 2010 at 05:13:37PM -0400, Rik van Riel wrote:
> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>> Page reclaim cleans individual pages using a_ops->writepage() because from
>> the VM perspective, it is known that pages in a particular zone must be freed
>> soon, it considers the target page to be the oldest and it does not want
>> to wait while background flushers cleans other pages. From a filesystem
>> perspective this is extremely inefficient as it generates a very seeky
>> IO pattern leading to the perverse situation where it can take longer to
>> clean all dirty pages than it would have otherwise.
>
> Reclaiming clean pages should be fast enough that this should
> make little, if any, difference.
>

Indeed, this was a bit weak. The original point of the patch was to write
contiguous pages belonging to the same inode when they were encountered in
that batch, which made a bit more sense but didn't work out on the first
pass.

>> This patch queues all dirty pages at once to maximise the chances that
>> the write requests get merged efficiently. It also makes the next patch
>> that avoids writeout from direct reclaim more straight-forward.
>
> However, this is a convincing argument :)
>

Thanks.

>> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
>
> Reviewed-by: Rik van Riel <riel@redhat.com>
>

Thanks again :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  6:36                 ` Dave Chinner
@ 2010-06-15 10:28                   ` Evgeniy Polyakov
  -1 siblings, 0 replies; 198+ messages in thread
From: Evgeniy Polyakov @ 2010-06-15 10:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner (david@fromorbit.com) wrote:
> > Nope.  Large-number-of-small-files is a pretty common case.  If the fs
> > doesn't handle that well (ie: by placing them nearby on disk), it's
> > borked.
> 
> Filesystems already handle this case just fine as we see it from
> writeback all the time. Untarring a kernel is a good example of
> this...
> 
> I suggested sorting all the IO to be issued into per-mapping page
> groups because:
> 	a) makes IO issued from reclaim look almost exactly the same
> 	   to the filesytem as if writeback is pushing out the IO.
> 	b) it looks to be a trivial addition to the new code.
> 
> To me that's a no-brainer.

That doesn't cover the large-number-of-small-files pattern, since
untarring means creating something new, which the FS can optimize. A
much more interesting case is when we have dirtied a large number of
small files in kind-of random order and submitted them down to disk.

Per-mapping sorting will not do anything good in this case, even if the
files were previously created in a good fashion, placed closely together
and so on; only the block layer will find a correlation between adjacent
blocks in different files. But with the existing queue management it has
quite a small opportunity to do so, and that's what I think Andrew is
arguing about.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-15 10:47     ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 10:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Jun 14, 2010 at 12:17:49PM +0100, Mel Gorman wrote:
>  /*
> + * TODO: Try merging with migrations version of putback_lru_pages
> + */
> +static noinline_for_stack void putback_lru_pages(struct zone *zone,
> +				struct zone_reclaim_stat *reclaim_stat,
> +				unsigned long nr_anon, unsigned long nr_file,
> + 				struct list_head *page_list)
> +{

I hate to nitpick on this, but with noinline_for_stack a prototype
really is unreadable unless the function identifier goes to the next
line.  Compare the one above with:

static noinline_for_stack void
putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
		unsigned long nr_anon, unsigned long nr_file,
		struct list_head *page_list)

> -static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> +static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>  			struct zone *zone, struct scan_control *sc,
>  			int priority, int file)

Same here, just even worse due to the spill over 80 characters.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-15 10:53     ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 10:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

> +	/*
> +	 * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
> +	 * the page lock has been dropped by ->writepage, that mapping could
> +	 * be anything
> +	 */

Why is this an XXX comment?

> +	/*
> +	 * Wait on writeback if requested to. This happens when
> +	 * direct reclaiming a large contiguous area and the
> +	 * first attempt to free a range of pages fails.
> +	 */
> +	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> +		wait_on_page_writeback(page);
> +
> +	if (!PageWriteback(page)) {
> +		/* synchronous write or broken a_ops? */
> +		ClearPageReclaim(page);
> +	}

how about:

	if (PageWriteback(page)) {
		if (sync_writeback == PAGEOUT_IO_SYNC)
			wait_on_page_writeback(page);
	} else {
		/* synchronous write or broken a_ops? */
		ClearPageReclaim(page);
	}

>  	if (!may_write_to_queue(mapping->backing_dev_info))
>  		return PAGE_KEEP;
>  
>  /*
> + * Clean a list of pages. It is expected that all the pages on page_list have been
> + * locked as part of isolation from the LRU.

A rather pointless line of 80 chars.  I see the point for long string
literals, but here it's just a pain.

> + *
> + * XXX: Is there a problem with holding multiple page locks like this?

I think there is.  There are quite a few places that do hold multiple
pages locked, but they always lock pages in increasing page->index order.
Given that this locks basically in random order it could cause problems
for those places.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 10:28                   ` Evgeniy Polyakov
@ 2010-06-15 10:55                     ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2010-06-15 10:55 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner (david@fromorbit.com) wrote:
> > > Nope.  Large-number-of-small-files is a pretty common case.  If the fs
> > > doesn't handle that well (ie: by placing them nearby on disk), it's
> > > borked.
> > 
> > Filesystems already handle this case just fine as we see it from
> > writeback all the time. Untarring a kernel is a good example of
> > this...
> > 
> > I suggested sorting all the IO to be issued into per-mapping page
> > groups because:
> > 	a) makes IO issued from reclaim look almost exactly the same
> > 	   to the filesytem as if writeback is pushing out the IO.
> > 	b) it looks to be a trivial addition to the new code.
> > 
> > To me that's a no-brainer.
> 
> That doesn't coverup large-number-of-small-files pattern, since
> untarring subsequently means creating something new, which FS can
> optimize. Much more interesting case is when we have dirtied large
> number of small files in kind-of random order and submitted them
> down to disk.
> 
> Per-mapping sorting will not do anything good in this case, even if
> files were previously created in a good facion being placed closely and
> so on, and only block layer will find a correlation between adjacent
> blocks in different files. But with existing queue management it has
> quite a small opportunity, and that's what I think Andrew is arguing
> about.

The solution is not to sort pages on their way to be submitted either,
really.

What I do in fsblock is to maintain a block-nr sorted tree of dirty
blocks. This works nicely because fsblock dirty state is properly
synchronized with page dirty state. So writeout can just walk this in
order, and it provides a pretty optimal submission pattern for any
interleaving of data and metadata. No need for buffer boundary or
hacks like that. (It needs some intelligence for delalloc, though.)
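
(Purely for illustration, a generic sketch of that kind of structure using
the kernel rbtree API; this is not fsblock code, and the struct and helper
names below are invented:)

#include <linux/rbtree.h>

/* Hypothetical: one node per dirty block, keyed by physical block number */
struct dirty_block {
	struct rb_node	rb;
	sector_t	blocknr;
};

static void dirty_tree_insert(struct rb_root *root, struct dirty_block *new)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct dirty_block *db = rb_entry(*p, struct dirty_block, rb);

		parent = *p;
		if (new->blocknr < db->blocknr)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&new->rb, parent, p);
	rb_insert_color(&new->rb, root);
}

static void writeout_in_block_order(struct rb_root *root)
{
	struct rb_node *node;

	for (node = rb_first(root); node; node = rb_next(node)) {
		struct dirty_block *db = rb_entry(node, struct dirty_block, rb);

		/* submit_dirty_block() is a stand-in for the real writeout,
		 * issued here in ascending block number order */
		submit_dirty_block(db);
	}
}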

But even with all that, it's not the complete story. It doesn't know
about direct IO, sync IO, or fsyncs, and it would be very hard and
ugly to try to synchronise and sort all that from the pagecache level.
It also is a heuristic in terms of optimal block scheduling behaviour.
With smarter devices and drivers there might be better ways to go.

So what is needed is to get as much info into the block layer as
possible. As Andrew says, there shouldn't be such a big difference
between pages being writeback or dirty in pagecache.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-14 23:21       ` Andrew Morton
@ 2010-06-15 10:57         ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
	Christoph Hellwig, KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> Yup.
> 
> But then, this all really should be done at the block layer so other
> io-submitting-paths can benefit from it.
> 
> IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

Even if it has no effect on the actual I/O patterns it has a massive
benefit for the filesystem.  When probing delalloc/unwritten space at
least XFS does try to convert a larger extent forward from the index,
but doesn't bother to go backwards.  By providing the trivial sort here
we make life a lot easier for the filesystem.

In addition to that we do get better I/O patterns, especially with short
queues, as smart writepage implementations will also submit the next few
pages, which is essentially free given how the storage works.
This means we may already have a page cleaned before we would even have
submitted it without sorting.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  1:16           ` Rik van Riel
@ 2010-06-15 11:01             ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 11:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Nick Piggin,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 09:16:29PM -0400, Rik van Riel wrote:
> >Besides, there really isn't the right context in the block layer to
> >be able to queue and prioritise large amounts of IO without
> >significant penalties to some higher layer operation.
> 
> Can we kick flushing for the whole inode at once from
> vmscan.c?

kswapd really should be a last effort tool to clean filesystem pages.
If it does enough I/O for this to matter significantly we need to
fix the VM to move more work to the flusher threads instead of trying
to fix kswapd.  

> Would it be hard to add a "please flush this file"
> way to call the filesystem flushing threads?

We already have that API; in Jens' latest tree that's
sync_inodes_sb/writeback_inodes_sb.  We could also add a non-waiting
variant if required, but I think the big problem with kswapd is that
we want to wait on I/O completion under some circumstances.
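
(A minimal, untested sketch of what kicking that from reclaim could look
like; sb pinning and s_umount locking are glossed over, and the exact
prototype may differ in Jens' tree:)

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/writeback.h>

/* Sketch only: ask the flusher threads to clean the superblock that a
 * dirty page belongs to, instead of calling ->writepage from reclaim */
static void reclaim_kick_flusher(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (mapping && mapping->host)
		writeback_inodes_sb(mapping->host->i_sb);
}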

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 10:28                   ` Evgeniy Polyakov
@ 2010-06-15 11:08                     ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 11:08 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> That doesn't coverup large-number-of-small-files pattern, since
> untarring subsequently means creating something new, which FS can
> optimize. Much more interesting case is when we have dirtied large
> number of small files in kind-of random order and submitted them
> down to disk.

That's why we still have block layer sorting.  But for the problem
of larger files, doing the sorting above the filesystem is a lot
more efficient, not primarily due to the I/O patterns but because
it makes life for the filesystem writeback code and allocator a lot
simpler.

> Per-mapping sorting will not do anything good in this case, even if
> files were previously created in a good facion being placed closely and
> so on, and only block layer will find a correlation between adjacent
> blocks in different files. But with existing queue management it has
> quite a small opportunity, and that's what I think Andrew is arguing
> about.

Which is actually more or less true - if we do larger amounts of
writeback from kswapd we're toast anyway and performance and allocation
patterns go down the toilet.  Then again, throwing a list_sort in is
a rather trivial addition.  Note that in addition to page->index we
can also sort by the inode number in the sort function.  At least for
XFS and the traditional ufs-derived allocators that will give you
additional locality for small files.
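
(Roughly what I have in mind, as an untested sketch; it assumes the dirty
pages sit on a private list threaded through page->lru as in Mel's patch,
and the function names here are made up:)

#include <linux/fs.h>
#include <linux/list_sort.h>
#include <linux/mm.h>

/* Sort key: inode number first, then page->index within each mapping */
static int cmp_page_for_writeout(void *priv, struct list_head *a,
				 struct list_head *b)
{
	struct page *pa = list_entry(a, struct page, lru);
	struct page *pb = list_entry(b, struct page, lru);
	struct address_space *ma = page_mapping(pa);
	struct address_space *mb = page_mapping(pb);
	unsigned long ino_a = (ma && ma->host) ? ma->host->i_ino : 0;
	unsigned long ino_b = (mb && mb->host) ? mb->host->i_ino : 0;

	if (ino_a != ino_b)
		return ino_a < ino_b ? -1 : 1;
	if (pa->index != pb->index)
		return pa->index < pb->index ? -1 : 1;
	return 0;
}

/* Called on the private page list just before the writepage loop */
static void sort_pages_for_writeout(struct list_head *page_list)
{
	list_sort(NULL, page_list, cmp_page_for_writeout);
}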


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 10:55                     ` Nick Piggin
@ 2010-06-15 11:10                       ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 11:10 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Evgeniy Polyakov, Dave Chinner, Andrew Morton, Mel Gorman,
	linux-kernel, linux-fsdevel, linux-mm, Chris Mason, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> 
> What I do in fsblock is to maintain a block-nr sorted tree of dirty
> blocks. This works nicely because fsblock dirty state is properly
> synchronized with page dirty state. So writeout can just walk this in
> order and it provides pretty optimal submission pattern of any
> interleavings of data and metadata. No need for buffer boundary or
> hacks like that. (needs some intelligence for delalloc, though).

I think worrying about indirect blocks really doesn't matter much
these days.  For one thing extent-based filesystems have a lot fewer
of these, and second, for a journaling filesystem we only need to log
modifications to the indirect blocks and not actually write them back
in place during the sync.  At least for XFS the actual writeback can
happen a lot later, as part of the ordered list of delwri buffers.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 10:53     ` Christoph Hellwig
@ 2010-06-15 11:11       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 11:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrew Morton

On Tue, Jun 15, 2010 at 06:53:41AM -0400, Christoph Hellwig wrote:
> > +	/*
> > +	 * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
> > +	 * the page lock has been dropped by ->writepage, that mapping could
> > +	 * be anything
> > +	 */
> 
> Why is this an XXX comment?
> 

With the page lock released, the mapping may no longer be valid. Nick
posted a patch in relation to it that I need to look at. The comment was
there because Andrew highlighted that this was buggy and I wanted to make
sure I didn't forget about it.

> > +	/*
> > +	 * Wait on writeback if requested to. This happens when
> > +	 * direct reclaiming a large contiguous area and the
> > +	 * first attempt to free a range of pages fails.
> > +	 */
> > +	if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> > +		wait_on_page_writeback(page);
> > +
> > +	if (!PageWriteback(page)) {
> > +		/* synchronous write or broken a_ops? */
> > +		ClearPageReclaim(page);
> > +	}
> 
> how about:
> 
> 	if (PageWriteback(page)) {
> 		if (sync_writeback == PAGEOUT_IO_SYNC)
> 			wait_on_page_writeback(page);
> 	} else {
> 		/* synchronous write or broken a_ops? */
> 		ClearPageReclaim(page);
> 	}
> 

Sure, that's tidier.

> >  	if (!may_write_to_queue(mapping->backing_dev_info))
> >  		return PAGE_KEEP;
> >  
> >  /*
> > + * Clean a list of pages. It is expected that all the pages on page_list have been
> > + * locked as part of isolation from the LRU.
> 
> A rather pointless line of 80 chars.  I see the point for long string
> literals, but here it's just a pain.
> 

I'll trim it.

> > + *
> > + * XXX: Is there a problem with holding multiple page locks like this?
> 
> I think there is.  There are quite a few places that do hold multiple
> pages locked, but they always lock pages in increasing page->index order.
> Given that this locks basically in random order it could cause problems
> for those places.
> 

Hmm, ok. In that case, I'll have to release the locks on the list and
reacquire them. It was something I would have preferred to avoid. Thanks
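
Something along these lines, as a rough and untested sketch (the helper
names are invented and the pages are assumed to be on a private list
threaded through page->lru):

#include <linux/mm.h>
#include <linux/pagemap.h>

static void relock_and_clean_pages(struct list_head *page_list)
{
	struct page *page;

	/* Drop the locks taken at isolation time... */
	list_for_each_entry(page, page_list, lru)
		unlock_page(page);

	/* ...and take them back one page at a time just before writeout,
	 * so at most one page lock is held at once.
	 * write_one_lru_page() is a stand-in for the pageout() call;
	 * ->writepage unlocks the page itself on success. */
	list_for_each_entry(page, page_list, lru) {
		if (!trylock_page(page))
			continue;
		write_one_lru_page(page);
	}
}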

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 10:53     ` Christoph Hellwig
@ 2010-06-15 11:13       ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2010-06-15 11:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrew Morton

On Tue, Jun 15, 2010 at 06:53:41AM -0400, Christoph Hellwig wrote:
> > +	/*
> > +	 * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
> > +	 * the page lock has been dropped by ->writepage, that mapping could
> > +	 * be anything
> > +	 */
> 
> Why is this an XXX comment?

It's just a pretty simple use-after-free. Maybe people forget it because
->writepage is an asynchronous API.


> > + *
> > + * XXX: Is there a problem with holding multiple page locks like this?
> 
> I think there is.  There are quite a few places that do hold multiple
> pages locked, but they always lock pages in increasing page->index order.
> Given that this locks basically in random order it could cause problems
> for those places.

There shouldn't be a problem _holding_ the locks, but there is a problem
waiting for multiple locks out of page->index order.

But there is a problem with holding the locks of a lot of pages while
calling ->writepage on them. So yeah, you can't do that.

Hmm, I should rediff that lockdep page_lock patch and get it merged.
(although I don't know if it can catch all these problems easily)


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 11:10                       ` Christoph Hellwig
@ 2010-06-15 11:20                         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2010-06-15 11:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Evgeniy Polyakov, Dave Chinner, Andrew Morton, Mel Gorman,
	linux-kernel, linux-fsdevel, linux-mm, Chris Mason, Rik van Riel,
	Johannes Weiner, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 07:10:26AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> > 
> > What I do in fsblock is to maintain a block-nr sorted tree of dirty
> > blocks. This works nicely because fsblock dirty state is properly
> > synchronized with page dirty state. So writeout can just walk this in
> > order and it provides pretty optimal submission pattern of any
> > interleavings of data and metadata. No need for buffer boundary or
> > hacks like that. (needs some intelligence for delalloc, though).
> 
> I think worrying about indirect blocks really doesn't matter much
> these days.  For one thing extent based filesystems have a lot less
> of these, and second for a journaling filesystem we only need to log
> modification to the indirect blocks and not actually write them back
> in place during the sync.  At least for XFS the actual writeback can
> happen a lot later, as part of the ordered list of delwri buffers.

That's true; more importantly, I meant any interleaving of data from
more than one file too.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15  4:15               ` Andrew Morton
@ 2010-06-15 11:43                 ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 09:15:15PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > > 
> > > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > > that's where we should be doing a better job of sorting before that
> > > > IO is dispatched to the lower layers. This is still no guarantee of
> > > > better IO (e.g. if the filesystem fragments the file) but it does
> > > > give the lower layers a far better chance at optimal allocation and
> > > > scheduling of IO...
> > > 
> > > None of what you said had much to do with what I said.
> > > 
> > > What you've described are implementation problems in the current block
> > > layer because it conflates "sorting" with "queueing".  I'm saying "fix
> > > that".
> > 
> > You can't sort until you've queued.
> 
> Yes you can.  That's exactly what you're recommending!  Only you're
> recommending doing it at the wrong level.  The fs-writeback radix-tree
> walks do it at the wrong level too.  Sorting should be done within, or
> in a layer above the block queues, not within the large number of
> individual callers.
> 
> > > And...  sorting at the block layer will always be superior to sorting
> > > at the pagecache layer because the block layer sorts at the physical
> > > block level and can handle not-well-laid-out files and can sort and merge
> > > pages from different address_spaces.
> > 
> > Yes it, can do that. And it still does that even if the higher
> > layers sort their I/O dispatch better,
> > 
> > Filesystems try very hard to allocate adjacent logical offsets in a
> > file in adjacent physical blocks on disk - that's the whole point of
> > extent-indexed filesystems. Hence with modern filesystems there is
> > generally a direct correlation between the page {mapping,index}
> > tuple and the physical location of the mapped block.
> > 
> > i.e. there is generally zero physical correlation between pages in
> > different mappings, but there is a high physical correlation
> > between the index of pages on the same mapping.
> 
> Nope.  Large-number-of-small-files is a pretty common case.  If the fs
> doesn't handle that well (ie: by placing them nearby on disk), it's
> borked.
> 
> > Hence by sorting
> > where we have a {mapping,index} context, we push out IO that is
> > much more likely to be in contiguous physical chunks that the
> > current random page shootdown.
> > 
> > We optimise applications to use these sorts of correlations all the
> > time to improve IO patterns. Why can't we make the same sort of
> > optmisations to the IO that the VM issues?
> 
> We can, but it shouldn't be specific to page reclaim.  Other places
> submit IO too, and want the same treatment.
> 
> > > Still, I suspect none of it will improve anything anyway.  Those pages
> > > are still dirty, possibly-locked and need to go to disk.  It doesn't
> > > matter from the MM POV whether they sit in some VM list or in the
> > > request queue.
> > 
> > Oh, but it does.....
> 
> The only difference is that pages which are in the queue (current
> implementation thereof) can't be shot down by truncate.
> 
> > > Possibly there may be some benefit to not putting so many of these
> > > unreclaimable pages into the queue all at the the same time.  But
> > > that's a shortcoming in the block code: we should be able to shove
> > > arbitrary numbers of dirty page (segments) into the queue and not gum
> > > the system up.  Don't try to work around that in the VM.
> > 
> > I think you know perfectly well why the system gums up when we
> > increase block layer queue depth: it's the fact that the _VM_ relies
> > on block layer queue congestion to limit the amount of dirty memory
> > in the system.
> 
> mm, a little bit still, I guess.  Mainly because dirty memory
> management isn't zone aware, so even though we limit dirty memory
> globally, a particular zone(set) can get excessively dirtied.
> 
> Most of this problem happen on the balance_dirty_pages() path, where we
> already sort the pages in ascending logical order.
> 
> > We've got a feedback loop between the block layer and the VM that
> > only works if block device queues are kept shallow. Keeping the
> > number of dirty pages under control is a VM responsibility, but it
> > is putting limitations on the block layer to ensure that the VM
> > works correctly.  If you want the block layer to have deep queues,
> > then someone needs to fix the VM not to require knowledge of the
> > internal operation of the block layer for correct operation.
> > 
> > Adding a few lines of code to sort a list in the VM is far, far
> > easier than redesigning the write throttling code....
> 
> It's a hack and a workaround.  And I suspect it won't make any
> difference, especially given Mel's measurements of the number of dirty
> pages he's seeing coming off the LRU.  Although those numbers may well
> be due to the new quite-low dirty memory thresholds.  
> 

I tested with a dirty ratio of 40 but I didn't see a major problem.
It's still the case with the tests I ran that direct reclaim of dirty pages
was a relatively rare event except when lumpy reclaim was involved. What
did change is the amount of scanning that direct reclaim and kswapd had to do
(both increased quite a bit), but the percentage of dirty pages encountered was
roughly the same (1-2% of scanned pages were dirty in the case of sysbench).

This is sysbench only rather than flooding with more data.

DIRTY RATIO == 20
FTrace Reclaim Statistics
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                               9843      13398      51651 
Direct reclaim pages scanned                871367    1008709    3080593 
Direct reclaim write async I/O               24883      30699          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       7070819    6961672   11268341 
Kswapd wakeups                                1578       1500        943 
Kswapd pages scanned                      22016558   21779455   17393431 
Kswapd reclaim write async I/O             1161346    1101641    1717759 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)             26.11      45.04       2.97 
Time kswapd awake (ms)                     5105.06    5135.93    6086.32 

User/Sys Time Running Test (seconds)        734.52    712.39     703.9
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               9710.02   9589.20   9334.45
Percentage Time kswapd Awake                 0.06%     0.00%     0.00%

DIRTY RATIO == 40
FTrace Reclaim Statistics
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                              29945      41887     163006 
Direct reclaim pages scanned               2853804    3075288   13142072 
Direct reclaim write async I/O               51498      63662          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                      11899105   12466894   15645364 
Kswapd wakeups                                 945        891        522 
Kswapd pages scanned                      20401921   20674788   11319791 
Kswapd reclaim write async I/O             1381897    1332436    1711266 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)            131.78     165.08       5.47 
Time kswapd awake (ms)                     6321.11    6413.79    6687.67 

User/Sys Time Running Test (seconds)        709.91    718.39    664.28
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               9579.90   9700.42   9101.05
Percentage Time kswapd Awake                 0.07%     0.00%     0.00%

I guess what was really interesting was that even though raising the
dirty ratio allowed the test to complete faster, the percentage of time
spent in direct reclaim increased quite a lot. Again, just stopping
writeback in direct reclaim seemed to help.

> It would be interesting to code up a little test patch though, see if
> there's benefit to be had going down this path.
> 

I'll do this just to see what it looks like. To be frank, I lack taste when
it comes to how the block layer and filesystem should behave, so I am having
trouble deciding if sorting the pages prior to submission is a good thing or
if it would just encourage bad or lax behaviour in the IO submission queueing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
@ 2010-06-15 11:43                 ` Mel Gorman
  0 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 11:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, Jun 14, 2010 at 09:15:15PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > > 
> > > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > > that's where we should be doing a better job of sorting before that
> > > > IO is dispatched to the lower layers. This is still no guarantee of
> > > > better IO (e.g. if the filesystem fragments the file) but it does
> > > > give the lower layers a far better chance at optimal allocation and
> > > > scheduling of IO...
> > > 
> > > None of what you said had much to do with what I said.
> > > 
> > > What you've described are implementation problems in the current block
> > > layer because it conflates "sorting" with "queueing".  I'm saying "fix
> > > that".
> > 
> > You can't sort until you've queued.
> 
> Yes you can.  That's exactly what you're recommending!  Only you're
> recommending doing it at the wrong level.  The fs-writeback radix-tree
> walks do it at the wrong level too.  Sorting should be done within, or
> in a layer above the block queues, not within the large number of
> individual callers.
> 
> > > And...  sorting at the block layer will always be superior to sorting
> > > at the pagecache layer because the block layer sorts at the physical
> > > block level and can handle not-well-laid-out files and can sort and merge
> > > pages from different address_spaces.
> > 
> > Yes it, can do that. And it still does that even if the higher
> > layers sort their I/O dispatch better,
> > 
> > Filesystems try very hard to allocate adjacent logical offsets in a
> > file in adjacent physical blocks on disk - that's the whole point of
> > extent-indexed filesystems. Hence with modern filesystems there is
> > generally a direct correlation between the page {mapping,index}
> > tuple and the physical location of the mapped block.
> > 
> > i.e. there is generally zero physical correlation between pages in
> > different mappings, but there is a high physical correlation
> > between the index of pages on the same mapping.
> 
> Nope.  Large-number-of-small-files is a pretty common case.  If the fs
> doesn't handle that well (ie: by placing them nearby on disk), it's
> borked.
> 
> > Hence by sorting
> > where we have a {mapping,index} context, we push out IO that is
> > much more likely to be in contiguous physical chunks that the
> > current random page shootdown.
> > 
> > We optimise applications to use these sorts of correlations all the
> > time to improve IO patterns. Why can't we make the same sort of
> > optmisations to the IO that the VM issues?
> 
> We can, but it shouldn't be specific to page reclaim.  Other places
> submit IO too, and want the same treatment.
> 
> > > Still, I suspect none of it will improve anything anyway.  Those pages
> > > are still dirty, possibly-locked and need to go to disk.  It doesn't
> > > matter from the MM POV whether they sit in some VM list or in the
> > > request queue.
> > 
> > Oh, but it does.....
> 
> The only difference is that pages which are in the queue (current
> implementation thereof) can't be shot down by truncate.
> 
> > > Possibly there may be some benefit to not putting so many of these
> > > unreclaimable pages into the queue all at the the same time.  But
> > > that's a shortcoming in the block code: we should be able to shove
> > > arbitrary numbers of dirty page (segments) into the queue and not gum
> > > the system up.  Don't try to work around that in the VM.
> > 
> > I think you know perfectly well why the system gums up when we
> > increase block layer queue depth: it's the fact that the _VM_ relies
> > on block layer queue congestion to limit the amount of dirty memory
> > in the system.
> 
> mm, a little bit still, I guess.  Mainly because dirty memory
> management isn't zone aware, so even though we limit dirty memory
> globally, a particular zone(set) can get excessively dirtied.
> 
> Most of this problem happen on the balance_dirty_pages() path, where we
> already sort the pages in ascending logical order.
> 
> > We've got a feedback loop between the block layer and the VM that
> > only works if block device queues are kept shallow. Keeping the
> > number of dirty pages under control is a VM responsibility, but it
> > is putting limitations on the block layer to ensure that the VM
> > works correctly.  If you want the block layer to have deep queues,
> > then someone needs to fix the VM not to require knowledge of the
> > internal operation of the block layer for correct operation.
> > 
> > Adding a few lines of code to sort a list in the VM is far, far
> > easier than redesigning the write throttling code....
> 
> It's a hack and a workaround.  And I suspect it won't make any
> difference, especially given Mel's measurements of the number of dirty
> pages he's seeing coming off the LRU.  Although those numbers may well
> be due to the new quite-low dirty memory thresholds.  
> 

I tested with a dirty ratio of 40 but I didn't see a major problem.
It's still a case with the tests I saw that direct reclaim of dirty pages
was a relatively rare event except when lumpy reclaim was involved. What
did change is the amount of scanning the direct reclaim and kswapd had to do
(both increased quite a bit) but the percentage of dirty pages encountered was
roughly the same (1-2% of scanned pages were dirty in the case of sysbench).

This is sysbench only rather than flooding with more data.

DIRTY RATIO == 20
FTrace Reclaim Statistics
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                               9843      13398      51651 
Direct reclaim pages scanned                871367    1008709    3080593 
Direct reclaim write async I/O               24883      30699          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                       7070819    6961672   11268341 
Kswapd wakeups                                1578       1500        943 
Kswapd pages scanned                      22016558   21779455   17393431 
Kswapd reclaim write async I/O             1161346    1101641    1717759 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)             26.11      45.04       2.97 
Time kswapd awake (ms)                     5105.06    5135.93    6086.32 

User/Sys Time Running Test (seconds)        734.52    712.39     703.9
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               9710.02   9589.20   9334.45
Percentage Time kswapd Awake                 0.06%     0.00%     0.00%

DIRTY RATIO == 40
FTrace Reclaim Statistics
                traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                              29945      41887     163006 
Direct reclaim pages scanned               2853804    3075288   13142072 
Direct reclaim write async I/O               51498      63662          0 
Direct reclaim write sync I/O                    0          0          0 
Wake kswapd requests                      11899105   12466894   15645364 
Kswapd wakeups                                 945        891        522 
Kswapd pages scanned                      20401921   20674788   11319791 
Kswapd reclaim write async I/O             1381897    1332436    1711266 
Kswapd reclaim write sync I/O                    0          0          0 
Time stalled direct reclaim (ms)            131.78     165.08       5.47 
Time kswapd awake (ms)                     6321.11    6413.79    6687.67 

User/Sys Time Running Test (seconds)        709.91    718.39    664.28
Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
Total Elapsed Time (seconds)               9579.90   9700.42   9101.05
Percentage Time kswapd Awake                 0.07%     0.00%     0.00%

I guess what was really interesting was that even though raising the
dirty ratio allowed the test to complete faster, the time stalled in
direct reclaim increased quite a lot. Again, just stopping
writeback in direct reclaim seemed to help.

> It would be interesting to code up a little test patch though, see if
> there's benefit to be had going down this path.
> 

I'll do this just to see what it looks like. To be frank, I lack taste when
it comes to how the block layer and filesystem should behave so I am having
trouble deciding if sorting the pages prior to submission is a good thing or
if it would just encourage bad or lax behaviour in the IO submission queueing.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-14 21:55     ` Rik van Riel
@ 2010-06-15 11:45       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 11:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Mon, Jun 14, 2010 at 05:55:51PM -0400, Rik van Riel wrote:
> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4856a2a..574e816 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>>   	return PAGE_SUCCESS;
>>   }
>>
>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>> +{
>> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
>> +}
>> +
>
> I'm not entirely convinced on this bit, but am willing to
> be convinced by the data.
>

Which bit?

You're not convinced that kswapd should be allowed to write back?
You're not convinced that memcg should be allowed to write back?
You're not convinced that direct reclaim writing back pages can overflow
	the stack?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2
  2010-06-14 15:10   ` Christoph Hellwig
@ 2010-06-15 11:45     ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 11:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrew Morton

On Mon, Jun 14, 2010 at 11:10:11AM -0400, Christoph Hellwig wrote:
> On Mon, Jun 14, 2010 at 12:17:41PM +0100, Mel Gorman wrote:
> > This is a merging of two series - the first of which reduces stack usage
> > in page reclaim and the second which writes contiguous pages during reclaim
> > and avoids writeback in direct reclaimers.
> 
> This stuff looks good to me from the filesystem POV.
> 
> You might want to throw in a follow on patch to remove the PF_MEMALLOC
> checks from the various ->writepage methods.
> 

Will do, thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2
  2010-06-15  0:08   ` KAMEZAWA Hiroyuki
@ 2010-06-15 11:49     ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 11:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	Andrew Morton

On Tue, Jun 15, 2010 at 09:08:33AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 14 Jun 2010 12:17:41 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > SysBench
> > ========
> >                 traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
> >            1 11025.01 ( 0.00%) 10249.52 (-7.57%) 10430.57 (-5.70%)
> >            2  3844.63 ( 0.00%)  4988.95 (22.94%)  4038.95 ( 4.81%)
> >            3  3210.23 ( 0.00%)  2918.52 (-9.99%)  3113.38 (-3.11%)
> >            4  1958.91 ( 0.00%)  1987.69 ( 1.45%)  1808.37 (-8.32%)
> >            5  2864.92 ( 0.00%)  3126.13 ( 8.36%)  2355.70 (-21.62%)
> >            6  4831.63 ( 0.00%)  3815.67 (-26.63%)  4164.09 (-16.03%)
> >            7  3788.37 ( 0.00%)  3140.39 (-20.63%)  3471.36 (-9.13%)
> >            8  2293.61 ( 0.00%)  1636.87 (-40.12%)  1754.25 (-30.75%)
> > FTrace Reclaim Statistics
> >                                      traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
> > Direct reclaims                               9843      13398      51651 
> > Direct reclaim pages scanned                871367    1008709    3080593 
> > Direct reclaim write async I/O               24883      30699          0 
> > Direct reclaim write sync I/O                    0          0          0 
> 
> Hmm, page-scan and reclaims jumps up but...
> 

It could be accounted for by the fact that the direct reclaimers are
stalled less in direct reclaim. They make more forward progress, need
more pages and so end up scanning more as a result.

> 
> > User/Sys Time Running Test (seconds)        734.52    712.39     703.9
> > Percentage Time Spent Direct Reclaim         0.00%     0.00%     0.00%
> > Total Elapsed Time (seconds)               9710.02   9589.20   9334.45
> > Percentage Time kswapd Awake                 0.06%     0.00%     0.00%
> > 
> 
> Execution time is reduced. Does this show that removing "I/O noise" by direct
> reclaim makes the system happy? Or does writeback in direct reclaim cost
> us too much?
> 

I think it's accounted for by just making more forward progress rather than
by reduced IO noise. The throughput results for sysbench are all over the
place because the disk was maxed out, so I'm shying away from drawing any
conclusions about the IO efficiency.

> It seems I'll have to consider avoiding direct reclaim in memcg later.
> 
> BTW, I think we'll have to add a wait-for-pages-to-be-cleaned trick in
> direct reclaim later if we want to avoid too much scanning.
> 

This happens for lumpy reclaim. I didn't think it was justified for
normal reclaim based on the percentage of dirty pages encountered during
scanning. If the percentage of dirty pages scanned increases, we'll need to
first figure out why that happened and then decide whether stalling when they
are encountered is the correct thing to do.

> 
> Thank you for interesting test.
> 

You're welcome.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 11:43                 ` Mel Gorman
@ 2010-06-15 13:07                   ` tytso
  -1 siblings, 0 replies; 198+ messages in thread
From: tytso @ 2010-06-15 13:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Dave Chinner, linux-kernel, linux-fsdevel,
	linux-mm, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 12:43:42PM +0100, Mel Gorman wrote:
> 
> I'll do this just to see what it looks like. To be frank, I lack
> taste when it comes to how the block layer and filesystem should
> behave so I am having trouble deciding if sorting the pages prior to
> submission is a good thing or if it would just encourage bad or lax
> behaviour in the IO submission queueing.
> 

I suspect the right answer is we need to sort both at the block layer
and either (a) before you pass things to the filesystem layer, or if
you don't do that (b) the filesystem will be forced to do its own
queuing/sorting at the very least for delayed allocation pages, so the
allocator can do something sane.  And given that there are multiple
file systems that support delayed allocation, it would be nice if this
could be recognized by the writeback code, as opposed to having btrfs,
xfs, ext4, all having to implement something very similar at the fs
layer.

							- Ted

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [patch] mm: vmscan fix mapping use after free
  2010-06-15  5:43                     ` Nick Piggin
@ 2010-06-15 13:23                       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 13:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Rik van Riel, Dave Chinner, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Johannes Weiner,
	Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 03:43:44PM +1000, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 03:12:42PM +1000, Nick Piggin wrote:
> > > Once that page is unlocked, we can't touch *mapping - its inode can be
> > > concurrently reclaimed.  Although I guess the technique in
> > > handle_write_error() can be reused.
> > 
> > Nasty. That guy needs to be using lock_page_nosync().
> --
> 
> Need lock_page_nosync here because we have no reference to the mapping when
> taking the page lock.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> 

Thanks, I've picked this up and merged it into the series and removed
the "hand grenade" comment.

> ---
>  mm/vmscan.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -296,7 +296,7 @@ static int may_write_to_queue(struct bac
>  static void handle_write_error(struct address_space *mapping,
>  				struct page *page, int error)
>  {
> -	lock_page(page);
> +	lock_page_nosync(page);
>  	if (page_mapping(page) == mapping)
>  		mapping_set_error(mapping, error);
>  	unlock_page(page);
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 11:01             ` Christoph Hellwig
@ 2010-06-15 13:32               ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-15 13:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Nick Piggin,
	Johannes Weiner, KAMEZAWA Hiroyuki

On 06/15/2010 07:01 AM, Christoph Hellwig wrote:
> On Mon, Jun 14, 2010 at 09:16:29PM -0400, Rik van Riel wrote:
>>> Besides, there really isn't the right context in the block layer to
>>> be able to queue and prioritise large amounts of IO without
>>> significant penalties to some higher layer operation.
>>
>> Can we kick flushing for the whole inode at once from
>> vmscan.c?
>
> kswapd really should be a last effort tool to clean filesystem pages.
> If it does enough I/O for this to matter significantly we need to
> fix the VM to move more work to the flusher threads instead of trying
> to fix kswapd.
>
>> Would it be hard to add a "please flush this file"
>> way to call the filesystem flushing threads?
>
> We already have that API, in Jens' latest tree that's
> sync_inodes_sb/writeback_inodes_sb.  We could also add a non-waiting
> variant if required, but I think the big problem with kswapd is that
> we want to wait on I/O completion under some circumstances.

However, kswapd does not need to wait on I/O completion of
any page in particular - it just wants to wait on I/O
completion of any inactive pages in the zone (or memcg)
where memory is being freed.
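
To make that a bit more concrete, the sort of hook I have in mind would look
something like the sketch below. It is untested and reclaim_kick_flusher() is
made up for illustration; writeback_inodes_sb() is also per-superblock and may
block, so a real version would need a non-waiting, per-inode variant:

	static void reclaim_kick_flusher(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);

		if (!mapping || !mapping->host)
			return;

		/*
		 * Hypothetical: ask the flusher threads to clean around this
		 * inode instead of calling ->writepage from reclaim.
		 * writeback_inodes_sb() is the closest existing interface but
		 * it covers the whole superblock and may wait, so this only
		 * illustrates the idea.
		 */
		writeback_inodes_sb(mapping->host->i_sb);
	}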

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 11:45       ` Mel Gorman
@ 2010-06-15 13:34         ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-15 13:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/15/2010 07:45 AM, Mel Gorman wrote:
> On Mon, Jun 14, 2010 at 05:55:51PM -0400, Rik van Riel wrote:
>> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 4856a2a..574e816 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>>>    	return PAGE_SUCCESS;
>>>    }
>>>
>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>> +{
>>> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
>>> +}
>>> +
>>
>> I'm not entirely convinced on this bit, but am willing to
>> be convinced by the data.
>>
>
> Which bit?
>
> You're not convinced that kswapd should be allowed to write back?
> You're not convinced that memcg should be allowed to write back?
> You're not convinced that direct reclaim writing back pages can overflow
> 	the stack?

If direct reclaim can overflow the stack, so can direct
memcg reclaim.  That means this patch does not solve the
stack overflow, while admitting that we do need the
ability to get specific pages flushed to disk from the
pageout code.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 13:34         ` Rik van Riel
@ 2010-06-15 13:37           ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-15 13:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> If direct reclaim can overflow the stack, so can direct
> memcg reclaim.  That means this patch does not solve the
> stack overflow, while admitting that we do need the
> ability to get specific pages flushed to disk from the
> pageout code.

Can you explain what the hell memcg reclaim is and why it needs
to reclaim from random contexts?
Everything that has a cg in its name that I've stumbled over
lately seems to be some ugly wart..


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 13:37           ` Christoph Hellwig
@ 2010-06-15 13:54             ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 13:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rik van Riel, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	KAMEZAWA Hiroyuki, Andrew Morton

On Tue, Jun 15, 2010 at 09:37:27AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> > If direct reclaim can overflow the stack, so can direct
> > memcg reclaim.  That means this patch does not solve the
> > stack overflow, while admitting that we do need the
> > ability to get specific pages flushed to disk from the
> > pageout code.
> 
> Can you explain what the hell memcg reclaim is and why it needs
> to reclaim from random contexts?

Kamezawa Hiroyuki has the full story here but here is a summary.

memcg is the Memory Controller cgroup
(Documentation/cgroups/memory.txt). It's intended for the control of the
amount of memory usable by a group of processes but its behaviour in
terms of reclaim differs from global reclaim. It has its own LRU lists
and kswapd operates on them. What is surprising is that direct reclaim
for a process in the control group also does not operate within the
cgroup.

Reclaim from a cgroup happens from the fault path. The new page is
"charged" to the cgroup. If it exceeds its allocated resources, some
pages within the group are reclaimed in a path that is similar to direct
reclaim except for its entry point.

So, memcg is not reclaiming from a random context, there is a limited
number of cases where a memcg is reclaiming and it is not expected to
overflow the stack.
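
To put it another way, the reclaim entry points differ roughly as follows
(a simplified sketch from memory, not the exact call chains):

	/*
	 * Global direct reclaim:  alloc_pages()
	 *                           -> try_to_free_pages()            sc.mem_cgroup == NULL
	 * kswapd:                 kswapd()
	 *                           -> balance_pgdat()                sc.mem_cgroup == NULL
	 * memcg charge path:      fault -> __mem_cgroup_try_charge()
	 *                           -> mem_cgroup_hierarchical_reclaim()
	 *                           -> try_to_free_mem_cgroup_pages() sc.mem_cgroup != NULL
	 *
	 * so reclaim_can_writeback() only refuses writeback in the first case,
	 * where the stack already carries the whole allocation path.
	 */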

> Everything that has a cg in its name that I've stumbled over
> lately seems to be some ugly wart..
> 

The wart in this case is that the behaviour of page reclaim within a
memcg and globally differ a fair bit.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 13:34         ` Rik van Riel
@ 2010-06-15 13:59           ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 13:59 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> On 06/15/2010 07:45 AM, Mel Gorman wrote:
>> On Mon, Jun 14, 2010 at 05:55:51PM -0400, Rik van Riel wrote:
>>> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 4856a2a..574e816 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>>>>    	return PAGE_SUCCESS;
>>>>    }
>>>>
>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>>> +{
>>>> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
>>>> +}
>>>> +
>>>
>>> I'm not entirely convinced on this bit, but am willing to
>>> be convinced by the data.
>>>
>>
>> Which bit?
>>
>> You're not convinced that kswapd should be allowed to write back?
>> You're not convinced that memcg should be allowed to write back?
>> You're not convinced that direct reclaim writing back pages can overflow
>> 	the stack?
>
> If direct reclaim can overflow the stack, so can direct
> memcg reclaim.  That means this patch does not solve the
> stack overflow, while admitting that we do need the
> ability to get specific pages flushed to disk from the
> pageout code.
>

What path is taken with memcg != NULL that could overflow the stack? I
couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
its use cases is tricky. The critical path I had in mind though was
direct reclaim and for that path, memcg == NULL or did I miss something?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 13:37           ` Christoph Hellwig
@ 2010-06-15 14:02             ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-15 14:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrew Morton

On 06/15/2010 09:37 AM, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
>> If direct reclaim can overflow the stack, so can direct
>> memcg reclaim.  That means this patch does not solve the
>> stack overflow, while admitting that we do need the
>> ability to get specific pages flushed to disk from the
>> pageout code.
>
> Can you explain what the hell memcg reclaim is and why it needs
> to reclaim from random contexts?

The page fault code will call the cgroup accounting code.

When a cgroup goes over its memory limit, __mem_cgroup_try_charge
will call mem_cgroup_hierarchical_reclaim, which will then go
into the page reclaim code.

> Everything that has a cg in its name that I've stumbled over
> lately seems to be some ugly wart..

No argument there.  It took me a few minutes to find the code
path above :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 13:59           ` Mel Gorman
@ 2010-06-15 14:04             ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-15 14:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On 06/15/2010 09:59 AM, Mel Gorman wrote:
> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
>> On 06/15/2010 07:45 AM, Mel Gorman wrote:

>>>>>
>>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>>>> +{
>>>>> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
>>>>> +}

>> If direct reclaim can overflow the stack, so can direct
>> memcg reclaim.  That means this patch does not solve the
>> stack overflow, while admitting that we do need the
>> ability to get specific pages flushed to disk from the
>> pageout code.
>>
>
> What path is taken with memcg != NULL that could overflow the stack? I
> couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
> its use cases is tricky. The critical path I had in mind though was
> direct reclaim and for that path, memcg == NULL or did I miss something?

mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 14:04             ` Rik van Riel
@ 2010-06-15 14:16               ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 14:16 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki, Andrew Morton

On Tue, Jun 15, 2010 at 10:04:24AM -0400, Rik van Riel wrote:
> On 06/15/2010 09:59 AM, Mel Gorman wrote:
>> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
>>> On 06/15/2010 07:45 AM, Mel Gorman wrote:
>
>>>>>>
>>>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>>>>> +{
>>>>>> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
>>>>>> +}
>
>>> If direct reclaim can overflow the stack, so can direct
>>> memcg reclaim.  That means this patch does not solve the
>>> stack overflow, while admitting that we do need the
>>> ability to get specific pages flushed to disk from the
>>> pageout code.
>>>
>>
>> What path is taken with memcg != NULL that could overflow the stack? I
>> couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
>> its use cases is tricky. The critical path I had in mind though was
>> direct reclaim and for that path, memcg == NULL or did I miss something?
>
> mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages
>

But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
reclaim? It appears to be only called from the fault path or as a result
of the memcg changing size.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 11:43                 ` Mel Gorman
@ 2010-06-15 15:44                   ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 15:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, linux-mm, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 12:43:42PM +0100, Mel Gorman wrote:
> <SNIP on whether sorting should be in page cache or block layer>
> 
> > It would be interesting to code up a little test patch though, see if
> > there's benefit to be had going down this path.
> > 
> 
> I'll do this just to see what it looks like. To be frank, I lack taste when
> it comes to how the block layer and filesystem should behave so I am having
> trouble deciding if sorting the pages prior to submission is a good thing or
> if it would just encourage bad or lax behaviour in the IO submission queueing.
> 

The patch to sort the list being cleaned by reclaim looks like this.
It's not actually tested.

vmscan: Sort pages being queued for IO before submitting to the filesystem

While page reclaim submits dirty pages in batch, it doesn't change the
order in which the IO is issued - it is still issued in LRU order. Given
that they are issued in a short period of time now, rather than across a
longer scan period, it is likely that it will not be any faster as:

        a) IO will not be started as soon, and
        b) the IO scheduler still only has a small re-ordering
           window and will choke just as much on random IO patterns.

This patch uses the list_sort() function to sort the list of pages by
mapping and by page->index within the mapping, so that all the pages on
each mapping are sent down in ascending offset order at once - exactly how
the filesystems want IO to be sent to them.

Credit mostly goes to Dave Chinner for this idea and the changelog text.

----

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 68b3d22..02ab246 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -32,6 +32,7 @@
 #include <linux/topology.h>
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
+#include <linux/list_sort.h>
 #include <linux/notifier.h>
 #include <linux/rwsem.h>
 #include <linux/delay.h>
@@ -651,6 +652,32 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 		__pagevec_free(&freed_pvec);
 }
 
+/* Sort based on mapping then index */
+static int page_writeback_cmp(void *data, struct list_head *a, struct list_head *b)
+{
+	struct page *ap = list_entry(a, struct page, lru);
+	struct page *bp = list_entry(b, struct page, lru);
+
+	/*
+	 * Page not locked but it's not critical, the mapping is just for sorting
+	 * If the mapping is no longer valid, it's of little consequence
+	 */
+	if (ap->mapping != bp->mapping) {
+		if (ap->mapping < bp->mapping)
+			return -1;
+		if (ap->mapping > bp->mapping)
+			return 1;
+		return 0;
+	}
+
+	/* Then index; pgoff_t is unsigned so compare rather than subtract */
+	if (ap->index < bp->index)
+		return -1;
+	if (ap->index > bp->index)
+		return 1;
+	return 0;
+}
+
 static noinline_for_stack void clean_page_list(struct list_head *page_list,
 				struct scan_control *sc)
 {
@@ -660,6 +687,8 @@ static noinline_for_stack void clean_page_list(struct list_head *page_list,
 	if (!sc->may_writepage)
 		return;
 
+	list_sort(NULL, page_list, page_writeback_cmp);
+
 	/* Write the pages out to disk in ranges where possible */
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;

^ permalink raw reply related	[flat|nested] 198+ messages in thread

* Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-15 10:47     ` Christoph Hellwig
@ 2010-06-15 15:56       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-15 15:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki,
	Andrew Morton

On Tue, Jun 15, 2010 at 06:47:01AM -0400, Christoph Hellwig wrote:
> On Mon, Jun 14, 2010 at 12:17:49PM +0100, Mel Gorman wrote:
> >  /*
> > + * TODO: Try merging with migrations version of putback_lru_pages
> > + */
> > +static noinline_for_stack void putback_lru_pages(struct zone *zone,
> > +				struct zone_reclaim_stat *reclaim_stat,
> > +				unsigned long nr_anon, unsigned long nr_file,
> > + 				struct list_head *page_list)
> > +{
> 
> I hate to nitpick on this, but with noinline_for_stack a prototype
> really is unreadable unless the function identifier goes to the next
> line.  Compare the one above with:
> 
> static noinline_for_stack void
> putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
> 		unsigned long nr_anon, unsigned long nr_file,
> 		struct list_head *page_list)
> 
> > -static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > +static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> >  			struct zone *zone, struct scan_control *sc,
> >  			int priority, int file)
> 
> Same here, just even worse due to the spill over 80 characters.
> 

It's a fair nitpick and it has been cleaned up.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 10:55                     ` Nick Piggin
@ 2010-06-15 23:20                       ` Dave Chinner
  -1 siblings, 0 replies; 198+ messages in thread
From: Dave Chinner @ 2010-06-15 23:20 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Evgeniy Polyakov, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> > On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner (david@fromorbit.com) wrote:
> > Per-mapping sorting will not do anything good in this case, even if
> > files were previously created in a good fashion, being placed closely and
> > so on, and only block layer will find a correlation between adjacent
> > blocks in different files. But with existing queue management it has
> > quite a small opportunity, and that's what I think Andrew is arguing
> > about.
> 
> The solution is not to sort pages on their way to be submitted either,
> really.
> 
> What I do in fsblock is to maintain a block-nr sorted tree of dirty
> blocks. This works nicely because fsblock dirty state is properly
> synchronized with page dirty state.

How does this work with delayed allocation where there is no block
number associated with the page until writeback calls the allocation
routine?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 14:16               ` Mel Gorman
@ 2010-06-16  0:17                 ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-16  0:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Christoph Hellwig, Andrew Morton

On Tue, 15 Jun 2010 15:16:01 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Jun 15, 2010 at 10:04:24AM -0400, Rik van Riel wrote:
> > On 06/15/2010 09:59 AM, Mel Gorman wrote:
> >> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> >>> On 06/15/2010 07:45 AM, Mel Gorman wrote:
> >
> >>>>>>
> >>>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
> >>>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
> >>>>>> +{
> >>>>>> +	return current_is_kswapd() || sc->mem_cgroup != NULL;
> >>>>>> +}
> >
> >>> If direct reclaim can overflow the stack, so can direct
> >>> memcg reclaim.  That means this patch does not solve the
> >>> stack overflow, while admitting that we do need the
> >>> ability to get specific pages flushed to disk from the
> >>> pageout code.
> >>>
> >>
> >> What path is taken with memcg != NULL that could overflow the stack? I
> >> couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
> >> its use cases is tricky. The critical path I had in mind though was
> >> direct reclaim and for that path, memcg == NULL or did I miss something?
> >
> > mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages
> >
> 
> But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
> reclaim? It appears to be only called from the fault path or as a result
> of the memcg changing size.
> 
yes. It's only called from 
	- page fault
	- add_to_page_cache()

I think we'll see no stack problem. Right now, memcg doesn't wake up kswapd
for reclaiming memory, so it needs direct writeback.
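
For reference, the add_to_page_cache() path that can end up in memcg reclaim
looks roughly like this (paraphrased, not an exact quote of the source):

  add_to_page_cache_locked()
    -> mem_cgroup_cache_charge()
      -> __mem_cgroup_try_charge()            /* charge exceeds the limit */
        -> mem_cgroup_hierarchical_reclaim()
          -> try_to_free_mem_cgroup_pages()
            -> do_try_to_free_pages()         /* the usual shrink_page_list() path */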

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  0:17                 ` KAMEZAWA Hiroyuki
@ 2010-06-16  0:29                   ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-16  0:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	Andrew Morton

On 06/15/2010 08:17 PM, KAMEZAWA Hiroyuki wrote:
> On Tue, 15 Jun 2010 15:16:01 +0100
> Mel Gorman<mel@csn.ul.ie>  wrote:

>> But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
>> reclaim? It appears to be only called from the fault path or as a result
>> of the memcg changing size.
>>
> yes. It's only called from
> 	- page fault
> 	- add_to_page_cache()
>
> I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> reclaiming memory, it needs direct writeback.

Of course, a memcg page fault could still be triggered
from copy_to_user or copy_from_user, with a fairly
arbitrary stack frame above...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-15 13:54             ` Mel Gorman
@ 2010-06-16  0:30               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-16  0:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Hellwig, Rik van Riel, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Johannes Weiner, Andrew Morton

On Tue, 15 Jun 2010 14:54:08 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Tue, Jun 15, 2010 at 09:37:27AM -0400, Christoph Hellwig wrote:
> > On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> > > If direct reclaim can overflow the stack, so can direct
> > > memcg reclaim.  That means this patch does not solve the
> > > stack overflow, while admitting that we do need the
> > > ability to get specific pages flushed to disk from the
> > > pageout code.
> > 
> > Can you explain what the hell memcg reclaim is and why it needs
> > to reclaim from random contexts?
> 
> Kamezawa Hiroyuki has the full story here but here is a summary.
> 
Thank you.

> memcg is the Memory Controller cgroup
> (Documentation/cgroups/memory.txt). It's intended for the control of the
> amount of memory usable by a group of processes but its behaviour in
> terms of reclaim differs from global reclaim. It has its own LRU lists
> and kswapd operates on them.

No, we don't use kswapd. But we have some hooks in kswapd for implementing
the soft limit. The soft limit gives kswapd a hint, "please reclaim memory
from this memcg", when global memory is exhausted and kswapd runs.

What a memcg uses when it hits its limit is just direct reclaim.
(*) Justifying a kswapd thread using cpu just because a memcg hits its limit
    is difficult for me, so I haven't used kswapd for this until now.
    When direct reclaim is used, the cost of reclaim is charged against
    the cpu cgroup the thread belongs to.


> What is surprising is that direct reclaim
> for a process in the control group also does not operate within the
> cgroup.
Sorry, I can't understand ....

> 
> Reclaim from a cgroup happens from the fault path. The new page is
> "charged" to the cgroup. If it exceeds its allocated resources, some
> pages within the group are reclaimed in a path that is similar to direct
> reclaim except for its entry point.
> 
yes.

> So, memcg is not reclaiming from a random context, there is a limited
> number of cases where a memcg is reclaiming and it is not expected to
> overflow the stack.
> 

I think so. In particular, we'll never see the 1k of stack that select() uses here.

> > It seems everything that has a cg in it's name that I stumbled over
> > lately seems to be some ugly wart..
> > 
> 
> The wart in this case is that the behaviour of page reclaim within a
> memcg and globally differ a fair bit.
> 

Sorry. But there has been a very long story behind reaching the current
implementation. But don't worry, if memcg is not activated (not mounted),
it doesn't affect the behavior of processes ;)

But Hmm..

>[kamezawa@bluextal mmotm-2.6.35-0611]$ wc -l mm/memcontrol.c
>4705 mm/memcontrol.c

may need some diet :(


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  0:29                   ` Rik van Riel
@ 2010-06-16  0:39                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-16  0:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	Andrew Morton

On Tue, 15 Jun 2010 20:29:49 -0400
Rik van Riel <riel@redhat.com> wrote:

> On 06/15/2010 08:17 PM, KAMEZAWA Hiroyuki wrote:
> > On Tue, 15 Jun 2010 15:16:01 +0100
> > Mel Gorman<mel@csn.ul.ie>  wrote:
> 
> >> But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
> >> reclaim? It appears to be only called from the fault path or as a result
> >> of the memcg changing size.
> >>
> > yes. It's only called from
> > 	- page fault
> > 	- add_to_page_cache()
> >
> > I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> > reclaiming memory, it needs direct writeback.
> 
> Of course, a memcg page fault could still be triggered
> from copy_to_user or copy_from_user, with a fairly
> arbitrary stack frame above...
> 

Hmm. But I don't expect copy_from/to_user to be called on a very deep stack.

Should I prepare a thread for reclaiming memcg pages?
Because we shouldn't limit kswapd's cpu time via the CFS cgroup, waking up
kswapd just because "a memcg hit its limit" isn't fun.

Hmm, or do you recommend no dirty-page writeback when a memcg hits its limit?
Maybe we'll see much more swapping.

I want to go with this for a while; changing memcg's behavior will take
some amount of time, and there are only a few developers.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  0:39                     ` KAMEZAWA Hiroyuki
@ 2010-06-16  0:53                       ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-16  0:53 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	Andrew Morton

On 06/15/2010 08:39 PM, KAMEZAWA Hiroyuki wrote:

> Hmm, or do you recommend no-dirty-page-writeback when a memcg hits limit ?
> Maybe we'll see much swaps.
>
> I want to go with this for a while, changing memcg's behavior will took
> some amounts of time, there are only a few developpers.

One thing we can do, for kswapd, memcg and direct reclaim alike,
is to tell the flusher threads to flush pages related to a pageout
candidate page to disk.

That way the reclaiming processes can wait on some disk IO to
finish, while the flusher thread takes care of the actual flushing.

That should also fix the "kswapd filesystem IO has really poor IO
patterns" issue.

There's no reason not to fix this issue the right way.
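
Roughly, the hook being described might look like the sketch below. This is
illustration only and untested; flusher_writeback_around() is a made-up name
standing in for whatever interface the flusher threads would grow:

/* Sketch: ask the flusher to clean around a pageout candidate */
static void reclaim_kick_flusher(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	if (!mapping || !mapping->backing_dev_info)
		return;

	/*
	 * Hypothetical call: wake the per-bdi flusher thread and hint
	 * that it should write back pages near page->index, instead of
	 * reclaim issuing ->writepage() from its own (deep) stack.
	 */
	flusher_writeback_around(mapping->backing_dev_info,
				 mapping, page->index);
}

Reclaim would then throttle on the resulting IO (congestion_wait() or
similar) rather than issuing it itself, which also lets the flusher submit
the IO in sensible offset order.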

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  0:53                       ` Rik van Riel
@ 2010-06-16  1:40                         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-16  1:40 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	Andrew Morton

On Tue, 15 Jun 2010 20:53:43 -0400
Rik van Riel <riel@redhat.com> wrote:

> On 06/15/2010 08:39 PM, KAMEZAWA Hiroyuki wrote:
> 
> > Hmm, or do you recommend no-dirty-page-writeback when a memcg hits limit ?
> > Maybe we'll see much swaps.
> >
> > I want to go with this for a while, changing memcg's behavior will took
> > some amounts of time, there are only a few developpers.
> 
> One thing we can do, for kswapd, memcg and direct reclaim alike,
> is to tell the flusher threads to flush pages related to a pageout
> candidate page to disk.
> 
> That way the reclaiming processes can wait on some disk IO to
> finish, while the flusher thread takes care of the actual flushing.
> 
> That should also fix the "kswapd filesystem IO has really poor IO
> patterns" issue.
> 
> There's no reason not to fix this issue the right way.
> 
yes. But this patch just stops writeback. I think it's sane to ask
not to change behavior until there are some useful changes in the flusher
threads.

IMO, until the flusher threads can work with the I/O cgroup, memcg shouldn't
depend on them, because writeback allows stealing resources without it.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  1:40                         ` KAMEZAWA Hiroyuki
@ 2010-06-16  2:20                           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-16  2:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Christoph Hellwig, Andrew Morton

On Wed, 16 Jun 2010 10:40:36 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 15 Jun 2010 20:53:43 -0400
> Rik van Riel <riel@redhat.com> wrote:
> 
> > On 06/15/2010 08:39 PM, KAMEZAWA Hiroyuki wrote:
> > 
> > > Hmm, or do you recommend no-dirty-page-writeback when a memcg hits limit ?
> > > Maybe we'll see much swaps.
> > >
> > > I want to go with this for a while, changing memcg's behavior will took
> > > some amounts of time, there are only a few developpers.
> > 
> > One thing we can do, for kswapd, memcg and direct reclaim alike,
> > is to tell the flusher threads to flush pages related to a pageout
> > candidate page to disk.
> > 
> > That way the reclaiming processes can wait on some disk IO to
> > finish, while the flusher thread takes care of the actual flushing.
> > 
> > That should also fix the "kswapd filesystem IO has really poor IO
> > patterns" issue.
> > 
> > There's no reason not to fix this issue the right way.
> > 
> yes. but this patch just stops writeback. I think it's sane to ask
> not to change behavior until there are some useful changes in flusher
> threads.
> 
> IMO, until flusher threads can work with I/O cgroup, memcg shoudln't
> depend on it because writeback allows stealing resource without it.
> 

BTW, copy_from_user/copy_to_user is a _real_ problem; I'm afraid the following
uses much more stack than memcg.

handle_mm_fault()
-> handle_pte_fault()
-> do_wp_page()
-> balance_dirty_page_rate_limited()
-> balance_dirty_pages()
-> writeback_inodes_wbc()
-> writeback_inodes_wb()
-> writeback_sb_inodes()
-> writeback_single_inode()
-> do_writepages()
-> generic_write_pages()
-> write_cache_pages()   // use on-stack pagevec.
-> writepage()

This is maybe much more stack-consuming than memcg->writeback after the vmscan.c diet.

Bye.
-Kame
^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  0:17                 ` KAMEZAWA Hiroyuki
@ 2010-06-16  5:06                   ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-16  5:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Rik van Riel, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Christoph Hellwig, Andrew Morton

On Wed, Jun 16, 2010 at 09:17:55AM +0900, KAMEZAWA Hiroyuki wrote:
> yes. It's only called from 
> 	- page fault
> 	- add_to_page_cache()
> 
> I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> reclaiming memory, it needs direct writeback.

The page fault code should be fine, but add_to_page_cache can be called
with quite deep stacks.  Two examples are grab_cache_page_write_begin
which already was part of one of the stack overflows mentioned in this
thread, or find_or_create_page which can be called via
_xfs_buf_lookup_pages, which can be called from under the whole XFS
allocator, or via grow_dev_page which might have a similarly deep
stack for users of the normal buffer cache.  Although for the
find_or_create_page we usually should not have __GFP_FS set in the
gfp_mask.
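
For illustration, the usual way such a caller keeps reclaim from recursing
into the filesystem is to mask the flag off the mapping's allocation mask,
along these lines (a generic example, not a quote of the XFS or buffer
cache code):

	gfp_t gfp = mapping_gfp_mask(mapping) & ~__GFP_FS;
	struct page *page;

	page = find_or_create_page(mapping, index, gfp);
	if (!page)
		return -ENOMEM;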


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  0:39                     ` KAMEZAWA Hiroyuki
@ 2010-06-16  5:07                       ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-16  5:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Christoph Hellwig, Andrew Morton

On Wed, Jun 16, 2010 at 09:39:58AM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm. But I don't expect copy_from/to_user is called in very deep stack.

Actually it is.  The poll code mentioned earlier in this thread is just
one nasty example.  I'm pretty sure there are tons of others in ioctl
code, as various ioctl implementations have been found to be massive
stack hogs in the past, even worse for out of tree drivers.


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  2:20                           ` KAMEZAWA Hiroyuki
@ 2010-06-16  5:11                             ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-16  5:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Rik van Riel, Mel Gorman, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Christoph Hellwig, Andrew Morton, axboe

On Wed, Jun 16, 2010 at 11:20:24AM +0900, KAMEZAWA Hiroyuki wrote:
> BTW, copy_from_user/copy_to_user is _real_ problem, I'm afraid following
> much more than memcg.
> 
> handle_mm_fault()
> -> handle_pte_fault()
> -> do_wp_page()
> -> balance_dirty_page_rate_limited()
> -> balance_dirty_pages()
> -> writeback_inodes_wbc()
> -> writeback_inodes_wb()
> -> writeback_sb_inodes()
> -> writeback_single_inode()
> -> do_writepages()
> -> generic_write_pages()
> -> write_cache_pages()   // use on-stack pagevec.
> -> writepage()

Yes, this is a massive issue.  Strangely enough I just wondered about
this callstack as balance_dirty_pages is the only place calling into the
per-bdi/sb writeback code directly instead of offloading it to the
flusher threads.  It's something that should be fixed rather quickly
IMHO.  write_cache_pages and other bits of this writeback code can use
quite large amounts of stack.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 11/12] vmscan: Write out dirty pages in batch
  2010-06-15 23:20                       ` Dave Chinner
@ 2010-06-16  6:04                         ` Nick Piggin
  -1 siblings, 0 replies; 198+ messages in thread
From: Nick Piggin @ 2010-06-16  6:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Evgeniy Polyakov, Andrew Morton, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Chris Mason, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Wed, Jun 16, 2010 at 09:20:09AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> > On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> > > On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner (david@fromorbit.com) wrote:
> > > Per-mapping sorting will not do anything good in this case, even if
> > > files were previously created in a good facion being placed closely and
> > > so on, and only block layer will find a correlation between adjacent
> > > blocks in different files. But with existing queue management it has
> > > quite a small opportunity, and that's what I think Andrew is arguing
> > > about.
> > 
> > The solution is not to sort pages on their way to be submitted either,
> > really.
> > 
> > What I do in fsblock is to maintain a block-nr sorted tree of dirty
> > blocks. This works nicely because fsblock dirty state is properly
> > synchronized with page dirty state.
> 
> How does this work with delayed allocation where there is no block
> number associated with the page until writeback calls the allocation
> routine?

It doesn't. I have been thinking about how best to make that work.
The mm/writeback is not in a good position to know what to do, so
the fs would have to help.

So either an fs callback, or the fs would have to insert the blocks
(or some marker) into the tree itself. It's relatively easy to do for
a single file (just walk the radix-tree and do delalloc conversions),
but between multiple files is harder (current code has the same problem
though).
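
For anyone following along, a stripped-down sketch of the kind of block-nr
sorted dirty tree being described - the names are invented and this is not
the actual fsblock code:

struct dirty_block {
	struct rb_node	rb;
	sector_t	block_nr;	/* key: on-disk block number */
};

static void dirty_tree_insert(struct rb_root *root, struct dirty_block *new)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct dirty_block *db = rb_entry(*p, struct dirty_block, rb);

		parent = *p;
		if (new->block_nr < db->block_nr)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&new->rb, parent, p);
	rb_insert_color(&new->rb, root);
}

Writeback would then walk the tree with rb_first()/rb_next() and issue the IO
in ascending block order; as discussed above, delayed allocation is the gap,
since there is no block_nr to key on until the allocation happens at
writeback time.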


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  5:11                             ` Christoph Hellwig
@ 2010-06-16 10:51                               ` Jens Axboe
  -1 siblings, 0 replies; 198+ messages in thread
From: Jens Axboe @ 2010-06-16 10:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Johannes Weiner, Andrew Morton

On 2010-06-16 07:11, Christoph Hellwig wrote:
> On Wed, Jun 16, 2010 at 11:20:24AM +0900, KAMEZAWA Hiroyuki wrote:
>> BTW, copy_from_user/copy_to_user is _real_ problem, I'm afraid following
>> much more than memcg.
>>
>> handle_mm_fault()
>> -> handle_pte_fault()
>> -> do_wp_page()
>> -> balance_dirty_page_rate_limited()
>> -> balance_dirty_pages()
>> -> writeback_inodes_wbc()
>> -> writeback_inodes_wb()
>> -> writeback_sb_inodes()
>> -> writeback_single_inode()
>> -> do_writepages()
>> -> generic_write_pages()
>> -> write_cache_pages()   // use on-stack pagevec.
>> -> writepage()
> 
> Yes, this is a massive issue.  Strangely enough I just wondered about
> this callstack as balance_dirty_pages is the only place calling into the
> per-bdi/sb writeback code directly instead of offloading it to the
> flusher threads.  It's something that should be fixed rather quickly
> IMHO.  write_cache_pages and other bits of this writeback code can use
> quite large amounts of stack.

I've had the same thought as well: balance_dirty_pages() should just signal a
writeback instead. Much cleaner than doing the cleaning from that point.
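
Hand-waving, the shape of it would be something like this - bdi_kick_background()
is a placeholder name for whatever the wakeup ends up being called, and it is
not even compile tested:

	/* in balance_dirty_pages(), once the bdi is over its threshold */
	if (bdi_nr_reclaimable > bdi_thresh) {
		/*
		 * Old: run writeback_inodes_wbc() and the whole writeback
		 * path on the dirtying task's (possibly deep) stack.
		 * New: poke the per-bdi flusher thread and throttle here.
		 */
		bdi_kick_background(bdi);		/* placeholder */
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(HZ / 10);
	}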

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-16 23:37     ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-16 23:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, 14 Jun 2010 12:17:46 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> Since 2.6.28 zone->prev_priority has been unused, so it can be removed
> safely. This reduces stack usage slightly.
> 
> Now I have to say that I'm sorry. Two years ago, I thought prev_priority
> could be integrated again and would be useful, but four (or more) attempts
> haven't produced good performance numbers. Thus I have given up on that approach.

This would have been badder in earlier days when we were using the
scanning priority to decide when to start unmapping pte-mapped pages -
page reclaim would have been recirculating large blobs of mapped pages
around the LRU until the priority had built to the level where we
started to unmap them.

However that priority-based decision got removed and right now I don't
recall what it got replaced with.  Aren't we now unmapping pages way
too early and suffering an increased major&minor fault rate?  Worried.


Things which are still broken after we broke prev_priority:

- If page reclaim is having a lot of trouble, prev_priority would
  have permitted do_try_to_free_pages() to call disable_swap_token()
  earlier on.  As things presently stand, we'll do a lot of
  thrash-detection stuff before (presumably correctly) dropping the
  swap token.

  So.  What's up with that?  I don't even remember _why_ we disable
  the swap token once the scanning priority gets severe and the code
  comments there are risible.  And why do we wait until priority==0
  rather than priority==1?

- Busted prev_priority means that lumpy reclaim will act oddly. 
  Every time someone goes in to do some reclaim, they'll start out not
  doing lumpy reclaim.  Then, after a while, they'll get a clue and
  will start doing the lumpy thing.  Then they return from reclaim and
  the next reclaim caller will again forget that he should have done
  lumpy reclaim.

  I dunno what the effects of this are in the real world, but it
  seems dumb.

And one has to wonder: if we're making these incorrect decisions based
upon a bogus view of the current scanning difficulty, why are these
various priority-based thresholding heuristics even in there?  Are they
doing anything useful?

So..  either we have a load of useless-crap-and-cruft in there which
should be lopped out, or we don't have a load of useless-crap-and-cruft
in there, and we should fix prev_priority.

> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/mmzone.h |   15 ------------
>  mm/page_alloc.c        |    2 -
>  mm/vmscan.c            |   57 ------------------------------------------------
>  mm/vmstat.c            |    2 -

The patch forgot to remove mem_cgroup_get_reclaim_priority() and friends.

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-16 23:43     ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-16 23:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, 14 Jun 2010 12:17:49 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
> uses significant amounts of stack doing this. This patch splits
> shrink_inactive_list() to take the stack usage out of the main path so
> that callers to writepage() do not contain an unused pagevec on the
> stack.

You can get the entire pagevec off the stack - just make it a
static-to-shrink_inactive_list() pagevec-per-cpu.

Locking just requires pinning to a CPU.  We could trivially co-opt
shrink_inactive_list()'s spin_lock_irq() for that, but
pagevec_release() can be relatively expensive so it'd be sad to move
that inside spin_lock_irq().  It'd be better to slap a
get_cpu()/put_cpu() around the whole thing.
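
A minimal sketch of that idea, with invented names (shrink_pvec,
release_unfreeable_pages) and the LRU-lock juggling and LRU putback of the
real shrink_inactive_list() left out; it only shows the per-cpu mechanics:

	/* Hypothetical per-cpu pagevec replacing the on-stack one; the
	 * get_cpu_var()/put_cpu_var() pair pins us to a CPU while the
	 * pagevec is in use, as suggested above.
	 */
	static DEFINE_PER_CPU(struct pagevec, shrink_pvec);

	static void release_unfreeable_pages(struct list_head *page_list)
	{
		struct pagevec *pvec = &get_cpu_var(shrink_pvec);
		struct page *page, *next;

		list_for_each_entry_safe(page, next, page_list, lru) {
			list_del(&page->lru);
			if (!pagevec_add(pvec, page))
				__pagevec_release(pvec);	/* drains when full */
		}
		pagevec_release(pvec);
		put_cpu_var(shrink_pvec);
	}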


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-16 23:37     ` Andrew Morton
@ 2010-06-16 23:45       ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-16 23:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On 06/16/2010 07:37 PM, Andrew Morton wrote:

> This would have been badder in earlier days when we were using the
> scanning priority to decide when to start unmapping pte-mapped pages -
> page reclaim would have been recirculating large blobs of mapped pages
> around the LRU until the priority had built to the level where we
> started to unmap them.
>
> However that priority-based decision got removed and right now I don't
> recall what it got replaced with.  Aren't we now unmapping pages way
> too early and suffering an increased major&minor fault rate?  Worried.

We keep a different set of statistics to decide whether to
reclaim only page cache pages, or both page cache and
anonymous pages. The function get_scan_ratio parses those
statistics.

>    So.  What's up with that?  I don't even remember _why_ we disable
>    the swap token once the scanning priority gets severe and the code
>    comments there are risible.  And why do we wait until priority==0
>    rather than priority==1?

The reason is that we never page out the pages belonging to the
process owning the swap token (with the exception of that process
evicting its own pages).

If that process has a really large RSS in the current zone, and
we are having problems freeing pages, it may be beneficial to
also evict pages from that process.

Now that the LRU lists are split out into file backed and swap
backed, it may be a lot easier to find pages to evict.  That
may mean we could notice we're getting into trouble at much
higher priority levels and disable the swap token at a higher
priority level.

I do not believe prev_priority will be very useful here, since
we'd like to start out with small scans whenever possible.

> - Busted prev_priority means that lumpy reclaim will act oddly.
>    Every time someone goes in to do some reclaim, they'll start out not
>    doing lumpy reclaim.  Then, after a while, they'll get a clue and
>    will start doing the lumpy thing.  Then they return from reclaim and
>    the next reclaim caller will again forget that he should have done
>    lumpy reclaim.

How common are lumpy reclaims, anyway?

Isn't it more likely that in-between every two higher-order
reclaims, a number of order zero reclaims will be happening?

In that case, the prev_priority logic may have introduced the
kind of behavioural bug you describe above...

> And one has to wonder: if we're making these incorrect decisions based
> upon a bogus view of the current scanning difficulty, why are these
> various priority-based thresholding heuristics even in there?  Are they
> doing anything useful?

The prev_priority code was useful when we had filesystem and
swap backed pages mixed on the same LRU list.

I am not convinced it still has any use.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()
  2010-06-14 11:17   ` Mel Gorman
@ 2010-06-16 23:48     ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-16 23:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, 14 Jun 2010 12:17:50 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> shrink_page_list() sets up a pagevec to release pages as they are freed.
> It uses significant amounts of stack on the pagevec. This
> patch adds pages to be freed via pagevec to a linked list which is then
> freed en-masse at the end. This avoids using stack in the main path that
> potentially calls writepage().
> 

hm, spose so.  I can't see any trivial way to eliminate the local
pagevec there.

> +	if (pagevec_count(&freed_pvec))
> +		__pagevec_free(&freed_pvec);
> ...
> -	if (pagevec_count(&freed_pvec))
> -		__pagevec_free(&freed_pvec);

That's an open-coded pagevec_free().
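
For context, the wrapper being referred to is (modulo the exact
include/linux/pagevec.h of the day) simply:

	static inline void pagevec_free(struct pagevec *pvec)
	{
		if (pagevec_count(pvec))
			__pagevec_free(pvec);
	}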

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()
@ 2010-06-16 23:48     ` Andrew Morton
  0 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-16 23:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Mon, 14 Jun 2010 12:17:50 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> shrink_page_list() sets up a pagevec to release pages as according as they
> are free. It uses significant amounts of stack on the pagevec. This
> patch adds pages to be freed via pagevec to a linked list which is then
> freed en-masse at the end. This avoids using stack in the main path that
> potentially calls writepage().
> 

hm, spose so.  I cen't see any trivial way to eliminate the local
pagevec there.

> +	if (pagevec_count(&freed_pvec))
> +		__pagevec_free(&freed_pvec);
> ...
> -	if (pagevec_count(&freed_pvec))
> -		__pagevec_free(&freed_pvec);

That's an open-coded pagevec_free().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-16 23:45       ` Rik van Riel
@ 2010-06-17  0:18         ` Andrew Morton
  -1 siblings, 0 replies; 198+ messages in thread
From: Andrew Morton @ 2010-06-17  0:18 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Wed, 16 Jun 2010 19:45:29 -0400
Rik van Riel <riel@redhat.com> wrote:

> On 06/16/2010 07:37 PM, Andrew Morton wrote:
> 
> > This would have been badder in earlier days when we were using the
> > scanning priority to decide when to start unmapping pte-mapped pages -
> > page reclaim would have been recirculating large blobs of mapped pages
> > around the LRU until the priority had built to the level where we
> > started to unmap them.
> >
> > However that priority-based decision got removed and right now I don't
> > recall what it got replaced with.  Aren't we now unmapping pages way
> > too early and suffering an increased major&minor fault rate?  Worried.
> 
> We keep a different set of statistics to decide whether to
> reclaim only page cache pages, or both page cache and
> anonymous pages. The function get_scan_ratio parses those
> statistics.

I wasn't talking about anon-vs-file.  I was referring to mapped-file
versus not-mapped file.  If the code sees a mapped page come off the
tail of the LRU it'll just unmap and reclaim the thing.  This policy
caused awful amounts of paging activity when someone started doing lots
of read() activity, which is why the VM was changed to value mapped
pagecache higher than unmapped pagecache.  Did this biasing get
retained and if so, how?

> >    So.  What's up with that?  I don't even remember _why_ we disable
> >    the swap token once the scanning priority gets severe and the code
> >    comments there are risible.  And why do we wait until priority==0
> >    rather than priority==1?
> 
> The reason is that we never page out the pages belonging to the
> process owning the swap token (with the exception of that process
> evicting its own pages).
> 
> If that process has a really large RSS in the current zone, and
> we are having problems freeing pages, it may be beneficial to
> also evict pages from that process.
> 
> Now that the LRU lists are split out into file backed and swap
> backed, it may be a lot easier to find pages to evict.  That
> may mean we could notice we're getting into trouble at much
> higher priority levels and disable the swap token at a higher
> priority level.

hm, lots of "may"s there.

Does thrash-avoidance actually still work?

> I do not believe prev_priority will be very useful here, since
> we'd like to start out with small scans whenever possible.

Why?

> > - Busted prev_priority means that lumpy reclaim will act oddly.
> >    Every time someone goes in to do some reclaim, they'll start out not
> >    doing lumpy reclaim.  Then, after a while, they'll get a clue and
> >    will start doing the lumpy thing.  Then they return from reclaim and
> >    the next reclaim caller will again forget that he should have done
> >    lumpy reclaim.
> 
> How common are lumpy reclaims, anyway?

A lot more than they should be, I suspect, given the recent trend
towards asking for higher-order allocations.  Kernel developers should
be prohibited from using more than 512M of RAM.

> Isn't it more likely that in-between every two higher-order
> reclaims, a number of order zero reclaims will be happening?

Sounds likely, yes.   Need prev_priority[DEF_PRIORITY], sigh.

> In that case, the prev_priority logic may have introduced the
> kind of behavioural bug you describe above...
> 
> > And one has to wonder: if we're making these incorrect decisions based
> > upon a bogus view of the current scanning difficulty, why are these
> > various priority-based thresholding heuristics even in there?  Are they
> > doing anything useful?
> 
> The prev_priority code was useful when we had filesystem and
> swap backed pages mixed on the same LRU list.

No, stop saying swap! ;)

It's all to do with mapped pagecache versus unmapped pagecache.  "ytf
does my browser get paged out all the time".




^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-16  5:06                   ` Christoph Hellwig
@ 2010-06-17  0:25                     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-17  0:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, Rik van Riel, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Andrew Morton

On Wed, 16 Jun 2010 01:06:40 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, Jun 16, 2010 at 09:17:55AM +0900, KAMEZAWA Hiroyuki wrote:
> > yes. It's only called from 
> > 	- page fault
> > 	- add_to_page_cache()
> > 
> > I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> > reclaiming memory, it needs direct writeback.
> 
> The page fault code should be fine, but add_to_page_cache can be called
> with quite deep stacks.  Two examples are grab_cache_page_write_begin
> which already was part of one of the stack overflows mentioned in this
> thread, or find_or_create_page which can be called via
> _xfs_buf_lookup_pages, which can be called from under the whole XFS
> allocator, or via grow_dev_page which might have a similarly deep
> stack for users of the normal buffer cache.  Although for the
> find_or_create_page we usually should not have __GFP_FS set in the
> gfp_mask.
> 

Hmm. ok, then, memory cgroup needs some care.

BTW, why does xbf_buf_create() use GFP_KERNEL even if it can be blocked?
Memory cgroup just limits pages for users; it doesn't intend to
limit kernel pages. If this buffer is not for users (visible page cache) but for
internal structures, I'll have to add code for ignoring the memory cgroup check
when gfp_mask doesn't have __GFP_MOVABLE.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-17  0:18         ` Andrew Morton
@ 2010-06-17  0:34           ` Rik van Riel
  -1 siblings, 0 replies; 198+ messages in thread
From: Rik van Riel @ 2010-06-17  0:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner,
	Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On 06/16/2010 08:18 PM, Andrew Morton wrote:
> On Wed, 16 Jun 2010 19:45:29 -0400
> Rik van Riel<riel@redhat.com>  wrote:
>
>> On 06/16/2010 07:37 PM, Andrew Morton wrote:
>>
>>> This would have been badder in earlier days when we were using the
>>> scanning priority to decide when to start unmapping pte-mapped pages -
>>> page reclaim would have been recirculating large blobs of mapped pages
>>> around the LRU until the priority had built to the level where we
>>> started to unmap them.
>>>
>>> However that priority-based decision got removed and right now I don't
>>> recall what it got replaced with.  Aren't we now unmapping pages way
>>> too early and suffering an increased major&minor fault rate?  Worried.
>>
>> We keep a different set of statistics to decide whether to
>> reclaim only page cache pages, or both page cache and
>> anonymous pages. The function get_scan_ratio parses those
>> statistics.
>
> I wasn't talking about anon-vs-file.  I was referring to mapped-file
> versus not-mapped file.  If the code sees a mapped page come off the
> tail of the LRU it'll just unmap and reclaim the thing.  This policy
> caused awful amounts of paging activity when someone started doing lots
> of read() activity, which is why the VM was changed to value mapped
> pagecache higher than unmapped pagecache.  Did this biasing get
> retained and if so, how?

It changed a little, but we still have it:

1) we do not deactivate active file pages if the active file
    list is smaller than the inactive file list - this protects
    the working set from streaming IO

2) we keep mapped referenced executable pages on the active file
    list if they got accessed while on the active list, while
    other file pages get deactivated unconditionally
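
A rough sketch of check 1), simplified and with an approximate helper name
(the real code also has a memcg variant):

	/* Only deactivate, i.e. scan the active file list, once the
	 * inactive file list has become the smaller of the two.
	 */
	static int inactive_file_is_low_sketch(struct zone *zone)
	{
		unsigned long active = zone_page_state(zone, NR_ACTIVE_FILE);
		unsigned long inactive = zone_page_state(zone, NR_INACTIVE_FILE);

		return active > inactive;
	}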

> Does thrash-avoidance actually still work?

I suspect it does, but I have not actually tested that code
in years :)

>> I do not believe prev_priority will be very useful here, since
>> we'd like to start out with small scans whenever possible.
>
> Why?

For one, memory sizes today are a lot larger than they were
when 2.6.0 came out.

Secondly, we now know more exactly what is on each LRU list.
That should greatly reduce unnecessary turnover of the list.

For example, if we know there is no swap space available, we
will not bother scanning the anon LRU lists.

If we know there is not enough file cache left to get us up
to the zone high water mark, we will not bother scanning the
few remaining file pages.

Because of those simple checks (in get_scan_ratio), I do
not expect that we will have to scan through all of memory
as frequently as we had to do in 2.6.0.
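
For reference, those checks look roughly like this (heavily simplified, name
suffixed _sketch to make clear it is not the verbatim source; the
recent_scanned/recent_rotated balancing that follows them is elided):

	static void get_scan_ratio_sketch(struct zone *zone,
					  struct scan_control *sc,
					  unsigned long *percent)
	{
		unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
		unsigned long file = zone_page_state(zone, NR_ACTIVE_FILE) +
				     zone_page_state(zone, NR_INACTIVE_FILE);

		/* No swap space left: do not bother scanning the anon lists. */
		if (!sc->may_swap || nr_swap_pages <= 0) {
			percent[0] = 0;		/* anon */
			percent[1] = 100;	/* file */
			return;
		}

		/* Too little file cache to reach the high watermark: go after anon. */
		if (file + free <= high_wmark_pages(zone)) {
			percent[0] = 100;
			percent[1] = 0;
			return;
		}

		/* ... ratio balancing elided ... */
	}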

Furthermore, we unconditionally deactivate most active pages
and have a working used-once scheme for pages on the anon
lists.  This should also contribute to a reduction in the
number of pages that get scanned.

>> In that case, the prev_priority logic may have introduced the
>> kind of behavioural bug you describe above...
>>
>>> And one has to wonder: if we're making these incorrect decisions based
>>> upon a bogus view of the current scanning difficulty, why are these
>>> various priority-based thresholding heuristics even in there?  Are they
>>> doing anything useful?
>>
>> The prev_priority code was useful when we had filesystem and
>> swap backed pages mixed on the same LRU list.
>
> No, stop saying swap! ;)
>
> It's all to do with mapped pagecache versus unmapped pagecache.  "ytf
> does my browser get paged out all the time".

We have other measures in place now to protect the working set
on the file LRU lists (see above).  We are able to have those
measures in the kernel because we no longer have mixed LRU
lists.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-17  0:25                     ` KAMEZAWA Hiroyuki
@ 2010-06-17  6:16                       ` Christoph Hellwig
  -1 siblings, 0 replies; 198+ messages in thread
From: Christoph Hellwig @ 2010-06-17  6:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Christoph Hellwig, Mel Gorman, Rik van Riel, linux-kernel,
	linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin,
	Johannes Weiner, Andrew Morton

On Thu, Jun 17, 2010 at 09:25:38AM +0900, KAMEZAWA Hiroyuki wrote:
> 
> > BTW, why does xbf_buf_create() use GFP_KERNEL even if it can be blocked?
> > Memory cgroup just limits pages for users; it doesn't intend to
> > limit kernel pages.

You mean xfs_buf_allocate?  It doesn't in the end.  It goes through the
xfs_kmem helper which clears __GFP_FS if we're currently inside a
filesystem transaction (PF_FSTRANS is set) or a caller specifically
requested it to be disabled even without that by passing the
XBF_DONT_BLOCK flag.
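
From memory, that conversion looks roughly like the sketch below (simplified;
the KM_NOSLEEP/GFP_ATOMIC leg is omitted and the _sketch suffix is there
because this is not the verbatim XFS source):

	static inline gfp_t kmem_flags_convert_sketch(unsigned int km_flags)
	{
		gfp_t lflags = GFP_KERNEL | __GFP_NOWARN;

		/* Inside a transaction, or the caller explicitly asked for
		 * KM_NOFS: drop __GFP_FS so the allocation cannot recurse
		 * into the filesystem.
		 */
		if ((current->flags & PF_FSTRANS) || (km_flags & KM_NOFS))
			lflags &= ~__GFP_FS;
		return lflags;
	}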


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim
  2010-06-17  6:16                       ` Christoph Hellwig
@ 2010-06-17  6:23                         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 198+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-06-17  6:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mel Gorman, Rik van Riel, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner,
	Andrew Morton

On Thu, 17 Jun 2010 02:16:47 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Jun 17, 2010 at 09:25:38AM +0900, KAMEZAWA Hiroyuki wrote:
> > 
> > BTW, why does xbf_buf_create() use GFP_KERNEL even if it can be blocked?
> > Memory cgroup just limits pages for users; it doesn't intend to
> > limit kernel pages.
> 
> You mean xfs_buf_allocate?  It doesn't in the end.  It goes through the
> xfs_kmem helper which clears __GFP_FS if we're currently inside a
> filesystem transaction (PF_FSTRANS is set) or a caller specifically
> requested it to be disabled even without that by passing the
> XBF_DONT_BLOCK flag.
> 
Ah, sorry. My question was wrong.

If xfs_buf_allocate() is not for pages on the LRU but for kernel memory,
memory cgroup has no reason to charge against it, because we can't reclaim
memory which is not on the LRU.

Then, I wonder whether I may have to add the following check

	if (!(gfp_mask & __GFP_RECLAIMABLE)) {
		/* ignore this. we just charge against reclaimable memory on LRU. */
		return 0;
	}

to mem_cgroup_charge_cache(), which is the hook for accounting page cache.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  2010-06-16 23:43     ` Andrew Morton
@ 2010-06-17 10:30       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-17 10:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Wed, Jun 16, 2010 at 04:43:09PM -0700, Andrew Morton wrote:
> On Mon, 14 Jun 2010 12:17:49 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
> > uses significant amounts of stack doing this. This patch splits
> > shrink_inactive_list() to take the stack usage out of the main path so
> > that callers to writepage() do not contain an unused pagevec on the
> > stack.
> 
> You can get the entire pagevec off the stack - just make it a
> static-to-shrink_inactive_list() pagevec-per-cpu.
> 

That idea has been floated as well. I didn't pursue it because Dave
said that giving page reclaim a stack diet was never going to be the
full solution so I didn't think the complexity was justified.

I kept some of the stack reduction stuff because a) it was there and b)
it would give kswapd extra headroom when calling writepage.

> Locking just requires pinning to a CPU.  We could trivially co-opt
> shrink_inactive_list()'s spin_lock_irq() for that, but
> pagevec_release() can be relatively expensive so it'd be sad to move
> that inside spin_lock_irq().  It'd be better to slap a
> get_cpu()/put_cpu() around the whole thing.
> 

It'd be something interesting to try out when nothing else is happening, but
I'm not going to focus on it for the moment unless I think it will really
help this stack overflow problem.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()
  2010-06-16 23:48     ` Andrew Morton
@ 2010-06-17 10:46       ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-17 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason,
	Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig,
	KAMEZAWA Hiroyuki

On Wed, Jun 16, 2010 at 04:48:01PM -0700, Andrew Morton wrote:
> On Mon, 14 Jun 2010 12:17:50 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > shrink_page_list() sets up a pagevec to release pages as they are freed.
> > It uses significant amounts of stack on the pagevec. This
> > patch adds pages to be freed via pagevec to a linked list which is then
> > freed en-masse at the end. This avoids using stack in the main path that
> > potentially calls writepage().
> > 
> 
> hm, spose so.  I can't see any trivial way to eliminate the local
> pagevec there.
> 
> > +	if (pagevec_count(&freed_pvec))
> > +		__pagevec_free(&freed_pvec);
> > ...
> > -	if (pagevec_count(&freed_pvec))
> > -		__pagevec_free(&freed_pvec);
> 
> That's an open-coded pagevec_free().
> 

Fair point, will correct. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-16 23:37     ` Andrew Morton
@ 2010-06-25  8:29       ` KOSAKI Motohiro
  -1 siblings, 0 replies; 198+ messages in thread
From: KOSAKI Motohiro @ 2010-06-25  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Mel Gorman, linux-kernel, linux-fsdevel,
	linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki


sorry for the long delay.
(and I wonder a bit why I was not CCed on this thread ;)

> On Mon, 14 Jun 2010 12:17:46 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > Since 2.6.28 zone->prev_priority is unused, so it can be removed
> > safely. It reduces stack usage slightly.
> > 
> > Now I have to say that I'm sorry. Two years ago, I thought prev_priority
> > could be integrated again and would be useful, but four (or more) attempts
> > haven't produced good performance numbers. Thus I gave up on that approach.
> 
> This would have been badder in earlier days when we were using the
> scanning priority to decide when to start unmapping pte-mapped pages -
> page reclaim would have been recirculating large blobs of mapped pages
> around the LRU until the priority had built to the level where we
> started to unmap them.
> 
> However that priority-based decision got removed and right now I don't
> recall what it got replaced with.  Aren't we now unmapping pages way
> too early and suffering an increased major&minor fault rate?  Worried.
> 
> 
> Things which are still broken after we broke prev_priority:
> 
> - If page reclaim is having a lot of trouble, prev_priority would
>   have permitted do_try_to_free_pages() to call disable_swap_token()
>   earlier on.  As things presently stand, we'll do a lot of
>   thrash-detection stuff before (presumably correctly) dropping the
>   swap token.
> 
>   So.  What's up with that?  I don't even remember _why_ we disable
>   the swap token once the scanning priority gets severe and the code
>   comments there are risible.  And why do we wait until priority==0
>   rather than priority==1?
> 
> - Busted prev_priority means that lumpy reclaim will act oddly.
>   Every time someone goes in to do some reclaim, they'll start out not
>   doing lumpy reclaim.  Then, after a while, they'll get a clue and
>   will start doing the lumpy thing.  Then they return from reclaim and
>   the next reclaim caller will again forget that he should have done
>   lumpy reclaim.
> 
>   I dunno what the effects of this are in the real world, but it
>   seems dumb.
> 
> And one has to wonder: if we're making these incorrect decisions based
> upon a bogus view of the current scanning difficulty, why are these
> various priority-based thresholding heuristics even in there?  Are they
> doing anything useful?
> 
> So..  either we have a load of useless-crap-and-cruft in there which
> should be lopped out, or we don't have a load of useless-crap-and-cruft
> in there, and we should fix prev_priority.

May I explain my experience? I'd like to explain why prev_priority doesn't
work nowadays.

First of all, yes, current vmscan still has a lot of UP-centric code. It
exposes some weaknesses on machines with dozens of CPUs. I think we need
more and more improvement.

The problem is that current vmscan mixes up per-system pressure, per-zone
pressure and per-task pressure a bit. For example, prev_priority tries to
boost the priority of other concurrent reclaimers, but if another task has a
mempolicy restriction that is unnecessary, and it also causes wrongly large
latency and excessive reclaim. Per-task priority plus the prev_priority
adjustment emulates per-system pressure, but it has two issues: 1) the
emulation is too rough and brutal, and 2) we need per-zone pressure, not
per-system.

Another example: currently DEF_PRIORITY is 12. That means the LRU rotates
about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before invoking the
OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the same
time, the system is under higher memory pressure than priority==0
(1/4096 * 10,000 > 2). prev_priority can't solve such a multithreaded
workload issue.
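
Spelling that arithmetic out: a single task that walks every priority level
scans 1/4096 + 1/2048 + ... + 1/2 + 1 = 2 - 1/4096 of the LRU, i.e. roughly
two full cycles, whereas 10,000 concurrent reclaimers each doing only the
initial 1/4096 scan already cover 10,000/4096, roughly 2.4 cycles, before any
of them drops a single priority level.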

In other words, the prev_priority concept assumes the system doesn't have
lots of threads.

And I don't think the lumpy reclaim threshold is a big matter, because it was
introduced to address an aim7 corner-case issue. I don't think such a
situation will occur frequently in real workloads, thus end users can't
observe that logic.

As for the mapped-vs-unmapped thing, I don't know the exact reason. That was
introduced by Rik; unfortunately I had not joined that activity at design
time. I can only say that in my testing the current code works well.

That said, my conclusion is the opposite. In the long-term view, we should
consider killing the reclaim priority completely. Instead, we should
consider introducing per-zone pressure statistics.



> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/mmzone.h |   15 ------------
> >  mm/page_alloc.c        |    2 -
> >  mm/vmscan.c            |   57 ------------------------------------------------
> >  mm/vmstat.c            |    2 -
> 
> The patch forgot to remove mem_cgroup_get_reclaim_priority() and friends.

Sure. thanks.
Will fix.


BTW, current zone reclaim has wrong swap token usage.

	static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
	{
	(snip)
	        disable_swap_token();
	        cond_resched();


I can't understand the reason why zone reclaim _always_ disables the swap
token. That means that if zone reclaim is enabled on the system, the swap
token doesn't work at all.

Perhaps the original author's intention was the following, I guess.

                priority = ZONE_RECLAIM_PRIORITY;
                do {
                        if ((zone_reclaim_mode & RECLAIM_SWAP) && !priority)	// here
			        disable_swap_token();				// here

                        note_zone_scanning_priority(zone, priority);
                        shrink_zone(priority, zone, &sc);
                        priority--;
                } while (priority >= 0 && sc.nr_reclaimed < nr_pages);


However, if my understanding is correct, we can remove this
disable_swap_token() completely, because a zone reclaim failure doesn't lead
to the OOM-killer; it merely causes a normal try_to_free_pages().
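
For reference, the fallback path I'm relying on looks roughly like this
(sketched from memory, so don't take the exact lines literally):

	/*
	 * do_try_to_free_pages(): the swap token is only dropped once the
	 * priority loop has reached 0, so the severe-pressure case is already
	 * handled there without zone reclaim's unconditional call.
	 */
	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		sc->nr_scanned = 0;
		if (!priority)
			disable_swap_token();
		(snip)
	}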





^ permalink raw reply	[flat|nested] 198+ messages in thread

* Re: [PATCH 05/12] vmscan: kill prev_priority completely
  2010-06-25  8:29       ` KOSAKI Motohiro
@ 2010-06-28 10:35         ` Mel Gorman
  -1 siblings, 0 replies; 198+ messages in thread
From: Mel Gorman @ 2010-06-28 10:35 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, linux-mm,
	Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel,
	Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki

On Fri, Jun 25, 2010 at 05:29:41PM +0900, KOSAKI Motohiro wrote:
> 
> sorry for the long delay.
> (and I'm a bit wonder why I was not CCed this thread ;)
> 

My fault, I unintentionally deleted your name from the send script.
Sorry about that.

> > On Mon, 14 Jun 2010 12:17:46 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > 
> > > Since 2.6.28 zone->prev_priority is unused. Then it can be removed
> > > safely. It reduce stack usage slightly.
> > > 
> > > Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
> > > can be integrate again, it's useful. but four (or more) times trying
> > > haven't got good performance number. Thus I give up such approach.
> > 
> > This would have been badder in earlier days when we were using the
> > scanning priority to decide when to start unmapping pte-mapped pages -
> > page reclaim would have been recirculating large blobs of mapped pages
> > around the LRU until the priority had built to the level where we
> > started to unmap them.
> > 
> > However that priority-based decision got removed and right now I don't
> > recall what it got replaced with.  Aren't we now unmapping pages way
> > too early and suffering an increased major&minor fault rate?  Worried.
> > 
> > 
> > Things which are still broken after we broke prev_priority:
> > 
> > - If page reclaim is having a lot of trouble, prev_priority would
> >   have permitted do_try_to_free_pages() to call disable_swap_token()
> >   earlier on.  As things presently stand, we'll do a lot of
> >   thrash-detection stuff before (presumably correctly) dropping the
> >   swap token.
> > 
> >   So.  What's up with that?  I don't even remember _why_ we disable
> >   the swap token once the scanning priority gets severe and the code
> >   comments there are risible.  And why do we wait until priority==0
> >   rather than priority==1?
> > 
> > - Busted prev_priority means that lumpy reclaim will act oddly. 
> >   Every time someone goes into do some recalim, they'll start out not
> >   doing lumpy reclaim.  Then, after a while, they'll get a clue and
> >   will start doing the lumpy thing.  Then they return from reclaim and
> >   the next recalim caller will again forget that he should have done
> >   lumpy reclaim.
> > 

The intention of the code was that for orders below PAGE_ALLOC_COSTLY_ORDER,
there was an expectation that those pages would be free or nearly free
without page reclaim taking special steps with lumpy reclaim. If this has
changed, it's almost certainly because of a greater dependence on high-order
pages than previously, which should be resisted (it has cropped up a few
times recently). I do have a script that uses ftrace to count call sites
making high-order allocations and how often they occur, which would be of
use if this problem is being investigated.
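
The counting part of it is nothing sophisticated. A minimal sketch of the
idea is below; this is not the actual script, it assumes the
kmem:mm_page_alloc tracepoint has been enabled and the ftrace output is read
from the usual debugfs location, and a per-call-site breakdown would
additionally need stack traces to be recorded.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	/* Tally page allocations by order from ftrace output. */
	int main(int argc, char **argv)
	{
		const char *path = argc > 1 ? argv[1]
					    : "/sys/kernel/debug/tracing/trace";
		unsigned long counts[16] = { 0 };
		char line[1024];
		int order;
		FILE *fp = fopen(path, "r");

		if (!fp) {
			perror(path);
			return 1;
		}

		while (fgets(line, sizeof(line), fp)) {
			char *p;

			if (!strstr(line, "mm_page_alloc"))
				continue;
			p = strstr(line, "order=");
			if (!p)
				continue;
			order = atoi(p + strlen("order="));
			if (order >= 0 && order < 16)
				counts[order]++;
		}
		fclose(fp);

		for (order = 0; order < 16; order++)
			if (counts[order])
				printf("order %2d: %lu allocations\n",
				       order, counts[order]);
		return 0;
	}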

> >   I dunno what the effects of this are in the real world, but it
> >   seems dumb.
> > 
> > And one has to wonder: if we're making these incorrect decisions based
> > upon a bogus view of the current scanning difficulty, why are these
> > various priority-based thresholding heuristics even in there?  Are they
> > doing anything useful?
> > 
> > So..  either we have a load of useless-crap-and-cruft in there which
> > should be lopped out, or we don't have a load of useless-crap-and-cruft
> > in there, and we should fix prev_priority.
> 
> May I explain my experience? I'd like to explain why prev_priority doesn't
> work nowadays.
> 
> First of all, yes, current vmscan still has a lot of UP-centric code. It
> exposes some weaknesses on machines with dozens of CPUs. I think we need
> more and more improvement there.
> 
> The problem is that current vmscan mixes up per-system pressure, per-zone
> pressure and per-task pressure. For example, prev_priority tries to boost a
> task's priority toward that of other concurrent reclaimers, but if that task
> has a mempolicy restriction the boost is unnecessary and it also causes
> wrongly large latency and excessive reclaim. Per-task priority plus the
> prev_priority adjustment emulates per-system pressure, but that has two
> issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
> pressure, not per-system pressure.
> 
> Another example: DEF_PRIORITY is currently 12, which means the LRU rotates
> about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking the
> OOM-killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the same
> time, the system sees higher memory pressure than a single priority==0 pass
> (1/4096 * 10,000 > 2). prev_priority can't solve such a multithreaded
> workload issue.
> 
> In other words, the prev_priority concept assumes the system doesn't have
> lots of threads.
> 
> And I don't think the lumpy reclaim threshold is a big matter, because it
> was introduced to address an aim7 corner case. I don't think such a
> situation occurs frequently in real workloads, so end users can't observe
> that logic.
> 

I'm not aware of current problems with lumpy-reclaim-related stalls, but
it's not something I have specifically investigated. If there is a known
example workload that is felt to trigger lumpy reclaim more than it should,
someone point me in the general direction and I'll take a look at it with
ftrace and see what falls out.

> For the mapped-vs-unmapped thing, I don't know the exact reason. That was
> introduced by Rik and unfortunately I wasn't involved at design time. I can
> only say that in my testing the current code works well.
> 
> That said, my conclusion is the opposite: in the long term, we should
> consider killing the reclaim priority completely and instead introducing
> per-zone pressure statistics.

Ah, the "what is pressure?" rat-hole :)

> 
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  include/linux/mmzone.h |   15 ------------
> > >  mm/page_alloc.c        |    2 -
> > >  mm/vmscan.c            |   57 ------------------------------------------------
> > >  mm/vmstat.c            |    2 -
> > 
> > The patch forgot to remove mem_cgroup_get_reclaim_priority() and friends.
> 
> Sure. thanks.
> Will fix.
> 

I've fixed this up in the current patchset, which is V3.

> 
> By the way, current zone reclaim has wrong swap token usage.
> 
> 	static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> 	{
> 	(snip)
> 	        disable_swap_token();
> 	        cond_resched();
> 
> 
> I can't understand the reason why zone reclaim _always_ disables the swap
> token. That means that if zone reclaim is enabled on the system, the swap
> token doesn't work at all.
> 
> Perhaps the original author's intention was the following, I guess:
> 
>                 priority = ZONE_RECLAIM_PRIORITY;
>                 do {
>                         if ((zone_reclaim_mode & RECLAIM_SWAP) && !priority)	// here
> 			        disable_swap_token();				// here
> 
>                         note_zone_scanning_priority(zone, priority);
>                         shrink_zone(priority, zone, &sc);
>                         priority--;
>                 } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
> 
> 
> However, if my understanding is correct, we can remove this
> disable_swap_token() completely, because a zone reclaim failure doesn't lead
> to the OOM-killer; it merely causes a normal try_to_free_pages().
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 198+ messages in thread

end of thread, other threads:[~2010-06-28 10:35 UTC | newest]

Thread overview: 198+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-06-14 11:17 [PATCH 0/12] Avoid overflowing of stack during page reclaim V2 Mel Gorman
2010-06-14 11:17 ` Mel Gorman
2010-06-14 11:17 ` [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 15:45   ` Rik van Riel
2010-06-14 15:45     ` Rik van Riel
2010-06-14 21:01   ` Larry Woodman
2010-06-14 21:01     ` Larry Woodman
2010-06-14 11:17 ` [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 16:47   ` Rik van Riel
2010-06-14 16:47     ` Rik van Riel
2010-06-14 21:02   ` Larry Woodman
2010-06-14 21:02     ` Larry Woodman
2010-06-14 11:17 ` [PATCH 03/12] tracing, vmscan: Add trace event when a page is written Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 16:48   ` Rik van Riel
2010-06-14 16:48     ` Rik van Riel
2010-06-14 21:02   ` Larry Woodman
2010-06-14 21:02     ` Larry Woodman
2010-06-14 11:17 ` [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 17:55   ` Rik van Riel
2010-06-14 17:55     ` Rik van Riel
2010-06-14 21:03   ` Larry Woodman
2010-06-14 21:03     ` Larry Woodman
2010-06-14 11:17 ` [PATCH 05/12] vmscan: kill prev_priority completely Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 18:04   ` Rik van Riel
2010-06-14 18:04     ` Rik van Riel
2010-06-16 23:37   ` Andrew Morton
2010-06-16 23:37     ` Andrew Morton
2010-06-16 23:45     ` Rik van Riel
2010-06-16 23:45       ` Rik van Riel
2010-06-17  0:18       ` Andrew Morton
2010-06-17  0:18         ` Andrew Morton
2010-06-17  0:34         ` Rik van Riel
2010-06-17  0:34           ` Rik van Riel
2010-06-25  8:29     ` KOSAKI Motohiro
2010-06-25  8:29       ` KOSAKI Motohiro
2010-06-28 10:35       ` Mel Gorman
2010-06-28 10:35         ` Mel Gorman
2010-06-14 11:17 ` [PATCH 06/12] vmscan: simplify shrink_inactive_list() Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 18:06   ` Rik van Riel
2010-06-14 18:06     ` Rik van Riel
2010-06-15 10:13     ` Mel Gorman
2010-06-15 10:13       ` Mel Gorman
2010-06-14 11:17 ` [PATCH 07/12] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 18:14   ` Rik van Riel
2010-06-14 18:14     ` Rik van Riel
2010-06-14 11:17 ` [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list() Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 18:59   ` Rik van Riel
2010-06-14 18:59     ` Rik van Riel
2010-06-15 10:47   ` Christoph Hellwig
2010-06-15 10:47     ` Christoph Hellwig
2010-06-15 15:56     ` Mel Gorman
2010-06-15 15:56       ` Mel Gorman
2010-06-16 23:43   ` Andrew Morton
2010-06-16 23:43     ` Andrew Morton
2010-06-17 10:30     ` Mel Gorman
2010-06-17 10:30       ` Mel Gorman
2010-06-14 11:17 ` [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list() Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 19:24   ` Rik van Riel
2010-06-14 19:24     ` Rik van Riel
2010-06-16 23:48   ` Andrew Morton
2010-06-16 23:48     ` Andrew Morton
2010-06-17 10:46     ` Mel Gorman
2010-06-17 10:46       ` Mel Gorman
2010-06-14 11:17 ` [PATCH 10/12] vmscan: Update isolated page counters outside of main path in shrink_inactive_list() Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 19:42   ` Rik van Riel
2010-06-14 19:42     ` Rik van Riel
2010-06-14 11:17 ` [PATCH 11/12] vmscan: Write out dirty pages in batch Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 21:13   ` Rik van Riel
2010-06-14 21:13     ` Rik van Riel
2010-06-15 10:18     ` Mel Gorman
2010-06-15 10:18       ` Mel Gorman
2010-06-14 23:11   ` Dave Chinner
2010-06-14 23:11     ` Dave Chinner
2010-06-14 23:21     ` Andrew Morton
2010-06-14 23:21       ` Andrew Morton
2010-06-15  0:39       ` Dave Chinner
2010-06-15  0:39         ` Dave Chinner
2010-06-15  1:16         ` Rik van Riel
2010-06-15  1:16           ` Rik van Riel
2010-06-15  1:45           ` Andrew Morton
2010-06-15  1:45             ` Andrew Morton
2010-06-15  4:08             ` Rik van Riel
2010-06-15  4:08               ` Rik van Riel
2010-06-15  4:37               ` Andrew Morton
2010-06-15  4:37                 ` Andrew Morton
2010-06-15  5:12                 ` Nick Piggin
2010-06-15  5:12                   ` Nick Piggin
2010-06-15  5:43                   ` [patch] mm: vmscan fix mapping use after free Nick Piggin
2010-06-15  5:43                     ` Nick Piggin
2010-06-15 13:23                     ` Mel Gorman
2010-06-15 13:23                       ` Mel Gorman
2010-06-15 11:01           ` [PATCH 11/12] vmscan: Write out dirty pages in batch Christoph Hellwig
2010-06-15 11:01             ` Christoph Hellwig
2010-06-15 13:32             ` Rik van Riel
2010-06-15 13:32               ` Rik van Riel
2010-06-15  1:39         ` Andrew Morton
2010-06-15  1:39           ` Andrew Morton
2010-06-15  3:20           ` Dave Chinner
2010-06-15  3:20             ` Dave Chinner
2010-06-15  4:15             ` Andrew Morton
2010-06-15  4:15               ` Andrew Morton
2010-06-15  6:36               ` Dave Chinner
2010-06-15  6:36                 ` Dave Chinner
2010-06-15 10:28                 ` Evgeniy Polyakov
2010-06-15 10:28                   ` Evgeniy Polyakov
2010-06-15 10:55                   ` Nick Piggin
2010-06-15 10:55                     ` Nick Piggin
2010-06-15 11:10                     ` Christoph Hellwig
2010-06-15 11:10                       ` Christoph Hellwig
2010-06-15 11:20                       ` Nick Piggin
2010-06-15 11:20                         ` Nick Piggin
2010-06-15 23:20                     ` Dave Chinner
2010-06-15 23:20                       ` Dave Chinner
2010-06-16  6:04                       ` Nick Piggin
2010-06-16  6:04                         ` Nick Piggin
2010-06-15 11:08                   ` Christoph Hellwig
2010-06-15 11:08                     ` Christoph Hellwig
2010-06-15 11:43               ` Mel Gorman
2010-06-15 11:43                 ` Mel Gorman
2010-06-15 13:07                 ` tytso
2010-06-15 13:07                   ` tytso
2010-06-15 15:44                 ` Mel Gorman
2010-06-15 15:44                   ` Mel Gorman
2010-06-15 10:57       ` Christoph Hellwig
2010-06-15 10:57         ` Christoph Hellwig
2010-06-15 10:53   ` Christoph Hellwig
2010-06-15 10:53     ` Christoph Hellwig
2010-06-15 11:11     ` Mel Gorman
2010-06-15 11:11       ` Mel Gorman
2010-06-15 11:13     ` Nick Piggin
2010-06-15 11:13       ` Nick Piggin
2010-06-14 11:17 ` [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim Mel Gorman
2010-06-14 11:17   ` Mel Gorman
2010-06-14 21:55   ` Rik van Riel
2010-06-14 21:55     ` Rik van Riel
2010-06-15 11:45     ` Mel Gorman
2010-06-15 11:45       ` Mel Gorman
2010-06-15 13:34       ` Rik van Riel
2010-06-15 13:34         ` Rik van Riel
2010-06-15 13:37         ` Christoph Hellwig
2010-06-15 13:37           ` Christoph Hellwig
2010-06-15 13:54           ` Mel Gorman
2010-06-15 13:54             ` Mel Gorman
2010-06-16  0:30             ` KAMEZAWA Hiroyuki
2010-06-16  0:30               ` KAMEZAWA Hiroyuki
2010-06-15 14:02           ` Rik van Riel
2010-06-15 14:02             ` Rik van Riel
2010-06-15 13:59         ` Mel Gorman
2010-06-15 13:59           ` Mel Gorman
2010-06-15 14:04           ` Rik van Riel
2010-06-15 14:04             ` Rik van Riel
2010-06-15 14:16             ` Mel Gorman
2010-06-15 14:16               ` Mel Gorman
2010-06-16  0:17               ` KAMEZAWA Hiroyuki
2010-06-16  0:17                 ` KAMEZAWA Hiroyuki
2010-06-16  0:29                 ` Rik van Riel
2010-06-16  0:29                   ` Rik van Riel
2010-06-16  0:39                   ` KAMEZAWA Hiroyuki
2010-06-16  0:39                     ` KAMEZAWA Hiroyuki
2010-06-16  0:53                     ` Rik van Riel
2010-06-16  0:53                       ` Rik van Riel
2010-06-16  1:40                       ` KAMEZAWA Hiroyuki
2010-06-16  1:40                         ` KAMEZAWA Hiroyuki
2010-06-16  2:20                         ` KAMEZAWA Hiroyuki
2010-06-16  2:20                           ` KAMEZAWA Hiroyuki
2010-06-16  5:11                           ` Christoph Hellwig
2010-06-16  5:11                             ` Christoph Hellwig
2010-06-16 10:51                             ` Jens Axboe
2010-06-16 10:51                               ` Jens Axboe
2010-06-16  5:07                     ` Christoph Hellwig
2010-06-16  5:07                       ` Christoph Hellwig
2010-06-16  5:06                 ` Christoph Hellwig
2010-06-16  5:06                   ` Christoph Hellwig
2010-06-17  0:25                   ` KAMEZAWA Hiroyuki
2010-06-17  0:25                     ` KAMEZAWA Hiroyuki
2010-06-17  6:16                     ` Christoph Hellwig
2010-06-17  6:16                       ` Christoph Hellwig
2010-06-17  6:23                       ` KAMEZAWA Hiroyuki
2010-06-17  6:23                         ` KAMEZAWA Hiroyuki
2010-06-14 15:10 ` [PATCH 0/12] Avoid overflowing of stack during page reclaim V2 Christoph Hellwig
2010-06-14 15:10   ` Christoph Hellwig
2010-06-15 11:45   ` Mel Gorman
2010-06-15 11:45     ` Mel Gorman
2010-06-15  0:08 ` KAMEZAWA Hiroyuki
2010-06-15  0:08   ` KAMEZAWA Hiroyuki
2010-06-15 11:49   ` Mel Gorman
2010-06-15 11:49     ` Mel Gorman
