* [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
@ 2010-09-06 10:47 ` Mel Gorman
  0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple root causes, which makes it
hard to justify tackling any one of them in isolation, and each fix would
still need integration testing. This patch series gathers together three
different patch sets which in combination should tackle some of the root
causes of the latency problems being reported.

The first patch improves vmscan latency tracking by recording when pages
get reclaimed by shrink_inactive_list(). For this series, the most
important result is being able to calculate the scanning/reclaim ratio as
a measure of the amount of work being done by page reclaim.

Patches 2 and 3 account for the time spent in congestion_wait() and avoid
going to sleep on congestion when it is unnecessary. This is expected to
reduce stalls in situations where the system is under memory pressure but
not because of congestion.

Patches 4-8 were originally developed by KOSAKI Motohiro but have been
reworked for this series. It has been noted that lumpy reclaim is far too
aggressive and thrashes the system somewhat. As SLUB uses high-order
allocations, a large cost incurred by lumpy reclaim will be noticeable. It
was also reported during transparent hugepage support testing that lumpy
reclaim was thrashing the system, and these patches should mitigate that
problem without disabling lumpy reclaim.

Patches 9-10 revisit avoiding filesystem writeback from direct reclaim.
This has been reported as a potential cause of stack overflow, but it can
also result in poor IO patterns that increase reclaim latencies.

There are patches similar to 9-10 already in mmotm but Andrew had concerns
about their impact. Hence, I revisited them as the last part of this
series for re-evaluation.

I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86:    Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64:  PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested, each based
on v2.6.36-rc3:

traceonly-v1r5: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
nocongest-v1r5: Patches 1-3 for testing wait_iff_congested()
lowlumpy-v1r5:  Patches 1-8 to test if lumpy reclaim is better
nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO

The tests run were as follows

kernbench
	Compile-based benchmark used as a performance smoke test

iozone
	Performance smoke test; it does not put the system under major stress

sysbench
	OLTP read-only benchmark. Will be re-run in the future as read-write

micro-mapped-file-stream
	This is a micro-benchmark from Johannes Weiner that accesses a
	large sparse file through mmap(). It was configured to run in
	single-CPU mode only but can be indicative of how well page
	reclaim identifies suitable pages. A rough sketch of the access
	pattern is included after this list.

stress-highalloc
	Tries to allocate huge pages under heavy load.
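
The sketch below is not Johannes' actual benchmark, just a minimal
single-threaded illustration of the access pattern it exercises: a large
sparse file is created, mapped with mmap() and read one byte per page so
the page cache fills with clean file pages that reclaim should find easy
to identify. The 8G file size and the filename are arbitrary assumptions
and it presumes a 64-bit build.

/* Minimal sketch of a mapped-file streaming load (not the real benchmark). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t size = (size_t)8 << 30;	/* 8G, assumed to exceed RAM */
	long page = sysconf(_SC_PAGESIZE);
	volatile unsigned long sink = 0;
	int fd = open("sparsefile", O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0 || ftruncate(fd, size) < 0) {
		perror("sparsefile");
		return 1;
	}

	char *map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Stream through the mapping once, touching one byte per page. */
	for (size_t off = 0; off < size; off += page)
		sink += (unsigned char)map[off];

	munmap(map, size);
	close(fd);
	unlink("sparsefile");
	return 0;
}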

kernbench, iozone and sysbench did not report any performance regression
on any machine, and as they did not put the machine under memory pressure
the main paths this series deals with were not exercised. sysbench will be
re-run in the future with read-write testing as it is sensitive to
writeback performance under memory pressure; it is an oversight that this
did not happen for this run.

X86-64 micro-mapped-file-stream
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
pgalloc_dma                       2631.00 (   0.00%)      2483.00 (  -5.96%)      2375.00 ( -10.78%)      2467.00 (  -6.65%)
pgalloc_dma32                  2840528.00 (   0.00%)   2841510.00 (   0.03%)   2841391.00 (   0.03%)   2842308.00 (   0.06%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                       1383.00 (   0.00%)      1182.00 ( -17.01%)      1177.00 ( -17.50%)      1181.00 ( -17.10%)
pgsteal_dma32                  2237658.00 (   0.00%)   2236581.00 (  -0.05%)   2219885.00 (  -0.80%)   2234527.00 (  -0.14%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma                 3006.00 (   0.00%)      1400.00 (-114.71%)      1547.00 ( -94.31%)      1347.00 (-123.16%)
pgscan_kswapd_dma32            4206487.00 (   0.00%)   3343082.00 ( -25.83%)   3425728.00 ( -22.79%)   3304369.00 ( -27.30%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma                  629.00 (   0.00%)      1793.00 (  64.92%)      1643.00 (  61.72%)      1868.00 (  66.33%)
pgscan_direct_dma32             506741.00 (   0.00%)   1402557.00 (  63.87%)   1330777.00 (  61.92%)   1448345.00 (  65.01%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       15449.00 (   0.00%)     15555.00 (   0.68%)     15319.00 (  -0.85%)     15963.00 (   3.22%)
allocstall                         152.00 (   0.00%)       941.00 (  83.85%)       967.00 (  84.28%)       729.00 (  79.15%)

These are just the raw figures taken from /proc/vmstat and are a rough
measure of reclaim activity. Note that the allocstall counts are higher
because we enter direct reclaim more often as a result of not sleeping on
congestion. In itself, that is not necessarily a bad thing. It's easier to
get a view of what happened from the vmscan tracepoint report.
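
For anyone wanting to eyeball the same kind of figure on a live system
without the tracepoints, a crude approximation can be had by summing the
pgscan_* and pgsteal_* counters from /proc/vmstat and reporting pages
reclaimed as a percentage of pages scanned, which is how the
scanned/reclaimed percentages in the reports below are derived. The
helper is only a minimal sketch and gives whole-uptime totals, not the
per-test figures the tracepoint report produces.

/* Crude scanned/reclaimed ratio from /proc/vmstat; a sketch, not part of the series. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *fp = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long value, scanned = 0, reclaimed = 0;

	if (!fp) {
		perror("/proc/vmstat");
		return 1;
	}

	/* Sum every pgscan_* and pgsteal_* counter across zones. */
	while (fscanf(fp, "%63s %llu", name, &value) == 2) {
		if (!strncmp(name, "pgscan_", 7))
			scanned += value;
		else if (!strncmp(name, "pgsteal_", 8))
			reclaimed += value;
	}
	fclose(fp);

	printf("scanned %llu reclaimed %llu reclaimed/scanned %.2f%%\n",
	       scanned, reclaimed,
	       scanned ? 100.0 * reclaimed / scanned : 0.0);
	return 0;
}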

FTrace Reclaim Statistics: vmscan
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
Direct reclaims                                152        941        967        729 
Direct reclaim pages scanned                507377    1404350    1332420    1450213 
Direct reclaim pages reclaimed               10968      72042      77186      41097 
Direct reclaim write file async I/O              0          0          0          0 
Direct reclaim write anon async I/O              0          0          0          0 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                        127195     241025     254825     188846 
Kswapd wakeups                                   6          1          1          1 
Kswapd pages scanned                       4210101    3345122    3427915    3306356 
Kswapd pages reclaimed                     2228073    2165721    2143876    2194611 
Kswapd reclaim write file async I/O              0          0          0          0 
Kswapd reclaim write anon async I/O              0          0          0          0 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)         7.60       3.03       3.24       3.43 
Time kswapd awake (seconds)                  12.46       9.46       9.56       9.40 

Total pages scanned                        4717478   4749472   4760335   4756569
Total pages reclaimed                      2239041   2237763   2221062   2235708
%age total pages scanned/reclaimed          47.46%    47.12%    46.66%    47.00%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        43.80%    21.38%    22.34%    23.46%
Percentage Time kswapd Awake                79.92%    79.56%    79.20%    80.48%

What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains
roughly the same and the ratio of pages scanned to pages reclaimed is more
or less unchanged. In other words, while we are sleeping less, reclaim is
not doing more work and, in fact, direct reclaim stalls for less time and
kswapd is awake for less time. Overall, the series reduces reclaim work.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited               148          0          0          0 
Direct time   congest     waited            8376ms        0ms        0ms        0ms 
Direct full   congest     waited               127          0          0          0 
Direct number conditional waited                 0        711        693        627 
Direct time   conditional waited               0ms        0ms        0ms        0ms 
Direct full   conditional waited               127          0          0          0 
KSwapd number congest     waited                38         11         12         14 
KSwapd time   congest     waited            3236ms      548ms      576ms      576ms 
KSwapd full   congest     waited                31          3          3          2 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                31          3          3          2 

The vanilla kernel spent 8 seconds asleep in direct reclaim while the
patched kernels spent no time at all asleep there.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)          9.75     11.14     11.26     11.19
Total Elapsed Time (seconds)                 15.59     11.89     12.07     11.68

And overall, the test completes significantly faster. The indications are
that reclaim did less work and that there were fewer stalls. Seems good.

PPC64 micro-mapped-file-stream
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
pgalloc_dma                    3027144.00 (   0.00%)   3025080.00 (  -0.07%)   3025463.00 (  -0.06%)   3026037.00 (  -0.04%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                    2399696.00 (   0.00%)   2399540.00 (  -0.01%)   2399592.00 (  -0.00%)   2399570.00 (  -0.01%)
pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_kswapd_dma              3690319.00 (   0.00%)   2883661.00 ( -27.97%)   2852314.00 ( -29.38%)   3008323.00 ( -22.67%)
pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgscan_direct_dma              1224036.00 (   0.00%)   1975664.00 (  38.04%)   2012185.00 (  39.17%)   1907869.00 (  35.84%)
pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pageoutrun                       15170.00 (   0.00%)     14636.00 (  -3.65%)     14664.00 (  -3.45%)     16027.00 (   5.35%)
allocstall                         712.00 (   0.00%)      1906.00 (  62.64%)      1912.00 (  62.76%)      2027.00 (  64.87%)

Similar trends to X86-64. allocstalls are up but that is not necessarily bad.

FTrace Reclaim Statistics: vmscan
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
Direct reclaims                                712       1906       1904       2021 
Direct reclaim pages scanned               1224100    1975664    2010015    1906767 
Direct reclaim pages reclaimed               79215     218292     202719     209388 
Direct reclaim write file async I/O              0          0          0          0 
Direct reclaim write anon async I/O              0          0          0          0 
Direct reclaim write file sync I/O               0          0          0          0 
Direct reclaim write anon sync I/O               0          0          0          0 
Wake kswapd requests                       1154724     805852     767944     848063 
Kswapd wakeups                                   3          2          2          2 
Kswapd pages scanned                       3690799    2884173    2852026    3008835 
Kswapd pages reclaimed                     2320481    2181248    2195908    2189076 
Kswapd reclaim write file async I/O              0          0          0          0 
Kswapd reclaim write anon async I/O              0          0          0          0 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)        21.02       7.19       7.72       6.76 
Time kswapd awake (seconds)                  39.55      25.31      24.88      24.83 

Total pages scanned                        4914899   4859837   4862041   4915602
Total pages reclaimed                      2399696   2399540   2398627   2398464
%age total pages scanned/reclaimed          48.82%    49.37%    49.33%    48.79%
%age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
%age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
Percentage Time Spent Direct Reclaim        43.44%    19.64%    20.77%    18.43%
Percentage Time kswapd Awake                87.36%    81.94%    81.84%    81.28%

Again, a similar trend: the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned is very
similar and the scanned/reclaimed ratio remains roughly the same. Once
again, reclaim is not doing more work but spends less time stalled in
direct reclaim and with kswapd awake.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited               499          0          0          0 
Direct time   congest     waited           22700ms        0ms        0ms        0ms 
Direct full   congest     waited               421          0          0          0 
Direct number conditional waited                 0       1214       1242       1290 
Direct time   conditional waited               0ms        4ms        0ms        0ms 
Direct full   conditional waited               421          0          0          0 
KSwapd number congest     waited               257        103         94        104 
KSwapd time   congest     waited           22116ms     7344ms     7476ms     7528ms 
KSwapd full   congest     waited               203         57         59         56 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited               203         57         59         56 

The vanilla kernel spent 22 seconds asleep in direct reclaim and no time
at all asleep with the patches, which is a big improvement. The time
kswapd spent waiting on congestion was also reduced by a large factor.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)         27.37     29.42     29.45     29.91
Total Elapsed Time (seconds)                 45.27     30.89     30.40     30.55

And the test again completed far faster.

X86-64 STRESS-HIGHALLOC
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
Pass 1          84.00 ( 0.00%)    84.00 ( 0.00%)    80.00 (-4.00%)    72.00 (-12.00%)
Pass 2          94.00 ( 0.00%)    94.00 ( 0.00%)    89.00 (-5.00%)    88.00 (-6.00%)
At Rest         95.00 ( 0.00%)    95.00 ( 0.00%)    95.00 ( 0.00%)    92.00 (-3.00%)

Success figures start dropping off for lowlumpy and nodirect. This ordinarily
would be a concern but the rest of the report paints a better picture.

FTrace Reclaim Statistics: vmscan
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
Direct reclaims                                838       1189       1323       1197 
Direct reclaim pages scanned                182207     168696     146310     133117 
Direct reclaim pages reclaimed               84208      81706      80442      54879 
Direct reclaim write file async I/O            538        619        839          0 
Direct reclaim write anon async I/O          36403      32892      44126      22085 
Direct reclaim write file sync I/O              88        108          1          0 
Direct reclaim write anon sync I/O           19107      15514        871          0 
Wake kswapd requests                          7761        827        865       6502 
Kswapd wakeups                                 749        733        658        614 
Kswapd pages scanned                       6400676    6871918    6875056    3126591 
Kswapd pages reclaimed                     3122126    3376919    3001799    1669300 
Kswapd reclaim write file async I/O          58199      67175      28483        925 
Kswapd reclaim write anon async I/O        1740452    1851455    1680964     186578 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)      3864.84    4426.77    3108.85     254.08 
Time kswapd awake (seconds)                1792.00    2130.10    1890.76     343.37 

Total pages scanned                        6582883   7040614   7021366   3259708
Total pages reclaimed                      3206334   3458625   3082241   1724179
%age total pages scanned/reclaimed          48.71%    49.12%    43.90%    52.89%
%age total pages scanned/written            28.18%    27.95%    25.00%     6.43%
%age  file pages scanned/written             0.89%     0.96%     0.42%     0.03%
Percentage Time Spent Direct Reclaim        53.38%    56.75%    47.80%     8.44%
Percentage Time kswapd Awake                35.35%    37.88%    43.97%    23.01%

Scanned/reclaimed ratios again look good. The scanned/written ratios look
very good for the nodirect patches, showing that writeback is happening
more in the flusher threads and less from direct reclaim. The expectation
is that the IO should be more efficient and, indeed, the time spent in
direct reclaim is massively reduced by the full series and kswapd spends a
little less time awake.

Overall, indications here are that things are moving much faster.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited              1060          1          0          0 
Direct time   congest     waited           63664ms      100ms        0ms        0ms 
Direct full   congest     waited               617          1          0          0 
Direct number conditional waited                 0       1650        866        838 
Direct time   conditional waited               0ms    20296ms     1916ms    17652ms 
Direct full   conditional waited               617          1          0          0 
KSwapd number congest     waited               399          0        466         12 
KSwapd time   congest     waited           33376ms        0ms    33048ms      968ms 
KSwapd full   congest     waited               318          0        312          9 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited               318          0        312          9 

The sleep times for congestion waiting get interesting here.
congestion_wait() times drop to almost zero but wait_iff_congested() still
detects when there is in fact congestion or too much writeback and goes to
sleep. Overall the times are reduced though - from 63-ish seconds to about
20. We are still backing off, just less aggressively.


MMTests Statistics: duration
User/Sys Time Running Test (seconds)       3375.95   3374.04   3395.56   2756.97
Total Elapsed Time (seconds)               5068.80   5623.06   4300.45   1492.09

Oddly, the nocongest patches took longer to complete the test, but the
full series reduces the test time by almost an hour, completing in under a
third of the time. I also looked at the latency figures when allocating
huge pages and got this

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-hydra-mean.ps

So it looks like the latencies in general are reduced. The full series
reduces latency by massive amounts but there is also a hint as to why
nocongest was slower overall. Its latencies were lower up to the point
where 72% of memory was allocated with huge pages; after that point the
latencies were higher, but this problem is resolved later in the series.

PPC64 STRESS-HIGHALLOC
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
Pass 1          27.00 ( 0.00%)    38.00 (11.00%)    31.00 ( 4.00%)    43.00 (16.00%)
Pass 2          41.00 ( 0.00%)    43.00 ( 2.00%)    33.00 (-8.00%)    55.00 (14.00%)
At Rest         84.00 ( 0.00%)    83.00 (-1.00%)    84.00 ( 0.00%)    85.00 ( 1.00%)

Success rates there are *way* up, particularly considering that the 16MB
huge pages on PPC64 mean it is always much harder to allocate them.

FTrace Reclaim Statistics: vmscan
              stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
Direct reclaims                                461        426        547        915 
Direct reclaim pages scanned                193118     171811     143647     138334 
Direct reclaim pages reclaimed              130100     108863      65954      63043 
Direct reclaim write file async I/O            442        293        748          0 
Direct reclaim write anon async I/O          52948      45149      29910       9949 
Direct reclaim write file sync I/O              34        154          0          0 
Direct reclaim write anon sync I/O           33128      27267        119          0 
Wake kswapd requests                           302        282        306        233 
Kswapd wakeups                                 154        146        123        132 
Kswapd pages scanned                      13019861   12506267    3409775    3072689 
Kswapd pages reclaimed                     4839299    4782393    1908499    1723469 
Kswapd reclaim write file async I/O          77348      77785      14580        214 
Kswapd reclaim write anon async I/O        2878272    2840643     428083     142755 
Kswapd reclaim write file sync I/O               0          0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0          0 
Time stalled direct reclaim (seconds)      7692.01    7473.31    1044.76     217.31 
Time kswapd awake (seconds)                7332.64    7171.23    1059.70     357.02 

Total pages scanned                       13212979  12678078   3553422   3211023
Total pages reclaimed                      4969399   4891256   1974453   1786512
%age total pages scanned/reclaimed          37.61%    38.58%    55.56%    55.64%
%age total pages scanned/written            23.02%    23.59%    13.32%     4.76%
%age  file pages scanned/written             0.59%     0.62%     0.43%     0.01%
Percentage Time Spent Direct Reclaim        42.66%    43.22%    26.30%     6.59%
Percentage Time kswapd Awake                82.06%    82.08%    45.82%    21.87%

Initially, it looks like the scanned/reclaimed ratios are much higher and
that's a bad thing. However, the number of pages scanned is reduced by
around 75% and the times spent in direct reclaim and with kswapd awake are
*massively* reduced. Overall the VM seems to be doing a lot less work.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited               811         23         38          0 
Direct time   congest     waited           40272ms      512ms     1496ms        0ms 
Direct full   congest     waited               484          4         14          0 
Direct number conditional waited                 0        703        345       1281 
Direct time   conditional waited               0ms    22776ms     1312ms    10428ms 
Direct full   conditional waited               484          4         14          0 
KSwapd number congest     waited                 1          0          6          6 
KSwapd time   congest     waited             100ms        0ms      124ms      404ms 
KSwapd full   congest     waited                 1          0          1          2 
KSwapd number conditional waited                 0          0          0          0 
KSwapd time   conditional waited               0ms        0ms        0ms        0ms 
KSwapd full   conditional waited                 1          0          1          2 

Not as dramatic a story here but the time spent asleep is reduced, and we
can still see that wait_iff_congested() goes to sleep when necessary.

MMTests Statistics: duration
User/Sys Time Running Test (seconds)      10340.18   9818.41   2927.13   3078.91
Total Elapsed Time (seconds)               8936.19   8736.59   2312.71   1632.74

The time to complete this test goes way down. Take the allocation success
rates: we are allocating 16% more memory as huge pages in less than a
fifth of the time, and this is reflected in the allocation latency data

http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-powyah-mean.ps

I recognise that this is a weighty series but the desktop latency and
other stall issues are a tricky topic. There are multiple root causes for
them, but I believe this series addresses a number of them. I think the
congestion_wait changes will also impact Dave Chinner's fs-mark test that
showed up in the minute-long livelock report, but I'm hoping the
filesystem people who were complaining about latencies in the VM will test
this series with their respective workloads.

 .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++-
 include/linux/backing-dev.h                        |    2 +-
 include/trace/events/vmscan.h                      |   44 ++++-
 include/trace/events/writeback.h                   |   35 +++
 mm/backing-dev.c                                   |   71 ++++++-
 mm/page_alloc.c                                    |    4 +-
 mm/vmscan.c                                        |  253 +++++++++++++++-----
 7 files changed, 368 insertions(+), 80 deletions(-)


* [PATCH 01/10] tracing, vmscan: Add trace events for LRU list shrinking
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

This patch adds a trace event for shrink_inactive_list(). It can be used
to determine how many pages were reclaimed and, for non-lumpy reclaim,
where exactly the pages were reclaimed from.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 .../trace/postprocess/trace-vmscan-postprocess.pl  |   39 +++++++++++++-----
 include/trace/events/vmscan.h                      |   42 ++++++++++++++++++++
 mm/vmscan.c                                        |    6 +++
 3 files changed, 77 insertions(+), 10 deletions(-)

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 1b55146..b3e73dd 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -46,7 +46,7 @@ use constant HIGH_KSWAPD_LATENCY		=> 20;
 use constant HIGH_KSWAPD_REWAKEUP		=> 21;
 use constant HIGH_NR_SCANNED			=> 22;
 use constant HIGH_NR_TAKEN			=> 23;
-use constant HIGH_NR_RECLAIM			=> 24;
+use constant HIGH_NR_RECLAIMED			=> 24;
 use constant HIGH_NR_CONTIG_DIRTY		=> 25;
 
 my %perprocesspid;
@@ -58,11 +58,13 @@ my $opt_read_procstat;
 my $total_wakeup_kswapd;
 my ($total_direct_reclaim, $total_direct_nr_scanned);
 my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_nr_reclaimed);
 my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
 my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
 my ($total_kswapd_nr_scanned, $total_kswapd_wake);
 my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
 my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
+my ($total_kswapd_nr_reclaimed);
 
 # Catch sigint and exit on request
 my $sigint_report = 0;
@@ -104,7 +106,7 @@ my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
 my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
 my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
 my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
-my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
 my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
 my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
 
@@ -203,8 +205,8 @@ $regex_lru_shrink_inactive = generate_traceevent_regex(
 			"vmscan/mm_vmscan_lru_shrink_inactive",
 			$regex_lru_shrink_inactive_default,
 			"nid", "zid",
-			"lru",
-			"nr_scanned", "nr_reclaimed", "priority");
+			"nr_scanned", "nr_reclaimed", "priority",
+			"flags");
 $regex_lru_shrink_active = generate_traceevent_regex(
 			"vmscan/mm_vmscan_lru_shrink_active",
 			$regex_lru_shrink_active_default,
@@ -375,6 +377,16 @@ EVENT_PROCESS:
 			my $nr_contig_dirty = $7;
 			$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
 			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+		} elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") {
+			$details = $5;
+			if ($details !~ /$regex_lru_shrink_inactive/o) {
+				print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n";
+				print "         $details\n";
+				print "         $regex_lru_shrink_inactive/o\n";
+				next;
+			}
+			my $nr_reclaimed = $4;
+			$perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed;
 		} elsif ($tracepoint eq "mm_vmscan_writepage") {
 			$details = $5;
 			if ($details !~ /$regex_writepage/o) {
@@ -464,8 +476,8 @@ sub dump_stats {
 
 	# Print out process activity
 	printf("\n");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",     "Time");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Sync-IO", "ASync-IO",  "Stalled");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s  %8s %8s %8s %8s\n", "Process", "Direct",  "Wokeup", "Pages",   "Pages",   "Pages",   "Pages",     "Time");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s %8s  %8s %8s %8s %8s\n", "details", "Rclms",   "Kswapd", "Scanned", "Rclmed",  "Sync-IO", "ASync-IO",  "Stalled");
 	foreach $process_pid (keys %stats) {
 
 		if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
@@ -475,6 +487,7 @@ sub dump_stats {
 		$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
 		$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
 		$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
 		$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
 		$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
@@ -489,11 +502,12 @@ sub dump_stats {
 			$index++;
 		}
 
-		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8u %8u %8.3f",
+		printf("%-" . $max_strlen . "s %8d %10d   %8u %8u  %8u %8u %8.3f",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
 			$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{HIGH_NR_RECLAIMED},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
 			$this_reclaim_delay / 1000);
@@ -529,8 +543,8 @@ sub dump_stats {
 
 	# Print out kswapd activity
 	printf("\n");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",  "Pages");
-	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Kswapd",   "Kswapd",  "Order",     "Pages",   "Pages",   "Pages",  "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s   %8s   %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed",  "Sync-IO", "ASync-IO");
 	foreach $process_pid (keys %stats) {
 
 		if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
@@ -539,16 +553,18 @@ sub dump_stats {
 
 		$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
 		$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+		$total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
 		$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
 		$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
 		$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
 		$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
 
-		printf("%-" . $max_strlen . "s %8d %10d   %8u   %8i %8u",
+		printf("%-" . $max_strlen . "s %8d %10d   %8u %8u  %8i %8u",
 			$process_pid,
 			$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
 			$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
 			$stats{$process_pid}->{HIGH_NR_SCANNED},
+			$stats{$process_pid}->{HIGH_NR_RECLAIMED},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
 			$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
 
@@ -579,6 +595,7 @@ sub dump_stats {
 	print "\nSummary\n";
 	print "Direct reclaims:     			$total_direct_reclaim\n";
 	print "Direct reclaim pages scanned:		$total_direct_nr_scanned\n";
+	print "Direct reclaim pages reclaimed:		$total_direct_nr_reclaimed\n";
 	print "Direct reclaim write file sync I/O:	$total_direct_writepage_file_sync\n";
 	print "Direct reclaim write anon sync I/O:	$total_direct_writepage_anon_sync\n";
 	print "Direct reclaim write file async I/O:	$total_direct_writepage_file_async\n";
@@ -588,6 +605,7 @@ sub dump_stats {
 	print "\n";
 	print "Kswapd wakeups:				$total_kswapd_wake\n";
 	print "Kswapd pages scanned:			$total_kswapd_nr_scanned\n";
+	print "Kswapd pages reclaimed:			$total_kswapd_nr_reclaimed\n";
 	print "Kswapd reclaim write file sync I/O:	$total_kswapd_writepage_file_sync\n";
 	print "Kswapd reclaim write anon sync I/O:	$total_kswapd_writepage_anon_sync\n";
 	print "Kswapd reclaim write file async I/O:	$total_kswapd_writepage_file_async\n";
@@ -612,6 +630,7 @@ sub aggregate_perprocesspid() {
 		$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
 		$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
 		$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+		$perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED};
 		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
 		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
 		$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 370aa5a..14c1586 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -10,6 +10,7 @@
 
 #define RECLAIM_WB_ANON		0x0001u
 #define RECLAIM_WB_FILE		0x0002u
+#define RECLAIM_WB_MIXED	0x0010u
 #define RECLAIM_WB_SYNC		0x0004u
 #define RECLAIM_WB_ASYNC	0x0008u
 
@@ -17,6 +18,7 @@
 	(flags) ? __print_flags(flags, "|",			\
 		{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
 		{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"},	\
+		{RECLAIM_WB_MIXED,	"RECLAIM_WB_MIXED"},	\
 		{RECLAIM_WB_SYNC,	"RECLAIM_WB_SYNC"},	\
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
@@ -26,6 +28,12 @@
 	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
+#define trace_shrink_flags(file, sync) ( \
+	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
+	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	)
+
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 	TP_PROTO(int nid),
@@ -269,6 +277,40 @@ TRACE_EVENT(mm_vmscan_writepage,
 		show_reclaim_flags(__entry->reclaim_flags))
 );
 
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+
+	TP_PROTO(int nid, int zid,
+			unsigned long nr_scanned, unsigned long nr_reclaimed,
+			int priority, int reclaim_flags),
+
+	TP_ARGS(nid, zid, nr_scanned, nr_reclaimed, priority, reclaim_flags),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, zid)
+		__field(unsigned long, nr_scanned)
+		__field(unsigned long, nr_reclaimed)
+		__field(int, priority)
+		__field(int, reclaim_flags)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->zid = zid;
+		__entry->nr_scanned = nr_scanned;
+		__entry->nr_reclaimed = nr_reclaimed;
+		__entry->priority = priority;
+		__entry->reclaim_flags = reclaim_flags;
+	),
+
+	TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+		__entry->nid, __entry->zid,
+		__entry->nr_scanned, __entry->nr_reclaimed,
+		__entry->priority,
+		show_reclaim_flags(__entry->reclaim_flags))
+);
+
+
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..652650f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,6 +1359,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+
+	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
+		zone_idx(zone),
+		nr_scanned, nr_reclaimed,
+		priority,
+		trace_shrink_flags(file, sc->lumpy_reclaim_mode));
 	return nr_reclaimed;
 }
 
-- 
1.7.1


* [PATCH 02/10] writeback: Account for time spent congestion_waited
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

There is strong evidence that a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a tracepoint
for congestion_wait() to record when it was called, how long the timeout
was and how long the caller actually slept.
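
To show how the event might be used (this is not part of the patch), the
total time actually slept and the number of full-timeout sleeps can be
summarised from the raw trace with a small userspace sketch. It assumes
only the usec_timeout/usec_delayed fields printed by the tracepoint below;
the rest is illustrative:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	unsigned int t, d;
	unsigned long long waited = 0;
	unsigned long calls = 0, full = 0;

	while (fgets(line, sizeof(line), stdin)) {
		char *p = strstr(line, "usec_timeout=");

		if (!p || !strstr(line, "writeback_congestion_wait"))
			continue;
		if (sscanf(p, "usec_timeout=%u usec_delayed=%u", &t, &d) != 2)
			continue;
		calls++;
		waited += d;
		if (d >= t)	/* slept for the full timeout */
			full++;
	}

	printf("calls=%lu waited_ms=%llu full_timeouts=%lu\n",
		calls, waited / 1000, full);
	return 0;
}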

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/trace/events/writeback.h |   28 ++++++++++++++++++++++++++++
 mm/backing-dev.c                 |    5 +++++
 2 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..275d477 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,34 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
 DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+DECLARE_EVENT_CLASS(writeback_congest_waited_template,
+
+	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+	TP_ARGS(usec_timeout, usec_delayed),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	usec_timeout	)
+		__field(	unsigned int,	usec_delayed	)
+	),
+
+	TP_fast_assign(
+		__entry->usec_timeout	= usec_timeout;
+		__entry->usec_delayed	= usec_delayed;
+	),
+
+	TP_printk("usec_timeout=%u usec_delayed=%u",
+			__entry->usec_timeout,
+			__entry->usec_delayed)
+);
+
+DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
+
+	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+	TP_ARGS(usec_timeout, usec_delayed)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eaa4a5b..298975a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -759,12 +759,17 @@ EXPORT_SYMBOL(set_bdi_congested);
 long congestion_wait(int sync, long timeout)
 {
 	long ret;
+	unsigned long start = jiffies;
 	DEFINE_WAIT(wait);
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
+
+	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
+					jiffies_to_usecs(jiffies - start));
+
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-- 
1.7.1


* [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

If congestion_wait() is called with no BDIs congested, the caller will sleep
for the full timeout, which may be an unnecessary sleep. This patch adds
wait_iff_congested(), which checks for congestion and only sleeps if a BDI is
congested or if there is a significant amount of writeback going on in the
zone of interest. Otherwise, it calls cond_resched() to ensure the caller is
not hogging the CPU longer than its quota, but it does not sleep.

This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
even though it may just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout; after
this patch, it just calls cond_resched() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.
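
To make the heuristic concrete (illustration only, with made-up numbers;
the real code below operates on a struct zone), the decision boils down to:

#include <stdio.h>

/* mirrors the check: no congested BDI and writeback < inactive/2 */
static int would_sleep(unsigned long nr_congested_bdis,
		       unsigned long inactive, unsigned long writeback)
{
	if (nr_congested_bdis == 0 && writeback < inactive / 2)
		return 0;	/* just cond_resched(), no timed sleep */
	return 1;		/* sleep on the congestion waitqueue */
}

int main(void)
{
	/* reclaiming clean page cache: nothing congested, little writeback */
	printf("clean cache: sleep=%d\n", would_sleep(0, 100000, 2000));
	/* most of the inactive list is already under writeback */
	printf("heavy dirty: sleep=%d\n", would_sleep(0, 100000, 60000));
	/* any congested BDI still sleeps as before */
	printf("congested:   sleep=%d\n", would_sleep(1, 100000, 2000));
	return 0;
}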

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/backing-dev.h      |    2 +-
 include/trace/events/writeback.h |    7 ++++
 mm/backing-dev.c                 |   66 ++++++++++++++++++++++++++++++++++++-
 mm/page_alloc.c                  |    4 +-
 mm/vmscan.c                      |   26 ++++++++++++--
 5 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..f1b402a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -285,7 +285,7 @@ enum {
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
 void set_bdi_congested(struct backing_dev_info *bdi, int sync);
 long congestion_wait(int sync, long timeout);
-
+long wait_iff_congested(struct zone *zone, int sync, long timeout);
 
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 275d477..eeaf1f5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
+DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
+
+	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+	TP_ARGS(usec_timeout, usec_delayed)
+);
+
 #endif /* _TRACE_WRITEBACK_H */
 
 /* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 298975a..94b5433 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
+static atomic_t nr_bdi_congested[2];
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	clear_bit(bit, &bdi->state);
+	if (test_and_clear_bit(bit, &bdi->state))
+		atomic_dec(&nr_bdi_congested[sync]);
 	smp_mb__after_clear_bit();
 	if (waitqueue_active(wqh))
 		wake_up(wqh);
@@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 	enum bdi_state bit;
 
 	bit = sync ? BDI_sync_congested : BDI_async_congested;
-	set_bit(bit, &bdi->state);
+	if (!test_and_set_bit(bit, &bdi->state))
+		atomic_inc(&nr_bdi_congested[sync]);
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
@@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
 }
 EXPORT_SYMBOL(congestion_wait);
 
+/**
+ * wait_iff_congested - conditionally wait for a backing_dev to become uncongested
+ * @zone: A zone to consider the number of pages being written back from
+ * @sync: SYNC or ASYNC IO
+ * @timeout: timeout in jiffies
+ *
+ * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
+ * write congestion.  If no backing_devs are congested then the number of
+ * writeback pages in the zone is checked and compared to the inactive
+ * list. If there is no significant writeback or congestion, there is no point
+ * in sleeping but cond_resched() is called in case the current process has
+ * consumed its CPU quota.
+ */
+long wait_iff_congested(struct zone *zone, int sync, long timeout)
+{
+	long ret;
+	unsigned long start = jiffies;
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	/*
+	 * If there is no congestion, check the amount of writeback. If there
+	 * is no significant writeback and no congestion, just cond_resched
+	 */
+	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
+		unsigned long inactive, writeback;
+
+		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+				zone_page_state(zone, NR_INACTIVE_ANON);
+		writeback = zone_page_state(zone, NR_WRITEBACK);
+
+		/*
+		 * If less than half the inactive list is being written back,
+		 * reclaim might as well continue
+		 */
+		if (writeback < inactive / 2) {
+			cond_resched();
+
+			/* In case we scheduled, work out time remaining */
+			ret = timeout - (jiffies - start);
+			if (ret < 0)
+				ret = 0;
+
+			goto out;
+		}
+	}
+
+	/* Sleep until uncongested or a write happens */
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = io_schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+
+out:
+	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
+					jiffies_to_usecs(jiffies - start));
+
+	return ret;
+}
+EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9649f4..641900a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1893,7 +1893,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 			preferred_zone, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
-			congestion_wait(BLK_RW_ASYNC, HZ/50);
+			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 	} while (!page && (gfp_mask & __GFP_NOFAIL));
 
 	return page;
@@ -2081,7 +2081,7 @@ rebalance:
 	pages_reclaimed += did_some_progress;
 	if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
 		/* Wait for some write requests to complete then retry */
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 652650f..eabe987 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,7 +1341,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The attempt at page out may have made some
@@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			sc->may_writepage = 1;
 		}
 
-		/* Take a nap, wait for some writeback to complete */
+		/* Take a nap if congested, wait for some writeback */
 		if (!sc->hibernation_mode && sc->nr_scanned &&
-		    priority < DEF_PRIORITY - 2)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		    priority < DEF_PRIORITY - 2) {
+			struct zone *active_zone = NULL;
+			unsigned long max_writeback = 0;
+			for_each_zone_zonelist(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask)) {
+				unsigned long writeback;
+
+				/* Initialise for first zone */
+				if (active_zone == NULL)
+					active_zone = zone;
+
+				writeback = zone_page_state(zone, NR_WRITEBACK);
+				if (writeback > max_writeback) {
+					max_writeback = writeback;
+					active_zone = zone;
+				}
+			}
+
+			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
+		}
 	}
 
 out:
-- 
1.7.1


* [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

congestion_wait() means "wait until queue congestion is cleared".  However,
synchronous lumpy reclaim does not need this congestion_wait() because
shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback(),
which provides the necessary waiting.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eabe987..5979850 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,8 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
-
 		/*
 		 * The attempt at page out may have made some
 		 * of the pages active, mark them inactive again.
-- 
1.7.1


* [PATCH 05/10] vmscan: Synchronous lumpy reclaim use lock_page() instead of trylock_page()
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

With synchronous lumpy reclaim, there is no reason to give up on reclaiming
a page just because it is locked. This patch uses lock_page() instead of
trylock_page() in this case.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5979850..79bd812 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		page = lru_to_page(page_list);
 		list_del(&page->lru);
 
-		if (!trylock_page(page))
+		if (sync_writeback == PAGEOUT_IO_SYNC)
+			lock_page(page);
+		else if (!trylock_page(page))
 			goto keep;
 
 		VM_BUG_ON(PageActive(page));
-- 
1.7.1


* [PATCH 06/10] vmscan: Narrow the scenarios in which lumpy reclaim uses synchronous reclaim
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as

  1. trylock_page() failure
  2. page is unevictable
  3. zone reclaim and page is mapped
  4. PageWriteback() is true
  5. page is swapbacked and swap is full
  6. add_to_swap() failure
  7. page is dirty and gfpmask doesn't have GFP_IO or GFP_FS
  8. page is pinned
  9. IO queue is congested
 10. pageout() starts IO, but it has not finished

During lumpy reclaim, all of these failures result in entering synchronous
lumpy reclaim, but this can be unnecessary.  In cases (2), (3), (5), (6), (7)
and (8), there is no point retrying.  This patch causes lumpy reclaim to
abort when it is known that it will fail.

Case (9) is more interesting. The current behavior is:
  1. start shrink_page_list(async)
  2. found queue_congested()
  3. skip pageout write
  4. still start shrink_page_list(sync)
  5. wait on a lot of pages
  6. again, found queue_congested()
  7. give up pageout write again

So it is a meaningless waste of time. However, just skipping the write is
also not good because, on x86 for example, allocating a huge page needs 512
base pages (2MB / 4KB), which can be more dirty pages than the queue
congestion threshold (~=128).

After this patch, pageout() behaves as follows (see the sketch after the
list):

 - If order > PAGE_ALLOC_COSTLY_ORDER
	Ignore queue congestion always.
 - If order <= PAGE_ALLOC_COSTLY_ORDER
	Skip the write and disable lumpy reclaim.
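
The sketch below (illustration only, not the patch itself) condenses that
decision; costly_order stands in for PAGE_ALLOC_COSTLY_ORDER (3) and the
queue-congestion test is stubbed out:

#include <stdio.h>

enum write_decision { WRITE_PAGE, SKIP_WRITE_DISABLE_LUMPY };

static enum write_decision pageout_write(int order, int queue_congested,
					 int costly_order)
{
	if (!queue_congested)
		return WRITE_PAGE;
	/* hugepage-sized requests ignore congestion and write anyway */
	if (order > costly_order)
		return WRITE_PAGE;
	/* small orders skip the write and drop back to async reclaim */
	return SKIP_WRITE_DISABLE_LUMPY;
}

int main(void)
{
	/* order-9 (hugepage) vs order-3 request against a congested queue */
	printf("order 9: %d\n", pageout_write(9, 1, 3));
	printf("order 3: %d\n", pageout_write(3, 1, 3));
	return 0;
}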

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/vmscan.h |    6 +-
 mm/vmscan.c                   |  122 +++++++++++++++++++++++++---------------
 2 files changed, 79 insertions(+), 49 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 14c1586..6f07c44 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -25,13 +25,13 @@
 
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
 	)
 
 #define trace_shrink_flags(file, sync) ( \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
 			(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) |  \
-	(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+	(sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
 	)
 
 TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79bd812..21d1153 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
+enum lumpy_mode {
+	LUMPY_MODE_NONE,
+	LUMPY_MODE_ASYNC,
+	LUMPY_MODE_SYNC,
+};
+
 struct scan_control {
 	/* Incremented by the number of inactive pages that were scanned */
 	unsigned long nr_scanned;
@@ -82,7 +88,7 @@ struct scan_control {
 	 * Intend to reclaim enough contenious memory rather than to reclaim
 	 * enough amount memory. I.e, it's the mode for high order allocation.
 	 */
-	bool lumpy_reclaim_mode;
+	enum lumpy_mode lumpy_reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -265,6 +271,36 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
 	return ret;
 }
 
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+				   bool sync)
+{
+	enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+
+	/*
+	 * Some reclaim has already failed. It is not worth trying synchronous
+	 * lumpy reclaim.
+	 */
+	if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+		return;
+
+	/*
+	 * If we need a large contiguous chunk of memory, or have
+	 * trouble getting a small set of contiguous pages, we
+	 * will reclaim both active and inactive pages.
+	 */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		sc->lumpy_reclaim_mode = mode;
+	else if (sc->order && priority < DEF_PRIORITY - 2)
+		sc->lumpy_reclaim_mode = mode;
+	else
+		sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
+static void disable_lumpy_reclaim_mode(struct scan_control *sc)
+{
+	sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
 static inline int is_page_cache_freeable(struct page *page)
 {
 	/*
@@ -275,7 +311,8 @@ static inline int is_page_cache_freeable(struct page *page)
 	return page_count(page) - page_has_private(page) == 2;
 }
 
-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi,
+			      struct scan_control *sc)
 {
 	if (current->flags & PF_SWAPWRITE)
 		return 1;
@@ -283,6 +320,10 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
 		return 1;
 	if (bdi == current->backing_dev_info)
 		return 1;
+
+	/* lumpy reclaim for hugepage often need a lot of write */
+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+		return 1;
 	return 0;
 }
 
@@ -307,12 +348,6 @@ static void handle_write_error(struct address_space *mapping,
 	unlock_page(page);
 }
 
-/* Request for sync pageout. */
-enum pageout_io {
-	PAGEOUT_IO_ASYNC,
-	PAGEOUT_IO_SYNC,
-};
-
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -330,7 +365,7 @@ typedef enum {
  * Calls ->writepage().
  */
 static pageout_t pageout(struct page *page, struct address_space *mapping,
-						enum pageout_io sync_writeback)
+			 struct scan_control *sc)
 {
 	/*
 	 * If the page is dirty, only perform writeback if that write
@@ -366,8 +401,10 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	}
 	if (mapping->a_ops->writepage == NULL)
 		return PAGE_ACTIVATE;
-	if (!may_write_to_queue(mapping->backing_dev_info))
+	if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
+		disable_lumpy_reclaim_mode(sc);
 		return PAGE_KEEP;
+	}
 
 	if (clear_page_dirty_for_io(page)) {
 		int res;
@@ -394,7 +431,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 		 * direct reclaiming a large contiguous area and the
 		 * first attempt to free a range of pages fails.
 		 */
-		if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+		if (PageWriteback(page) &&
+		    sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
 			wait_on_page_writeback(page);
 
 		if (!PageWriteback(page)) {
@@ -402,7 +440,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 			ClearPageReclaim(page);
 		}
 		trace_mm_vmscan_writepage(page,
-			trace_reclaim_flags(page, sync_writeback));
+			trace_reclaim_flags(page, sc->lumpy_reclaim_mode));
 		inc_zone_page_state(page, NR_VMSCAN_WRITE);
 		return PAGE_SUCCESS;
 	}
@@ -580,7 +618,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim_mode)
+	if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -644,8 +682,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-					struct scan_control *sc,
-					enum pageout_io sync_writeback)
+				      struct scan_control *sc)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -665,7 +702,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		page = lru_to_page(page_list);
 		list_del(&page->lru);
 
-		if (sync_writeback == PAGEOUT_IO_SYNC)
+		if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
 			lock_page(page);
 		else if (!trylock_page(page))
 			goto keep;
@@ -696,10 +733,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * for any page for which writeback has already
 			 * started.
 			 */
-			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+			if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
+			    may_enter_fs)
 				wait_on_page_writeback(page);
-			else
-				goto keep_locked;
+			else {
+				unlock_page(page);
+				goto keep_lumpy;
+			}
 		}
 
 		references = page_check_references(page, sc);
@@ -753,14 +793,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				goto keep_locked;
 
 			/* Page is dirty, try to write it out here */
-			switch (pageout(page, mapping, sync_writeback)) {
+			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
 				goto keep_locked;
 			case PAGE_ACTIVATE:
 				goto activate_locked;
 			case PAGE_SUCCESS:
-				if (PageWriteback(page) || PageDirty(page))
+				if (PageWriteback(page))
+					goto keep_lumpy;
+				if (PageDirty(page))
 					goto keep;
+
 				/*
 				 * A synchronous write - probably a ramdisk.  Go
 				 * ahead and try to reclaim the page.
@@ -843,6 +886,7 @@ cull_mlocked:
 			try_to_free_swap(page);
 		unlock_page(page);
 		putback_lru_page(page);
+		disable_lumpy_reclaim_mode(sc);
 		continue;
 
 activate_locked:
@@ -855,6 +899,8 @@ activate_locked:
 keep_locked:
 		unlock_page(page);
 keep:
+		disable_lumpy_reclaim_mode(sc);
+keep_lumpy:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
@@ -1255,7 +1301,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 		return false;
 
 	/* Only stall on lumpy reclaim */
-	if (!sc->lumpy_reclaim_mode)
+	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
 		return false;
 
 	/* If we have relaimed everything on the isolated list, no stall */
@@ -1300,15 +1346,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			return SWAP_CLUSTER_MAX;
 	}
 
-
+	set_lumpy_reclaim_mode(priority, sc, false);
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
 
 	if (scanning_global_lru(sc)) {
 		nr_taken = isolate_pages_global(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode ?
-				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, 0, file);
 		zone->pages_scanned += nr_scanned;
 		if (current_is_kswapd())
@@ -1320,8 +1366,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	} else {
 		nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
 			&page_list, &nr_scanned, sc->order,
-			sc->lumpy_reclaim_mode ?
-				ISOLATE_BOTH : ISOLATE_INACTIVE,
+			sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+					ISOLATE_INACTIVE : ISOLATE_BOTH,
 			zone, sc->mem_cgroup,
 			0, file);
 		/*
@@ -1339,7 +1385,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+	nr_reclaimed = shrink_page_list(&page_list, sc);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1350,7 +1396,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 		nr_active = clear_active_flags(&page_list, NULL);
 		count_vm_events(PGDEACTIVATE, nr_active);
 
-		nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+		set_lumpy_reclaim_mode(priority, sc, true);
+		nr_reclaimed += shrink_page_list(&page_list, sc);
 	}
 
 	local_irq_disable();
@@ -1727,21 +1774,6 @@ out:
 	}
 }
 
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
-{
-	/*
-	 * If we need a large contiguous chunk of memory, or have
-	 * trouble getting a small set of contiguous pages, we
-	 * will reclaim both active and inactive pages.
-	 */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		sc->lumpy_reclaim_mode = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
-		sc->lumpy_reclaim_mode = 1;
-	else
-		sc->lumpy_reclaim_mode = 0;
-}
-
 /*
  * This is a basic per-zone page freer.  Used by both kswapd and direct reclaim.
  */
@@ -1756,8 +1788,6 @@ static void shrink_zone(int priority, struct zone *zone,
 
 	get_scan_count(zone, sc, nr, priority);
 
-	set_lumpy_reclaim_mode(priority, sc);
-
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 					nr[LRU_INACTIVE_FILE]) {
 		for_each_evictable_lru(l) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 07/10] vmscan: Remove dead code in shrink_inactive_list()
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

After synchronous lumpy reclaim, page_list is guaranteed not to contain
active pages because page activation in shrink_page_list() disables lumpy
reclaim. Remove the dead code.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |    8 --------
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 21d1153..64f9ca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1334,7 +1334,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
-	unsigned long nr_active;
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
@@ -1389,13 +1388,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
-		/*
-		 * The attempt at page out may have made some
-		 * of the pages active, mark them inactive again.
-		 */
-		nr_active = clear_active_flags(&page_list, NULL);
-		count_vm_events(PGDEACTIVATE, nr_active);
-
 		set_lumpy_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, sc);
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 08/10] vmscan: isolate_lru_pages(): stop the neighbour search if a neighbour cannot be isolated
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
the PFN neighbours of the page being evicted. The neighbour search does not
stop even when a neighbour cannot be isolated, which is excessive because
the lumpy reclaim can no longer result in a successful higher-order
allocation. This patch stops the PFN neighbour search when an isolation
fails and moves on to the next block.
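
A minimal standalone sketch (not part of the patch) of the change in loop
shape; the isolation test is a stand-in for the real checks in the diff below.

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for "can this neighbour PFN be isolated?" */
static bool can_isolate(unsigned long pfn)
{
	return pfn != 13;	/* pretend PFN 13 is pinned */
}

/* Old behaviour: keep scanning the rest of the block despite a failure. */
static int scan_old(unsigned long base, int nr)
{
	unsigned long pfn;
	int isolated = 0;

	for (pfn = base; pfn < base + nr; pfn++)
		if (can_isolate(pfn))
			isolated++;
	return isolated;
}

/* New behaviour: one failure makes the block useless, so stop immediately. */
static int scan_new(unsigned long base, int nr)
{
	unsigned long pfn;
	int isolated = 0;

	for (pfn = base; pfn < base + nr; pfn++) {
		if (!can_isolate(pfn))
			break;
		isolated++;
	}
	return isolated;
}

int main(void)
{
	printf("old: %d pages isolated\n", scan_old(8, 16));
	printf("new: %d pages isolated\n", scan_new(8, 16));
	return 0;
}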

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   24 ++++++++++++++++--------
 1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64f9ca5..ff52b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				continue;
 
 			/* Avoid holes within the zone. */
-			if (unlikely(!pfn_valid_within(pfn)))
+			if (unlikely(!pfn_valid_within(pfn))) {
+				nr_lumpy_failed++;
 				break;
+			}
 
 			cursor_page = pfn_to_page(pfn);
 
 			/* Check that we have not crossed a zone boundary. */
-			if (unlikely(page_zone_id(cursor_page) != zone_id))
-				continue;
+			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
+				nr_lumpy_failed++;
+				break;
+			}
 
 			/*
 			 * If we don't have enough swap space, reclaiming of
@@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			 * pointless.
 			 */
 			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
-					!PageSwapCache(cursor_page))
-				continue;
+			    !PageSwapCache(cursor_page)) {
+				nr_lumpy_failed++;
+				break;
+			}
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
 				list_move(&cursor_page->lru, dst);
@@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 					nr_lumpy_dirty++;
 				scan++;
 			} else {
-				if (mode == ISOLATE_BOTH &&
-						page_count(cursor_page))
-					nr_lumpy_failed++;
+				/* the page is freed already. */
+				if (!page_count(cursor_page))
+					continue;
+				nr_lumpy_failed++;
+				break;
 			}
 		}
 	}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. Stack overflows have already
been observed on XFS, but the problem is not XFS-specific.

This patch prevents direct reclaim from writing back filesystem pages by
checking whether current is kswapd or the page is anonymous before writing
back.  If the dirty pages cannot be written back, they are placed back on
the LRU lists for either background writing by the BDI threads or kswapd.
If direct lumpy reclaim encounters dirty pages, the process stalls waiting
for the background flusher before trying to reclaim the pages again.

As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.
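
A compact standalone model (not part of the patch) of the new dirty-page
rule; the page state and the kswapd test are mocked as booleans, and the
real check sits in the PageDirty() block of the diff below.

#include <stdbool.h>
#include <stdio.h>

/*
 * May reclaim write this dirty page back itself?  Mirrors the rule the
 * patch adds: only kswapd may push file pages into ->writepage, while
 * anonymous pages headed for swap may still be written by direct reclaim.
 */
static bool may_writeback(bool page_is_file, bool caller_is_kswapd)
{
	if (page_is_file && !caller_is_kswapd)
		return false;	/* leave it to the flusher threads or kswapd */
	return true;
}

int main(void)
{
	printf("direct reclaim, file page: %d\n", may_writeback(true, false));
	printf("kswapd, file page:         %d\n", may_writeback(true, true));
	printf("direct reclaim, anon page: %d\n", may_writeback(false, false));
	return 0;
}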

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ff52b46..408c101 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scanning_global_lru(sc)	(1)
 #endif
 
+/* Direct lumpy reclaim waits up to five seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  * shrink_page_list() returns the number of reclaimed pages
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
-				      struct scan_control *sc)
+					struct scan_control *sc,
+					unsigned long *nr_still_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
+	unsigned long nr_dirty = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				nr_dirty++;
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
@@ -908,6 +922,8 @@ keep_lumpy:
 	free_page_list(&free_pages);
 
 	list_splice(&ret_pages, page_list);
+
+	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
 		return false;
 
+	/* If we cannot writeback, there is no point stalling */
+	if (!sc->may_writepage)
+		return false;
+
 	/* If we have relaimed everything on the isolated list, no stall */
 	if (nr_freed == nr_taken)
 		return false;
@@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			struct scan_control *sc, int priority, int file)
 {
 	LIST_HEAD(page_list);
+	LIST_HEAD(putback_list);
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc);
+	nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
+		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
 		set_lumpy_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, sc);
+
+		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+			struct page *page, *tmp;
+
+			/* Take off the clean pages marked for activation */
+			list_for_each_entry_safe(page, tmp, &page_list, lru) {
+				if (PageDirty(page) || PageWriteback(page))
+					continue;
+
+				list_del(&page->lru);
+				list_add(&page->lru, &putback_list);
+			}
+
+			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+
+			nr_reclaimed = shrink_page_list(&page_list, sc,
+							&nr_dirty);
+		}
 	}
 
+	list_splice(&putback_list, &page_list);
+
 	local_irq_disable();
 	if (current_is_kswapd())
 		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton, Mel Gorman

There are a number of cases where pages get cleaned, but two of concern
to this patch are:
  o When dirtying pages, processes may be throttled to clean pages if
    dirty_ratio is not met.
  o Pages belonging to inodes that have been dirty for longer than
    dirty_writeback_centisecs get cleaned.

The problem for reclaim is that dirty pages can reach the end of the LRU if
pages are being dirtied slowly enough that neither the throttling nor a
periodically waking flusher thread cleans them.

Background flush already cleans old or expired inodes first, but the expiry
time is too far in the future at the time of page reclaim. To mitigate
future problems, this patch wakes flusher threads to clean 4M of data,
an amount that should be manageable without causing congestion in many cases.

Ideally, the background flushers would only be cleaning pages belonging
to the zone being scanned but it's not clear if this would be of benefit
(less IO) or not (potentially less efficient IO if an inode is scattered
across multiple zones).
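
A small standalone sketch (not part of the patch) of how the wakeup is
sized by the diff below; a PAGE_SHIFT of 12 and a SWAP_CLUSTER_MAX of 32
are assumed here, the common values.

#include <stdio.h>

#define PAGE_SHIFT		12	/* assume 4K pages */
#define SWAP_CLUSTER_MAX	32UL	/* assumed reclaim batch size */

#define MAX_WRITEBACK		(4194304UL >> PAGE_SHIFT)	/* 4M == 1024 pages */
#define WRITEBACK_FACTOR	(MAX_WRITEBACK / SWAP_CLUSTER_MAX)

static unsigned long nr_writeback_pages(unsigned long nr_dirty, int laptop_mode)
{
	unsigned long want = nr_dirty * WRITEBACK_FACTOR;

	if (laptop_mode)
		return 0;	/* let the flushers size the writeback themselves */
	return want < MAX_WRITEBACK ? want : MAX_WRITEBACK;
}

int main(void)
{
	/* a few dirty pages still ask the flushers for a meaningful chunk */
	printf("%lu pages (%lu KB)\n", nr_writeback_pages(4, 0),
	       nr_writeback_pages(4, 0) << (PAGE_SHIFT - 10));
	/* but the request is capped at 4M however dirty the list is */
	printf("%lu pages (%lu KB)\n", nr_writeback_pages(1000, 0),
	       nr_writeback_pages(1000, 0) << (PAGE_SHIFT - 10));
	return 0;
}

Under those assumptions, even a handful of dirty pages requests roughly 512K
of background cleaning, and a heavily dirtied list never asks for more than
4M at a time.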

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/vmscan.c |   32 ++++++++++++++++++++++++++++++--
 1 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 408c101..33d27a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
 /* Direct lumpy reclaim waits up to five seconds for background cleaning */
 #define MAX_SWAP_CLEAN_WAIT 50
 
+/*
+ * When reclaim encounters dirty data, wakeup flusher threads to clean
+ * a maximum of 4M of data.
+ */
+#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
+#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
+static inline long nr_writeback_pages(unsigned long nr_dirty)
+{
+	return laptop_mode ? 0 :
+			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
+}
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
 						  struct scan_control *sc)
 {
@@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 					struct scan_control *sc,
+					int file,
 					unsigned long *nr_still_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
 	int pgactivate = 0;
 	unsigned long nr_dirty = 0;
+	unsigned long nr_dirty_seen = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
@@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
+			nr_dirty_seen++;
+
 			/*
 			 * Only kswapd can writeback filesystem pages to
 			 * avoid risk of stack overflow
@@ -923,6 +939,18 @@ keep_lumpy:
 
 	list_splice(&ret_pages, page_list);
 
+	/*
+	 * If reclaim is encountering dirty pages, it may be because
+	 * dirty pages are reaching the end of the LRU even though the
+	 * dirty_ratio may be satisfied. In this case, wake flusher
+	 * threads to pro-actively clean up to a maximum of
+	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
+	 * !may_writepage indicates that this is a direct reclaimer in
+	 * laptop mode avoiding disk spin-ups
+	 */
+	if (file && nr_dirty_seen && sc->may_writepage)
+		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
+
 	*nr_still_dirty = nr_dirty;
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
@@ -1414,7 +1442,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
+	nr_reclaimed = shrink_page_list(&page_list, sc, file, &nr_dirty);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1437,7 +1465,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 
 			nr_reclaimed = shrink_page_list(&page_list, sc,
-							&nr_dirty);
+							file, &nr_dirty);
 		}
 	}
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:49   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:49 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

*sigh*

The subject should have been [PATCH 0/10] of course.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-07 15:25     ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 11:47:26AM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
> 
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/backing-dev.h      |    2 +-
>  include/trace/events/writeback.h |    7 ++++
>  mm/backing-dev.c                 |   66 ++++++++++++++++++++++++++++++++++++-
>  mm/page_alloc.c                  |    4 +-
>  mm/vmscan.c                      |   26 ++++++++++++--
>  5 files changed, 96 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 35b0074..f1b402a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -285,7 +285,7 @@ enum {
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
>  void set_bdi_congested(struct backing_dev_info *bdi, int sync);
>  long congestion_wait(int sync, long timeout);
> -
> +long wait_iff_congested(struct zone *zone, int sync, long timeout);
>  
>  static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
>  {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 275d477..eeaf1f5 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
>  	TP_ARGS(usec_timeout, usec_delayed)
>  );
>  
> +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> +
> +	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> +
> +	TP_ARGS(usec_timeout, usec_delayed)
> +);
> +
>  #endif /* _TRACE_WRITEBACK_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 298975a..94b5433 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
>  	};
> +static atomic_t nr_bdi_congested[2];
>  
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	clear_bit(bit, &bdi->state);
> +	if (test_and_clear_bit(bit, &bdi->state))
> +		atomic_dec(&nr_bdi_congested[sync]);
>  	smp_mb__after_clear_bit();
>  	if (waitqueue_active(wqh))
>  		wake_up(wqh);
> @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	enum bdi_state bit;
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	set_bit(bit, &bdi->state);
> +	if (!test_and_set_bit(bit, &bdi->state))
> +		atomic_inc(&nr_bdi_congested[sync]);
>  }
>  EXPORT_SYMBOL(set_bdi_congested);
>  
> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
>  }
>  EXPORT_SYMBOL(congestion_wait);
>  
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
      wait_iff_congested

> + * @zone: A zone to consider the number of being being written back from
> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion.  If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
                                                and 

> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */
> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> +	long ret;
> +	unsigned long start = jiffies;
> +	DEFINE_WAIT(wait);
> +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> +	/*
> +	 * If there is no congestion, check the amount of writeback. If there
> +	 * is no significant writeback and no congestion, just cond_resched
> +	 */
> +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> +		unsigned long inactive, writeback;
> +
> +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				zone_page_state(zone, NR_INACTIVE_ANON);
> +		writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +		/*
> +		 * If less than half the inactive list is being written back,
> +		 * reclaim might as well continue
> +		 */
> +		if (writeback < inactive / 2) {

I am not sure this is best.

1. Can we fix the threshold at half of the inactive list without considering the various speed classes of storage?
2. Isn't there already writeback throttling at a higher layer? Do we need to care about it here?

Just out of curiosity. 
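
(As an aside on the nr_bdi_congested accounting earlier in the hunk: the
counter only moves when the congested bit actually changes state, so
repeated set_bdi_congested()/clear_bdi_congested() calls cannot skew it.
Below is a minimal userspace model of that pattern using C11 atomics --
it is a sketch with illustrative names, not the kernel code.)

	#include <stdatomic.h>
	#include <stdio.h>

	static atomic_ulong state;        /* stands in for bdi->state       */
	static atomic_int nr_congested;   /* stands in for nr_bdi_congested */

	static void model_set_congested(int bit)
	{
		unsigned long mask = 1UL << bit;

		/* fetch_or returns the old value, like test_and_set_bit() */
		if (!(atomic_fetch_or(&state, mask) & mask))
			atomic_fetch_add(&nr_congested, 1);
	}

	static void model_clear_congested(int bit)
	{
		unsigned long mask = 1UL << bit;

		if (atomic_fetch_and(&state, ~mask) & mask)
			atomic_fetch_sub(&nr_congested, 1);
	}

	int main(void)
	{
		model_set_congested(0);
		model_set_congested(0);  /* second set must not bump the counter */
		printf("congested: %d\n", atomic_load(&nr_congested)); /* 1 */
		model_clear_congested(0);
		printf("congested: %d\n", atomic_load(&nr_congested)); /* 0 */
		return 0;
	}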

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-07 15:26     ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 11:47:27AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> congestion_wait() mean "waiting queue congestion is cleared".  However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-07 15:28     ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 11:47:28AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> pages even if page is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 07/10] vmscan: Remove dead code in shrink_inactive_list()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-07 15:33     ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 11:47:30AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> After synchrounous lumpy reclaim, the page_list is guaranteed to not
> have active pages as page activation in shrink_page_list() disables lumpy
> reclaim. Remove the dead code.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-07 15:37     ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> neighbour pages of the eviction page. The neighbour search does not stop even
> if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> no longer result in a successful higher order allocation. This patch stops
> the PFN neighbour pages if an isolation fails and moves on to the next block.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   24 ++++++++++++++++--------
>  1 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64f9ca5..ff52b46 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  				continue;
>  
>  			/* Avoid holes within the zone. */
> -			if (unlikely(!pfn_valid_within(pfn)))
> +			if (unlikely(!pfn_valid_within(pfn))) {
> +				nr_lumpy_failed++;
>  				break;
> +			}
>  
>  			cursor_page = pfn_to_page(pfn);
>  
>  			/* Check that we have not crossed a zone boundary. */
> -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> -				continue;
> +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> +				nr_lumpy_failed++;
> +				break;
> +			}
>  
>  			/*
>  			 * If we don't have enough swap space, reclaiming of
> @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			 * pointless.
>  			 */
>  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> -					!PageSwapCache(cursor_page))
> -				continue;
> +			    !PageSwapCache(cursor_page)) {
> +				nr_lumpy_failed++;
> +				break;
> +			}
>  
>  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
>  				list_move(&cursor_page->lru, dst);
> @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  					nr_lumpy_dirty++;
>  				scan++;
>  			} else {
> -				if (mode == ISOLATE_BOTH &&

Why can we remove the ISOLATE_BOTH check?
Is this an intentional behavior change?

> -						page_count(cursor_page))
> -					nr_lumpy_failed++;
> +				/* the page is freed already. */
> +				if (!page_count(cursor_page))
> +					continue;
> +				nr_lumpy_failed++;
> +				break;
>  			}
>  		}
>  	}
> -- 
> 1.7.1
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08  3:14   ` KOSAKI Motohiro
  -1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08  3:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, linux-mm, linux-fsdevel, Linux Kernel List,
	Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
	Andrea Arcangeli, KAMEZAWA Hiroyuki, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

> There have been numerous reports of stalls that pointed at the problem being
> somewhere in the VM. There are multiple roots to the problems which means
> dealing with any of the root problems in isolation is tricky to justify on
> their own and they would still need integration testing. This patch series
> gathers together three different patch sets which in combination should
> tackle some of the root causes of latency problems being reported.
> 
> The first patch improves vmscan latency by tracking when pages get reclaimed
> by shrink_inactive_list. For this series, the most important results is
> being able to calculate the scanning/reclaim ratio as a measure of the
> amount of work being done by page reclaim.
> 
> Patches 2 and 3 account for the time spent in congestion_wait() and avoids
> calling going to sleep on congestion when it is unnecessary. This is expected
> to reduce stalls in situations where the system is under memory pressure
> but not due to congestion.
> 
> Patches 4-8 were originally developed by Kosaki Motohiro but reworked for
> this series. It has been noted that lumpy reclaim is far too aggressive and
> trashes the system somewhat. As SLUB uses high-order allocations, a large
> cost incurred by lumpy reclaim will be noticeable. It was also reported
> during transparent hugepage support testing that lumpy reclaim was trashing
> the system and these patches should mitigate that problem without disabling
> lumpy reclaim.

Wow, I'm sorry my laziness bothered you. I'll join in testing this patch series
ASAP and give feedback soon.





^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-08  6:15     ` Johannes Weiner
  -1 siblings, 0 replies; 133+ messages in thread
From: Johannes Weiner @ 2010-09-08  6:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Minchan Kim, Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon, Sep 06, 2010 at 11:47:27AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> congestion_wait() mean "waiting queue congestion is cleared".  However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

> ---
>  mm/vmscan.c |    2 --
>  1 files changed, 0 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eabe987..5979850 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1341,8 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	/* Check if we should syncronously wait for writeback */
>  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> -		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> -
>  		/*
>  		 * The attempt at page out may have made some
>  		 * of the pages active, mark them inactive again.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-08  6:16     ` Johannes Weiner
  -1 siblings, 0 replies; 133+ messages in thread
From: Johannes Weiner @ 2010-09-08  6:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Minchan Kim, Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon, Sep 06, 2010 at 11:47:28AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> pages even if page is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

> ---
>  mm/vmscan.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5979850..79bd812 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		page = lru_to_page(page_list);
>  		list_del(&page->lru);
>  
> -		if (!trylock_page(page))
> +		if (sync_writeback == PAGEOUT_IO_SYNC)
> +			lock_page(page);
> +		else if (!trylock_page(page))
>  			goto keep;
>  
>  		VM_BUG_ON(PageActive(page));
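
(As an analogy, the difference can be pictured with an ordinary mutex.
The sketch below is a userspace analogy of lock_page() vs
trylock_page(), not the kernel page lock: under contention the trylock
path gives up on the page while the synchronous path waits for it.)

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Returns true if the "page" was processed, false if it was skipped. */
	static bool process_page(bool synchronous)
	{
		if (synchronous)
			pthread_mutex_lock(&page_lock);     /* lock_page(): wait for it */
		else if (pthread_mutex_trylock(&page_lock) != 0)
			return false;                       /* trylock failed: goto keep */

		/* ... reclaim work on the locked page would go here ... */
		pthread_mutex_unlock(&page_lock);
		return true;
	}

	int main(void)
	{
		printf("async: %s\n", process_page(false) ? "reclaimed" : "kept");
		printf("sync:  %s\n", process_page(true) ? "reclaimed" : "kept");
		return 0;
	}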

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
  2010-09-08  3:14   ` KOSAKI Motohiro
@ 2010-09-08  8:38     ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08  8:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Wed, Sep 08, 2010 at 12:14:29PM +0900, KOSAKI Motohiro wrote:
> > There have been numerous reports of stalls that pointed at the problem being
> > somewhere in the VM. There are multiple roots to the problems which means
> > dealing with any of the root problems in isolation is tricky to justify on
> > their own and they would still need integration testing. This patch series
> > gathers together three different patch sets which in combination should
> > tackle some of the root causes of latency problems being reported.
> > 
> > The first patch improves vmscan latency by tracking when pages get reclaimed
> > by shrink_inactive_list. For this series, the most important results is
> > being able to calculate the scanning/reclaim ratio as a measure of the
> > amount of work being done by page reclaim.
> > 
> > Patches 2 and 3 account for the time spent in congestion_wait() and avoids
> > calling going to sleep on congestion when it is unnecessary. This is expected
> > to reduce stalls in situations where the system is under memory pressure
> > but not due to congestion.
> > 
> > Patches 4-8 were originally developed by Kosaki Motohiro but reworked for
> > this series. It has been noted that lumpy reclaim is far too aggressive and
> > trashes the system somewhat. As SLUB uses high-order allocations, a large
> > cost incurred by lumpy reclaim will be noticeable. It was also reported
> > during transparent hugepage support testing that lumpy reclaim was trashing
> > the system and these patches should mitigate that problem without disabling
> > lumpy reclaim.
> 
> Wow, I'm sorry my laziness bothered you. I'll join in testing this patch series
> ASAP and give feedback soon.
> 

It did not bother me at all. I generally agreed with the direction and
it seemed sensible to take them into consideration, particularly before
patches 9 and 10, and to make sure they all played nicely together.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-07 15:25     ` Minchan Kim
@ 2010-09-08 11:04       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 11:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> On Mon, Sep 06, 2010 at 11:47:26AM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> > 
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > <SNIP>
> > +/**
> > + * congestion_wait - wait for a backing_dev to become uncongested
>       wait_iff_congested
> 

Fixed, thanks.

> > + * @zone: A zone to consider the number of being being written back from
> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion.  If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no sigificant writeback or congestion, there is no point
>                                                 and 
> 

Why "and"? "or" makes sense because we avoid sleeping on either condition.

> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > +	long ret;
> > +	unsigned long start = jiffies;
> > +	DEFINE_WAIT(wait);
> > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > +	/*
> > +	 * If there is no congestion, check the amount of writeback. If there
> > +	 * is no significant writeback and no congestion, just cond_resched
> > +	 */
> > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > +		unsigned long inactive, writeback;
> > +
> > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > +		/*
> > +		 * If less than half the inactive list is being written back,
> > +		 * reclaim might as well continue
> > +		 */
> > +		if (writeback < inactive / 2) {
> 
> I am not sure this is best.
> 

I'm not saying it is. The objective is to identify a situation where
sleeping until the next write or congestion clears is pointless. We have
already identified that we are not congested so the question is "are we
writing a lot at the moment?". The assumption is that if there is a lot
of writing going on, we might as well sleep until one completes rather
than reclaiming more.

This is the first effort at identifying pointless sleeps. Better ones
might be identified in the future but that shouldn't stop us from making
a semi-sensible decision now.
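
To make that concrete, here is a userspace model of the decision being
made (a sketch, not the kernel code; zone_stats and worth_sleeping are
illustrative names and the threshold mirrors the writeback < inactive/2
test quoted above):

	#include <sched.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <unistd.h>

	struct zone_stats {
		unsigned long nr_inactive_file;
		unsigned long nr_inactive_anon;
		unsigned long nr_writeback;
	};

	/* Sleep only if a BDI is congested or writeback dominates the inactive list */
	static bool worth_sleeping(int nr_congested_bdis, const struct zone_stats *z)
	{
		unsigned long inactive = z->nr_inactive_file + z->nr_inactive_anon;

		if (nr_congested_bdis > 0)
			return true;
		return z->nr_writeback >= inactive / 2;
	}

	static void model_wait_iff_congested(int nr_congested_bdis,
					     const struct zone_stats *z,
					     unsigned int timeout_ms)
	{
		if (!worth_sleeping(nr_congested_bdis, z)) {
			sched_yield();          /* stands in for cond_resched() */
			return;
		}
		usleep(timeout_ms * 1000UL);    /* stands in for the waitqueue sleep */
	}

	int main(void)
	{
		struct zone_stats clean = { 1000, 1000, 10 };   /* little writeback */
		struct zone_stats busy  = { 1000, 1000, 1500 }; /* heavy writeback  */

		model_wait_iff_congested(0, &clean, 100);       /* just yields      */
		model_wait_iff_congested(0, &busy, 100);        /* sleeps ~100ms    */
		printf("sleep? clean=%d busy=%d\n",
		       worth_sleeping(0, &clean), worth_sleeping(0, &busy));
		return 0;
	}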

> 1. Can we fix the threshold at half of the inactive list without considering the various speed classes of storage?

We don't really have a good means of identifying the speed class of
storage. Worse, we are working on a per-zone basis here, not a per-BDI
basis. The pages being written back in the zone could be backed by
anything, so we cannot make decisions based on BDI speed.

> 2. Isn't there already writeback throttling at a higher layer? Do we need to care about it here?
> 

There is, but congestion_wait() and now wait_iff_congested() are part of
that throttling. We can see from the figures in the leader that
congestion_wait() is sleeping more than is necessary or smart.

> Just out of curiosity. 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-07 15:37     ` Minchan Kim
@ 2010-09-08 11:12       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 11:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 12:37:08AM +0900, Minchan Kim wrote:
> On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> > neighbour pages of the eviction page. The neighbour search does not stop even
> > if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> > no longer result in a successful higher order allocation. This patch stops
> > the PFN neighbour pages if an isolation fails and moves on to the next block.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   24 ++++++++++++++++--------
> >  1 files changed, 16 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 64f9ca5..ff52b46 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  				continue;
> >  
> >  			/* Avoid holes within the zone. */
> > -			if (unlikely(!pfn_valid_within(pfn)))
> > +			if (unlikely(!pfn_valid_within(pfn))) {
> > +				nr_lumpy_failed++;
> >  				break;
> > +			}
> >  
> >  			cursor_page = pfn_to_page(pfn);
> >  
> >  			/* Check that we have not crossed a zone boundary. */
> > -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> > -				continue;
> > +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > +				nr_lumpy_failed++;
> > +				break;
> > +			}
> >  
> >  			/*
> >  			 * If we don't have enough swap space, reclaiming of
> > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  			 * pointless.
> >  			 */
> >  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > -					!PageSwapCache(cursor_page))
> > -				continue;
> > +			    !PageSwapCache(cursor_page)) {
> > +				nr_lumpy_failed++;
> > +				break;
> > +			}
> >  
> >  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> >  				list_move(&cursor_page->lru, dst);
> > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  					nr_lumpy_dirty++;
> >  				scan++;
> >  			} else {
> > -				if (mode == ISOLATE_BOTH &&
> 
> Why can we remove the ISOLATE_BOTH check?

Because this is lumpy reclaim, whether we are isolating inactive pages,
active pages or both doesn't matter. The fact that we failed to isolate the
page and it still has a reference count means that a contiguous allocation
in that area will fail.
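
As a toy model of the resulting control flow (a userspace sketch, not
the vmscan code; the window stands in for the order-aligned PFN range
being scanned around the tail page):

	#include <stdio.h>

	enum probe { ISOLATED, ALREADY_FREE, IN_USE };

	/* Returns how many neighbour "pages" were isolated from the window. */
	static int scan_window(const enum probe *window, int n, int *nr_lumpy_failed)
	{
		int isolated = 0;

		for (int i = 0; i < n; i++) {
			switch (window[i]) {
			case ISOLATED:
				isolated++;
				break;
			case ALREADY_FREE:
				continue;        /* nothing to reclaim, keep scanning */
			case IN_USE:
				(*nr_lumpy_failed)++;
				return isolated; /* give up: the high-order block is lost */
			}
		}
		return isolated;
	}

	int main(void)
	{
		enum probe window[] = { ISOLATED, ALREADY_FREE, IN_USE, ISOLATED };
		int failed = 0;
		int got = scan_window(window, 4, &failed);

		printf("isolated %d, lumpy failures %d\n", got, failed); /* 1, 1 */
		return 0;
	}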

> Is this an intentional behavior change?
> 

Yes.

> > -						page_count(cursor_page))
> > -					nr_lumpy_failed++;
> > +				/* the page is freed already. */
> > +				if (!page_count(cursor_page))
> > +					continue;
> > +				nr_lumpy_failed++;
> > +				break;
> >  			}
> >  		}
> >  	}

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-08 11:25     ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 11:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 06:47:27PM +0800, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> congestion_wait() mean "waiting queue congestion is cleared".  However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-08 11:28     ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 11:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 06:47:28PM +0800, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> pages even if page is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-08 11:37     ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 11:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> neighbour pages of the eviction page. The neighbour search does not stop even
> if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> no longer result in a successful higher order allocation. This patch stops
> the PFN neighbour pages if an isolation fails and moves on to the next block.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   24 ++++++++++++++++--------
>  1 files changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64f9ca5..ff52b46 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  				continue;
>  
>  			/* Avoid holes within the zone. */
> -			if (unlikely(!pfn_valid_within(pfn)))
> +			if (unlikely(!pfn_valid_within(pfn))) {
> +				nr_lumpy_failed++;
>  				break;
> +			}
>  
>  			cursor_page = pfn_to_page(pfn);
>  
>  			/* Check that we have not crossed a zone boundary. */
> -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> -				continue;
> +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> +				nr_lumpy_failed++;
> +				break;
> +			}
>  
>  			/*
>  			 * If we don't have enough swap space, reclaiming of
> @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  			 * pointless.
>  			 */
>  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> -					!PageSwapCache(cursor_page))
> -				continue;
> +			    !PageSwapCache(cursor_page)) {
> +				nr_lumpy_failed++;
> +				break;
> +			}
>  
>  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
>  				list_move(&cursor_page->lru, dst);
> @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  					nr_lumpy_dirty++;
>  				scan++;
>  			} else {
> -				if (mode == ISOLATE_BOTH &&
> -						page_count(cursor_page))
> -					nr_lumpy_failed++;
> +				/* the page is freed already. */
> +				if (!page_count(cursor_page))
> +					continue;
> +				nr_lumpy_failed++;
> +				break;
>  			}
>  		}

The many nr_lumpy_failed++ can be moved here:

                if (pfn < end_pfn)
                        nr_lumpy_failed++;

Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-08 11:37     ` Wu Fengguang
@ 2010-09-08 12:50       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 12:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 07:37:34PM +0800, Wu Fengguang wrote:
> On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> > neighbour pages of the eviction page. The neighbour search does not stop even
> > if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> > no longer result in a successful higher order allocation. This patch stops
> > the PFN neighbour pages if an isolation fails and moves on to the next block.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   24 ++++++++++++++++--------
> >  1 files changed, 16 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 64f9ca5..ff52b46 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  				continue;
> >  
> >  			/* Avoid holes within the zone. */
> > -			if (unlikely(!pfn_valid_within(pfn)))
> > +			if (unlikely(!pfn_valid_within(pfn))) {
> > +				nr_lumpy_failed++;
> >  				break;
> > +			}
> >  
> >  			cursor_page = pfn_to_page(pfn);
> >  
> >  			/* Check that we have not crossed a zone boundary. */
> > -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> > -				continue;
> > +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > +				nr_lumpy_failed++;
> > +				break;
> > +			}
> >  
> >  			/*
> >  			 * If we don't have enough swap space, reclaiming of
> > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  			 * pointless.
> >  			 */
> >  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > -					!PageSwapCache(cursor_page))
> > -				continue;
> > +			    !PageSwapCache(cursor_page)) {
> > +				nr_lumpy_failed++;
> > +				break;
> > +			}
> >  
> >  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> >  				list_move(&cursor_page->lru, dst);
> > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> >  					nr_lumpy_dirty++;
> >  				scan++;
> >  			} else {
> > -				if (mode == ISOLATE_BOTH &&
> > -						page_count(cursor_page))
> > -					nr_lumpy_failed++;
> > +				/* the page is freed already. */
> > +				if (!page_count(cursor_page))
> > +					continue;
> > +				nr_lumpy_failed++;
> > +				break;
> >  			}
> >  		}
> 
> The many nr_lumpy_failed++ can be moved here:
> 
>                 if (pfn < end_pfn)
>                         nr_lumpy_failed++;
> 

Because the break stops the loop iterating, is there an advantage to
making it a pfn check instead? I might be misunderstanding your
suggestion.

> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> 

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-08 12:50       ` Mel Gorman
@ 2010-09-08 13:14         ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 13:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 08:50:44PM +0800, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 07:37:34PM +0800, Wu Fengguang wrote:
> > On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > 
> > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> > > no longer result in a successful higher order allocation. This patch stops
> > > the PFN neighbour pages if an isolation fails and moves on to the next block.
> > > 
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  mm/vmscan.c |   24 ++++++++++++++++--------
> > >  1 files changed, 16 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 64f9ca5..ff52b46 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  				continue;
> > >  
> > >  			/* Avoid holes within the zone. */
> > > -			if (unlikely(!pfn_valid_within(pfn)))
> > > +			if (unlikely(!pfn_valid_within(pfn))) {
> > > +				nr_lumpy_failed++;
> > >  				break;
> > > +			}
> > >  
> > >  			cursor_page = pfn_to_page(pfn);
> > >  
> > >  			/* Check that we have not crossed a zone boundary. */
> > > -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > -				continue;
> > > +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > +				nr_lumpy_failed++;
> > > +				break;
> > > +			}
> > >  
> > >  			/*
> > >  			 * If we don't have enough swap space, reclaiming of
> > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  			 * pointless.
> > >  			 */
> > >  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > -					!PageSwapCache(cursor_page))
> > > -				continue;
> > > +			    !PageSwapCache(cursor_page)) {
> > > +				nr_lumpy_failed++;
> > > +				break;
> > > +			}
> > >  
> > >  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > >  				list_move(&cursor_page->lru, dst);
> > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  					nr_lumpy_dirty++;
> > >  				scan++;
> > >  			} else {
> > > -				if (mode == ISOLATE_BOTH &&
> > > -						page_count(cursor_page))
> > > -					nr_lumpy_failed++;
> > > +				/* the page is freed already. */
> > > +				if (!page_count(cursor_page))
> > > +					continue;
> > > +				nr_lumpy_failed++;
> > > +				break;
> > >  			}
> > >  		}
> > 
> > The many nr_lumpy_failed++ can be moved here:
> > 
> >                 if (pfn < end_pfn)
> >                         nr_lumpy_failed++;
> > 
> 
> Because the break stops the loop iterating, is there an advantage to
> making it a pfn check instead? I might be misunderstanding your
> suggestion.

The complete view in my mind is

                for (; pfn < end_pfn; pfn++) {
                        if (failed 1)
                                break;
                        if (failed 2)
                                break;
                        if (failed 3)
                                break;
                }
                if (pfn < end_pfn)
                        nr_lumpy_failed++;

Sure it just reduces several lines of code :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-08 13:14         ` Wu Fengguang
@ 2010-09-08 13:27           ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 13:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 09:14:04PM +0800, Wu Fengguang wrote:
> On Wed, Sep 08, 2010 at 08:50:44PM +0800, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 07:37:34PM +0800, Wu Fengguang wrote:
> > > On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > 
> > > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> > > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > > if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> > > > no longer result in a successful higher order allocation. This patch stops
> > > > the PFN neighbour pages if an isolation fails and moves on to the next block.
> > > > 
> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > >  mm/vmscan.c |   24 ++++++++++++++++--------
> > > >  1 files changed, 16 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 64f9ca5..ff52b46 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > >  				continue;
> > > >  
> > > >  			/* Avoid holes within the zone. */
> > > > -			if (unlikely(!pfn_valid_within(pfn)))
> > > > +			if (unlikely(!pfn_valid_within(pfn))) {
> > > > +				nr_lumpy_failed++;
> > > >  				break;
> > > > +			}
> > > >  
> > > >  			cursor_page = pfn_to_page(pfn);
> > > >  
> > > >  			/* Check that we have not crossed a zone boundary. */
> > > > -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > > -				continue;
> > > > +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > > +				nr_lumpy_failed++;
> > > > +				break;
> > > > +			}
> > > >  
> > > >  			/*
> > > >  			 * If we don't have enough swap space, reclaiming of
> > > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > >  			 * pointless.
> > > >  			 */
> > > >  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > > -					!PageSwapCache(cursor_page))
> > > > -				continue;
> > > > +			    !PageSwapCache(cursor_page)) {
> > > > +				nr_lumpy_failed++;
> > > > +				break;
> > > > +			}
> > > >  
> > > >  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > > >  				list_move(&cursor_page->lru, dst);
> > > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > >  					nr_lumpy_dirty++;
> > > >  				scan++;
> > > >  			} else {
> > > > -				if (mode == ISOLATE_BOTH &&
> > > > -						page_count(cursor_page))
> > > > -					nr_lumpy_failed++;
> > > > +				/* the page is freed already. */
> > > > +				if (!page_count(cursor_page))
> > > > +					continue;
> > > > +				nr_lumpy_failed++;
> > > > +				break;
> > > >  			}
> > > >  		}
> > > 
> > > The many nr_lumpy_failed++ can be moved here:
> > > 
> > >                 if (pfn < end_pfn)
> > >                         nr_lumpy_failed++;
> > > 
> > 
> > Because the break stops the loop iterating, is there an advantage to
> > making it a pfn check instead? I might be misunderstanding your
> > suggestion.
> 
> The complete view in my mind is
> 
>                 for (; pfn < end_pfn; pfn++) {
>                         if (failed 1)
>                                 break;
>                         if (failed 2)
>                                 break;
>                         if (failed 3)
>                                 break;
>                 }
>                 if (pfn < end_pfn)
>                         nr_lumpy_failed++;
> 
> Sure it just reduces several lines of code :)
> 

Fair point. I applied the following patch on top.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33d27a4..54df972 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1091,18 +1091,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				continue;
 
 			/* Avoid holes within the zone. */
-			if (unlikely(!pfn_valid_within(pfn))) {
-				nr_lumpy_failed++;
+			if (unlikely(!pfn_valid_within(pfn)))
 				break;
-			}
 
 			cursor_page = pfn_to_page(pfn);
 
 			/* Check that we have not crossed a zone boundary. */
-			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
-				nr_lumpy_failed++;
+			if (unlikely(page_zone_id(cursor_page) != zone_id))
 				break;
-			}
 
 			/*
 			 * If we don't have enough swap space, reclaiming of
@@ -1110,10 +1106,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			 * pointless.
 			 */
 			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
-			    !PageSwapCache(cursor_page)) {
-				nr_lumpy_failed++;
+			    !PageSwapCache(cursor_page))
 				break;
-			}
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
 				list_move(&cursor_page->lru, dst);
@@ -1127,10 +1121,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				/* the page is freed already. */
 				if (!page_count(cursor_page))
 					continue;
-				nr_lumpy_failed++;
 				break;
 			}
 		}
+
+		/* If we break out of the loop above, lumpy reclaim failed */
+		if (pfn < end_pfn)
+			nr_lumpy_failed++;
 	}
 
 	*scanned = scan;

^ permalink raw reply related	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-08 11:04       ` Mel Gorman
@ 2010-09-08 14:52         ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-08 14:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > + * @zone: A zone to consider the number of being being written back from
> > > + * @sync: SYNC or ASYNC IO
> > > + * @timeout: timeout in jiffies
> > > + *
> > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > + * write congestion.  If no backing_devs are congested then the number of
> > > + * writeback pages in the zone are checked and compared to the inactive
> > > + * list. If there is no sigificant writeback or congestion, there is no point
> >                                                 and 
> > 
> 
> Why and? "or" makes sense because we avoid sleeping on either condition.

if (atomic_read(&nr_bdi_congested[sync]) == 0) {
        if (writeback < inactive / 2) {
                cond_resched();
                ..
                goto out;
        }
}

To avoid sleeping, both of the above conditions must be met.
So I thought "and" makes sense.
Am I missing something?

> 
> > > + * in sleeping but cond_resched() is called in case the current process has
> > > + * consumed its CPU quota.
> > > + */
> > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > +{
> > > +	long ret;
> > > +	unsigned long start = jiffies;
> > > +	DEFINE_WAIT(wait);
> > > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > +
> > > +	/*
> > > +	 * If there is no congestion, check the amount of writeback. If there
> > > +	 * is no significant writeback and no congestion, just cond_resched
> > > +	 */
> > > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > +		unsigned long inactive, writeback;
> > > +
> > > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > > +
> > > +		/*
> > > +		 * If less than half the inactive list is being written back,
> > > +		 * reclaim might as well continue
> > > +		 */
> > > +		if (writeback < inactive / 2) {
> > 
> > I am not sure this is best.
> > 
> 
> I'm not saying it is. The objective is to identify a situation where
> sleeping until the next write or congestion clears is pointless. We have
> already identified that we are not congested so the question is "are we
> writing a lot at the moment?". The assumption is that if there is a lot
> of writing going on, we might as well sleep until one completes rather
> than reclaiming more.
> 
> This is the first effort at identifying pointless sleeps. Better ones
> might be identified in the future but that shouldn't stop us making a
> semi-sensible decision now.

nr_bdi_congested is no problem since we have used it for a long time.
But you added a new rule about writeback.

The reason I pointed it out is that you added a new rule, and I want to let
others know about this change in case they have a good idea or any opinions.
I think that is one of the roles of a reviewer.

> 
> > 1. Without considering various speed class storage, could we fix it as half of inactive?
> 
> We don't really have a good means of identifying speed classes of
> storage. Worse, we are considering on a zone-basis here, not a BDI
> basis. The pages being written back in the zone could be backed by
> anything so we cannot make decisions based on BDI speed.

True. That is why I asked the question below.
As you said, we don't have enough information in vmscan,
so I am not sure how effective such a semi-sensible decision is.

I think the best approach is to throttle well in page-writeback.
But I am not an expert on that and don't have a concrete idea. Sorry.
So I can't insist on my nitpick. If nobody else objects,
I don't mind this, either.

Wu, do you have any opinion?

> 
> > 2. Isn't there any writeback throttling on above layer? Do we care of it in here?
> > 
> 
> There are but congestion_wait() and now wait_iff_congested() are part of
> that. We can see from the figures in the leader that congestion_wait()
> is sleeping more than is necessary or smart.
> 
> > Just out of curiosity. 
> > 
> 
> -- 
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-08 11:12       ` Mel Gorman
@ 2010-09-08 14:58         ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-08 14:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 12:12:30PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 12:37:08AM +0900, Minchan Kim wrote:
> > On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > 
> > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> > > no longer result in a successful higher order allocation. This patch stops
> > > the PFN neighbour pages if an isolation fails and moves on to the next block.
> > > 
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  mm/vmscan.c |   24 ++++++++++++++++--------
> > >  1 files changed, 16 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 64f9ca5..ff52b46 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  				continue;
> > >  
> > >  			/* Avoid holes within the zone. */
> > > -			if (unlikely(!pfn_valid_within(pfn)))
> > > +			if (unlikely(!pfn_valid_within(pfn))) {
> > > +				nr_lumpy_failed++;
> > >  				break;
> > > +			}
> > >  
> > >  			cursor_page = pfn_to_page(pfn);
> > >  
> > >  			/* Check that we have not crossed a zone boundary. */
> > > -			if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > -				continue;
> > > +			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > +				nr_lumpy_failed++;
> > > +				break;
> > > +			}
> > >  
> > >  			/*
> > >  			 * If we don't have enough swap space, reclaiming of
> > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  			 * pointless.
> > >  			 */
> > >  			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > -					!PageSwapCache(cursor_page))
> > > -				continue;
> > > +			    !PageSwapCache(cursor_page)) {
> > > +				nr_lumpy_failed++;
> > > +				break;
> > > +			}
> > >  
> > >  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > >  				list_move(&cursor_page->lru, dst);
> > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > >  					nr_lumpy_dirty++;
> > >  				scan++;
> > >  			} else {
> > > -				if (mode == ISOLATE_BOTH &&
> > 
> > Why can we remove ISOLATION_BOTH check?
> 
> Because this is lumpy reclaim and whether we are isolating inactive, active
> or both doesn't matter. The fact we failed to isolate the page and it has
> a reference count means that a contiguous allocation in that area will fail.
> 
> > Is it a intentionall behavior change?
> > 
> 
> Yes.

It looks good to me. 
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-08 21:23     ` Andrew Morton
  -1 siblings, 0 replies; 133+ messages in thread
From: Andrew Morton @ 2010-09-08 21:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig

On Mon,  6 Sep 2010 11:47:26 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
> 
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
> 

The patch series looks generally good.  Would like to see some testing
results ;)  A few touchups are planned so I'll await v2.

> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
>  	};
> +static atomic_t nr_bdi_congested[2];

Let's remember that a queue can get congested because of reads as well
as writes.  It's very rare for this to happen - it needs either a
zillion read()ing threads or someone going berserk with O_DIRECT aio,
etc.  Probably it doesn't matter much, but for memory reclaim purposes
read-congestion is somewhat irrelevant and a bit of thought is warranted.

vmscan currently only looks at *write* congestion, but in this patch
you secretly change that logic to newly look at write-or-read
congestion.  Talk to me.

>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	clear_bit(bit, &bdi->state);
> +	if (test_and_clear_bit(bit, &bdi->state))
> +		atomic_dec(&nr_bdi_congested[sync]);
>  	smp_mb__after_clear_bit();
>  	if (waitqueue_active(wqh))
>  		wake_up(wqh);

Worried.  Having a single slow disk getting itself gummed up will
affect the entire machine!

There's potential for pathological corner-case problems here.  "When I
do a big aio read from /dev/MySuckyUsbStick, all my CPUs get pegged in
page reclaim!".

What to do?

Of course, we'd very much prefer to know whether a queue which we're
interested in for writeback will block when we try to write to it. 
Much better than looking at all queues.

Important question: which of the current congestion_wait() call sites
are causing appreciable stalls?

I think a more accurate way of implementing this is to be smarter with
the may_write_to_queue()->bdi_write_congested() result.  If a previous
attempt to write off this LRU encountered congestion then fine, call
congestion_wait().  But if writeback is not hitting
may_write_to_queue()->bdi_write_congested() then that is the time to
avoid calling congestion_wait().

In other words, save the bdi_write_congested() result in the zone
struct in some fashion and inspect that before deciding to synchronize
behind the underlying device's write rate.  Not hitting a congested
device for this LRU?  Then don't wait for congested devices.
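
A very rough sketch of that idea (the field and helper names here are
invented; nothing like this exists in the posted series):

	/* Remember, per zone, when writeback off its LRU last hit a
	 * congested BDI, and only nap if that was recent.
	 */
	struct zone {
		...
		unsigned long	last_write_congested;	/* jiffies stamp */
	};

	/* may_write_to_queue() path: record that we hit congestion */
	static void zone_note_write_congestion(struct zone *zone)
	{
		zone->last_write_congested = jiffies;
	}

	/* do_try_to_free_pages() path: is sleeping worthwhile at all? */
	static bool zone_hit_congestion_recently(struct zone *zone)
	{
		return time_before(jiffies,
				   zone->last_write_congested + HZ / 10);
	}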

> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
>  }
>  EXPORT_SYMBOL(congestion_wait);
>  
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
> + * @zone: A zone to consider the number of being being written back from

That comment needs help.

> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion.'

write or read congestion!!

>  If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */

Document the return value?

> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> +	long ret;
> +	unsigned long start = jiffies;
> +	DEFINE_WAIT(wait);
> +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> +	/*
> +	 * If there is no congestion, check the amount of writeback. If there
> +	 * is no significant writeback and no congestion, just cond_resched
> +	 */
> +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> +		unsigned long inactive, writeback;
> +
> +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				zone_page_state(zone, NR_INACTIVE_ANON);
> +		writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +		/*
> +		 * If less than half the inactive list is being written back,
> +		 * reclaim might as well continue
> +		 */
> +		if (writeback < inactive / 2) {

This is all getting seriously inaccurate :(

> +			cond_resched();
> +
> +			/* In case we scheduled, work out time remaining */
> +			ret = timeout - (jiffies - start);
> +			if (ret < 0)
> +				ret = 0;
> +
> +			goto out;
> +		}
> +	}
> +
> +	/* Sleep until uncongested or a write happens */
> +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	ret = io_schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +
> +out:
> +	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> +					jiffies_to_usecs(jiffies - start));

Does this tracepoint tell us how often wait_iff_congested() is sleeping
versus how often it is returning immediately?

> +	return ret;
> +}
> +EXPORT_SYMBOL(wait_iff_congested);
>
> ...
>
> @@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  			sc->may_writepage = 1;
>  		}
>  
> -		/* Take a nap, wait for some writeback to complete */
> +		/* Take a nap if congested, wait for some writeback */
>  		if (!sc->hibernation_mode && sc->nr_scanned &&
> -		    priority < DEF_PRIORITY - 2)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		    priority < DEF_PRIORITY - 2) {
> +			struct zone *active_zone = NULL;
> +			unsigned long max_writeback = 0;
> +			for_each_zone_zonelist(zone, z, zonelist,
> +					gfp_zone(sc->gfp_mask)) {
> +				unsigned long writeback;
> +
> +				/* Initialise for first zone */
> +				if (active_zone == NULL)
> +					active_zone = zone;
> +
> +				writeback = zone_page_state(zone, NR_WRITEBACK);
> +				if (writeback > max_writeback) {
> +					max_writeback = writeback;
> +					active_zone = zone;
> +				}
> +			}
> +
> +			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> +		}

Again, we would benefit from more accuracy here.  In my above
suggestion I'm assuming that the (congestion) result of the most recent
attempt to perform writeback is a predictor of the next attempt.

Doing that on a kernel-wide basis would be rather inaccurate on large
machines in some scenarios.  Storing the state info in the zone would
help.


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-09  3:02     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon,  6 Sep 2010 11:47:26 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
> 
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  include/linux/backing-dev.h      |    2 +-
>  include/trace/events/writeback.h |    7 ++++
>  mm/backing-dev.c                 |   66 ++++++++++++++++++++++++++++++++++++-
>  mm/page_alloc.c                  |    4 +-
>  mm/vmscan.c                      |   26 ++++++++++++--
>  5 files changed, 96 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 35b0074..f1b402a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -285,7 +285,7 @@ enum {
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
>  void set_bdi_congested(struct backing_dev_info *bdi, int sync);
>  long congestion_wait(int sync, long timeout);
> -
> +long wait_iff_congested(struct zone *zone, int sync, long timeout);
>  
>  static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
>  {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 275d477..eeaf1f5 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
>  	TP_ARGS(usec_timeout, usec_delayed)
>  );
>  
> +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> +
> +	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> +
> +	TP_ARGS(usec_timeout, usec_delayed)
> +);
> +
>  #endif /* _TRACE_WRITEBACK_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 298975a..94b5433 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
>  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
>  	};
> +static atomic_t nr_bdi_congested[2];
>  
>  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	clear_bit(bit, &bdi->state);
> +	if (test_and_clear_bit(bit, &bdi->state))
> +		atomic_dec(&nr_bdi_congested[sync]);
>  	smp_mb__after_clear_bit();
>  	if (waitqueue_active(wqh))
>  		wake_up(wqh);
> @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  	enum bdi_state bit;
>  
>  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> -	set_bit(bit, &bdi->state);
> +	if (!test_and_set_bit(bit, &bdi->state))
> +		atomic_inc(&nr_bdi_congested[sync]);
>  }
>  EXPORT_SYMBOL(set_bdi_congested);
>  
> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
>  }
>  EXPORT_SYMBOL(congestion_wait);
>  
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
> + * @zone: A zone to consider the number of being being written back from
> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion.  If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */
> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> +	long ret;
> +	unsigned long start = jiffies;
> +	DEFINE_WAIT(wait);
> +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> +	/*
> +	 * If there is no congestion, check the amount of writeback. If there
> +	 * is no significant writeback and no congestion, just cond_resched
> +	 */
> +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> +		unsigned long inactive, writeback;
> +
> +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> +				zone_page_state(zone, NR_INACTIVE_ANON);
> +		writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> +		/*
> +		 * If less than half the inactive list is being written back,
> +		 * reclaim might as well continue
> +		 */
> +		if (writeback < inactive / 2) {

Hmm.. can't we have a way to find a page which can just be dropped without
writeback, rather than sleeping? I think we can throttle the number of victims
to avoid I/O congestion, as pages/tick.... if that budget is exhausted, OK, we
should sleep.
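
A crude sketch of that pages/tick idea, just to make it concrete (all the
names are invented and locking is ignored):

	#define RECLAIM_WRITE_BUDGET	32	/* victims allowed per tick */

	static unsigned long budget_tick;	/* jiffy the budget belongs to */
	static unsigned int  budget_left;

	/* true if reclaim may pick one more writeback victim right now */
	static bool reclaim_victim_allowed(void)
	{
		if (budget_tick != jiffies) {	/* new tick, refill budget */
			budget_tick = jiffies;
			budget_left = RECLAIM_WRITE_BUDGET;
		}
		if (budget_left) {
			budget_left--;
			return true;
		}
		return false;		/* exhausted, fall back to sleeping */
	}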

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-09  3:03     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon,  6 Sep 2010 11:47:27 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> congestion_wait() mean "waiting queue congestion is cleared".  However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-09  3:04     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon,  6 Sep 2010 11:47:28 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> pages even if page is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 06/10] vmscan: Narrow the scenarios lumpy reclaim uses synchrounous reclaim
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-09  3:14     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon,  6 Sep 2010 11:47:29 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> shrink_page_list() can decide to give up reclaiming a page under a
> number of conditions such as
> 
>   1. trylock_page() failure
>   2. page is unevictable
>   3. zone reclaim and page is mapped
>   4. PageWriteback() is true
>   5. page is swapbacked and swap is full
>   6. add_to_swap() failure
>   7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
>   8. page is pinned
>   9. IO queue is congested
>  10. pageout() start IO, but not finished
> 
> When lumpy reclaim, all of failure result in entering synchronous lumpy
> reclaim but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and
> (8), there is no point retrying.  This patch causes lumpy reclaim to abort
> when it is known it will fail.
> 
> Case (9) is more interesting. current behavior is,
>   1. start shrink_page_list(async)
>   2. found queue_congested()
>   3. skip pageout write
>   4. still start shrink_page_list(sync)
>   5. wait on a lot of pages
>   6. again, found queue_congested()
>   7. give up pageout write again
> 
> So, it's meaningless time wasting. However, just skipping page reclaim is
> also not a good as as x86 allocating a huge page needs 512 pages for example.
> It can have more dirty pages than queue congestion threshold (~=128).
> 
> After this patch, pageout() behaves as follows;
> 
>  - If order > PAGE_ALLOC_COSTLY_ORDER
> 	Ignore queue congestion always.
>  - If order <= PAGE_ALLOC_COSTLY_ORDER
> 	skip write page and disable lumpy reclaim.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

seems nice.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
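
To make the described policy concrete, the pageout() decision boils down to
roughly the following (an illustrative sketch with an invented helper name,
not the posted hunk):

	if (bdi_write_congested(bdi) &&
	    sc->order <= PAGE_ALLOC_COSTLY_ORDER) {
		/* low order: skip the write and stop lumpy retries */
		disable_lumpy_reclaim_mode(sc);
		return PAGE_KEEP;
	}
	/* order > PAGE_ALLOC_COSTLY_ORDER: write back despite congestion */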



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-09  3:04     ` KAMEZAWA Hiroyuki
@ 2010-09-09  3:15       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, linux-fsdevel, Linux Kernel List,
	Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
	Andrea Arcangeli, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Thu, 9 Sep 2010 12:04:48 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon,  6 Sep 2010 11:47:28 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > 
> > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > pages even if page is locked. This patch uses lock_page() instead of
> > trylock_page() in this case.
> > 
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
Ah...... but can't this change cause a deadlock?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-09  3:17     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon,  6 Sep 2010 11:47:31 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> 
> isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> neighbour pages of the eviction page. The neighbour search does not stop even
> if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> no longer result in a successful higher order allocation. This patch stops
> the PFN neighbour pages if an isolation fails and moves on to the next block.
> 
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-09  3:22     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09  3:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Mon,  6 Sep 2010 11:47:33 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> There are a number of cases where pages get cleaned but two of concern
> to this patch are;
>   o When dirtying pages, processes may be throttled to clean pages if
>     dirty_ratio is not met.
>   o Pages belonging to inodes dirtied longer than
>     dirty_writeback_centisecs get cleaned.
> 
> The problem for reclaim is that dirty pages can reach the end of the LRU if
> pages are being dirtied slowly so that neither the throttling or a flusher
> thread waking periodically cleans them.
> 
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 4M of data -
> an amount that should be manageable without causing congestion in many cases.
> 
> Ideally, the background flushers would only be cleaning pages belonging
> to the zone being scanned but it's not clear if this would be of benefit
> (less IO) or not (potentially less efficient IO if an inode is scattered
> across multiple zones).
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  mm/vmscan.c |   32 ++++++++++++++++++++++++++++++--
>  1 files changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 408c101..33d27a4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  /* Direct lumpy reclaim waits up to five seconds for background cleaning */
>  #define MAX_SWAP_CLEAN_WAIT 50
>  
> +/*
> + * When reclaim encounters dirty data, wakeup flusher threads to clean
> + * a maximum of 4M of data.
> + */
> +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> +static inline long nr_writeback_pages(unsigned long nr_dirty)
> +{
> +	return laptop_mode ? 0 :
> +			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> +}
> +
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
>  						  struct scan_control *sc)
>  {
> @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  					struct scan_control *sc,
> +					int file,
>  					unsigned long *nr_still_dirty)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
>  	int pgactivate = 0;
>  	unsigned long nr_dirty = 0;
> +	unsigned long nr_dirty_seen = 0;
>  	unsigned long nr_reclaimed = 0;
>  
>  	cond_resched();
> @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		if (PageDirty(page)) {
> +			nr_dirty_seen++;
> +
>  			/*
>  			 * Only kswapd can writeback filesystem pages to
>  			 * avoid risk of stack overflow
> @@ -923,6 +939,18 @@ keep_lumpy:
>  
>  	list_splice(&ret_pages, page_list);
>  
> +	/*
> +	 * If reclaim is encountering dirty pages, it may be because
> +	 * dirty pages are reaching the end of the LRU even though the
> +	 * dirty_ratio may be satisified. In this case, wake flusher
> +	 * threads to pro-actively clean up to a maximum of
> +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> +	 * !may_writepage indicates that this is a direct reclaimer in
> +	 * laptop mode avoiding disk spin-ups
> +	 */
> +	if (file && nr_dirty_seen && sc->may_writepage)
> +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> +

Thank you. Ok, I'll check what happens in memcg.

Can I add
	if (sc->memcg) {
		memcg_check_flusher_wakeup()
	}
or something like that here?

Hm, maybe memcg should wake up the flusher at the start of try_to_free_memory_cgroup_pages().
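
Separately, for reference, the cap in the quoted hunk works out as follows
with 4K pages (PAGE_SHIFT of 12) and assuming SWAP_CLUSTER_MAX is 32:

	MAX_WRITEBACK    = 4194304 >> 12 = 1024 pages  (4MB)
	WRITEBACK_FACTOR = 1024 / 32     = 32
	nr_writeback_pages(nr_dirty) = min(1024, nr_dirty * 32) pages,
	                               or 0 when laptop_mode is set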

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-09  3:15       ` KAMEZAWA Hiroyuki
@ 2010-09-09  3:25         ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-09  3:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, linux-mm, linux-fsdevel, Linux Kernel List,
	Rik van Riel, Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Thu, Sep 09, 2010 at 11:15:47AM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 9 Sep 2010 12:04:48 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon,  6 Sep 2010 11:47:28 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > 
> > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > pages even if page is locked. This patch uses lock_page() instead of
> > > trylock_page() in this case.
> > > 
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> Ah......but can't this change cause dead lock ??

You mean the task goes for page allocation while holding some page
lock? Seems possible.
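
For the record, the scenario being worried about looks something like this
(illustrative call chain only, not code from the series):

	/*
	 * Task A:
	 *   lock_page(page);
	 *   kmalloc(..., GFP_KERNEL);    enters direct reclaim under pressure
	 *     -> shrink_page_list(PAGEOUT_IO_SYNC)      sync lumpy reclaim
	 *        -> lock_page(page)      the same page is on the LRU, so
	 *                                task A sleeps on a lock that only
	 *                                task A itself can release
	 *
	 * With trylock_page() the second attempt simply fails and reclaim
	 * skips the page, which is why the deadlock cannot happen today.
	 */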

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-09  3:15       ` KAMEZAWA Hiroyuki
@ 2010-09-09  4:13         ` KOSAKI Motohiro
  -1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09  4:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: kosaki.motohiro, Mel Gorman, linux-mm, linux-fsdevel,
	Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

> On Thu, 9 Sep 2010 12:04:48 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon,  6 Sep 2010 11:47:28 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > 
> > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > pages even if page is locked. This patch uses lock_page() instead of
> > > trylock_page() in this case.
> > > 
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> Ah......but can't this change cause dead lock ??

Yes, this patch is purely crappy, please drop it. I guess I was poisoned
by a poisonous mushroom from Mario Bros.





^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-08 14:52         ` Minchan Kim
@ 2010-09-09  8:54           ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09  8:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Wed, Sep 08, 2010 at 11:52:45PM +0900, Minchan Kim wrote:
> On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > > + * @zone: A zone to consider the number of being being written back from
> > > > + * @sync: SYNC or ASYNC IO
> > > > + * @timeout: timeout in jiffies
> > > > + *
> > > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > > + * write congestion.  If no backing_devs are congested then the number of
> > > > + * writeback pages in the zone are checked and compared to the inactive
> > > > + * list. If there is no sigificant writeback or congestion, there is no point
> > >                                                 and 
> > > 
> > 
> > Why and? "or" makes sense because we avoid sleeping on either condition.
> 
> if (nr_bdi_congested[sync]) == 0) {
>         if (writeback < inactive / 2) {
>                 cond_resched();
>                 ..
>                 goto out
>         }
> }
> 
> for avoiding sleeping, above two condition should meet. 

That comment of mine is badly written. Is this any clearer?

/**
 * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
 * @zone: A zone to consider the number of pages being written back from
 * @sync: SYNC or ASYNC IO
 * @timeout: timeout in jiffies
 *
 * In the event of a congested backing_dev (any backing_dev) or a given @zone
 * having a large number of pages in writeback, this waits for up to @timeout
 * jiffies for either a BDI to exit congestion or a write to complete.
 *
 * If there is no congestion and few pending writes, then cond_resched()
 * is called to yield the processor if necessary but otherwise does not
 * sleep.
 */

> > 
> > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > + * consumed its CPU quota.
> > > > + */
> > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > +{
> > > > +	long ret;
> > > > +	unsigned long start = jiffies;
> > > > +	DEFINE_WAIT(wait);
> > > > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > +
> > > > +	/*
> > > > +	 * If there is no congestion, check the amount of writeback. If there
> > > > +	 * is no significant writeback and no congestion, just cond_resched
> > > > +	 */
> > > > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > +		unsigned long inactive, writeback;
> > > > +
> > > > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > > > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > +
> > > > +		/*
> > > > +		 * If less than half the inactive list is being written back,
> > > > +		 * reclaim might as well continue
> > > > +		 */
> > > > +		if (writeback < inactive / 2) {
> > > 
> > > I am not sure this is best.
> > > 
> > 
> > I'm not saying it is. The objective is to identify a situation where
> > sleeping until the next write or congestion clears is pointless. We have
> > already identified that we are not congested so the question is "are we
> > writing a lot at the moment?". The assumption is that if there is a lot
> > of writing going on, we might as well sleep until one completes rather
> > than reclaiming more.
> > 
> > This is the first effort at identifying pointless sleeps. Better ones
> > might be identified in the future but that shouldn't stop us making a
> > semi-sensible decision now.
> 
> nr_bdi_congested is no problem since we have used it for a long time.
> But you added new rule about writeback. 
> 

Yes, I'm trying to add a new rule about throttling in the page allocator
and from vmscan. As you can see from the results in the leader, we are
currently sleeping more than we need to.

> Why I pointed out is that you added new rule and I hope let others know
> this change since they have a good idea or any opinions. 
> I think it's a one of roles as reviewer.
> 

Of course.

> > 
> > > 1. Without considering various speed class storage, could we fix it as half of inactive?
> > 
> > We don't really have a good means of identifying speed classes of
> > storage. Worse, we are considering on a zone-basis here, not a BDI
> > basis. The pages being written back in the zone could be backed by
> > anything so we cannot make decisions based on BDI speed.
> 
> True. So it's why I have below question.
> As you said, we don't have enough information in vmscan.
> So I am not sure how effective such semi-sensible decision is. 
> 

What additional metrics would you apply beyond the ones I used in the
leader mail?

> I think best is to throttle in page-writeback well. 

I do not think there is a problem as such in page writeback throttling.
The problem is that we go to sleep when there is no congestion and no writes
are in progress. In that case we sleep for the full timeout for no reason,
and this is what I'm trying to avoid.
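
As a rough illustration of the difference (condensed for clarity, not the
exact implementations):

	/* congestion_wait(): sleeps unconditionally on the congestion
	 * waitqueue, so if no BDI is congested nothing ever wakes us
	 * and we sleep for the full timeout.
	 */
	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);

	/* wait_iff_congested(): skip the sleep entirely if no BDI is
	 * congested and the zone has little writeback; just yield.
	 */
	if (atomic_read(&nr_bdi_congested[sync]) == 0 &&
	    writeback < inactive / 2) {
		cond_resched();
		return timeout - (jiffies - start);
	}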

> But I am not a expert about that and don't have any idea. Sorry.

Don't be, this is something that needs thinking about!

> So I can't insist on my nitpick. If others don't have any objection,
> I don't mind this, either. 
> 
> Wu, Do you have any opinion?
> 
> > 
> > > 2. Isn't there any writeback throttling on above layer? Do we care of it in here?
> > > 
> > 
> > There are but congestion_wait() and now wait_iff_congested() are part of
> > that. We can see from the figures in the leader that congestion_wait()
> > is sleeping more than is necessary or smart.
> > 
> > > Just out of curiosity. 
> > > 
> > 
> > -- 
> > Mel Gorman
> > Part-time Phd Student                          Linux Technology Center
> > University of Limerick                         IBM Dublin Software Lab
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-09  3:02     ` KAMEZAWA Hiroyuki
@ 2010-09-09  8:58       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09  8:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Thu, Sep 09, 2010 at 12:02:31PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon,  6 Sep 2010 11:47:26 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> > 
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  include/linux/backing-dev.h      |    2 +-
> >  include/trace/events/writeback.h |    7 ++++
> >  mm/backing-dev.c                 |   66 ++++++++++++++++++++++++++++++++++++-
> >  mm/page_alloc.c                  |    4 +-
> >  mm/vmscan.c                      |   26 ++++++++++++--
> >  5 files changed, 96 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 35b0074..f1b402a 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -285,7 +285,7 @@ enum {
> >  void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> >  void set_bdi_congested(struct backing_dev_info *bdi, int sync);
> >  long congestion_wait(int sync, long timeout);
> > -
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout);
> >  
> >  static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> >  {
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index 275d477..eeaf1f5 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
> >  	TP_ARGS(usec_timeout, usec_delayed)
> >  );
> >  
> > +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> > +
> > +	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> > +
> > +	TP_ARGS(usec_timeout, usec_delayed)
> > +);
> > +
> >  #endif /* _TRACE_WRITEBACK_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 298975a..94b5433 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> >  	};
> > +static atomic_t nr_bdi_congested[2];
> >  
> >  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> >  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> > -	clear_bit(bit, &bdi->state);
> > +	if (test_and_clear_bit(bit, &bdi->state))
> > +		atomic_dec(&nr_bdi_congested[sync]);
> >  	smp_mb__after_clear_bit();
> >  	if (waitqueue_active(wqh))
> >  		wake_up(wqh);
> > @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  	enum bdi_state bit;
> >  
> >  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> > -	set_bit(bit, &bdi->state);
> > +	if (!test_and_set_bit(bit, &bdi->state))
> > +		atomic_inc(&nr_bdi_congested[sync]);
> >  }
> >  EXPORT_SYMBOL(set_bdi_congested);
> >  
> > @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> >  }
> >  EXPORT_SYMBOL(congestion_wait);
> >  
> > +/**
> > + * congestion_wait - wait for a backing_dev to become uncongested
> > + * @zone: A zone to consider the number of being being written back from
> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion.  If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no sigificant writeback or congestion, there is no point
> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > +	long ret;
> > +	unsigned long start = jiffies;
> > +	DEFINE_WAIT(wait);
> > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > +	/*
> > +	 * If there is no congestion, check the amount of writeback. If there
> > +	 * is no significant writeback and no congestion, just cond_resched
> > +	 */
> > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > +		unsigned long inactive, writeback;
> > +
> > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > +		/*
> > +		 * If less than half the inactive list is being written back,
> > +		 * reclaim might as well continue
> > +		 */
> > +		if (writeback < inactive / 2) {
> 
> Hmm..can't we have a way that "find a page which can be just dropped without writeback"
> rather than sleeping ?

Sure, we could just scan for clean pages, but then younger clean pages would
be reclaimed before old dirty pages because we would not be waiting on
writeback. It's a significant change.

> I think we can throttole the number of victims for avoidng I/O
> congestion as pages/tick....if exhausted, ok, we should sleep.
> 

I think it would be tricky to throttle effectively based on time. I find it
easier to think about throttling in terms of a congested device, the number
of dirty pages in a zone or the number of pages currently being written back,
because these are the events that can prevent reclaim from taking place.
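
To make that concrete, the kind of signals I have in mind are along these
lines (illustrative only):

	/* signals that reclaim could reasonably throttle on */
	congested = atomic_read(&nr_bdi_congested[BLK_RW_ASYNC]);
	dirty     = zone_page_state(zone, NR_FILE_DIRTY);
	writeback = zone_page_state(zone, NR_WRITEBACK);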

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-09  4:13         ` KOSAKI Motohiro
@ 2010-09-09  9:22           ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09  9:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel, Linux Kernel List,
	Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
	Andrea Arcangeli, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Thu, Sep 09, 2010 at 01:13:22PM +0900, KOSAKI Motohiro wrote:
> > On Thu, 9 Sep 2010 12:04:48 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Mon,  6 Sep 2010 11:47:28 +0100
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > 
> > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > 
> > > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > > pages even if page is locked. This patch uses lock_page() instead of
> > > > trylock_page() in this case.
> > > > 
> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > 
> > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > Ah......but can't this change cause dead lock ??
> 
> Yes, this patch is purely crappy. please drop. I guess I was poisoned
> by poisonous mushroom of Mario Bros.
> 

Let's be clear on what the exact deadlock conditions are. The ones I had
thought about when I felt this patch was OK were:

o We are not holding the LRU lock (or any lock, we just called cond_resched())
o We do not have another page locked because we cannot lock multiple pages
o Kswapd will never be in LUMPY_MODE_SYNC so it is not getting blocked
o lock_page() itself is not allocating anything that we could recurse on

One potential deadlock would be if the direct reclaimer held a page
lock and ended up here, but is that situation even allowed? I could not
think of an obvious example of when this would happen. Similarly,
deadlock situations with mmap_sem shouldn't happen unless multiple page
locks are being taken.
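
For reference, the change being discussed is roughly the following
(simplified from the patch; the mode test is approximate):

	if (!trylock_page(page)) {
		/*
		 * For synchronous lumpy reclaim, wait for the page lock
		 * instead of skipping the page.
		 */
		if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
			lock_page(page);
		else
			goto keep;
	}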

(prepares to feel foolish)

What did I miss?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-09  3:22     ` KAMEZAWA Hiroyuki
@ 2010-09-09  9:32       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09  9:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Thu, Sep 09, 2010 at 12:22:28PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon,  6 Sep 2010 11:47:33 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are;
> >   o When dirtying pages, processes may be throttled to clean pages if
> >     dirty_ratio is not met.
> >   o Pages belonging to inodes dirtied longer than
> >     dirty_writeback_centisecs get cleaned.
> > 
> > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > pages are being dirtied slowly so that neither the throttling or a flusher
> > thread waking periodically cleans them.
> > 
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 4M of data -
> > an amount that should be manageable without causing congestion in many cases.
> > 
> > Ideally, the background flushers would only be cleaning pages belonging
> > to the zone being scanned but it's not clear if this would be of benefit
> > (less IO) or not (potentially less efficient IO if an inode is scattered
> > across multiple zones).
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> >  mm/vmscan.c |   32 ++++++++++++++++++++++++++++++--
> >  1 files changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 408c101..33d27a4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> >  #define MAX_SWAP_CLEAN_WAIT 50
> >  
> > +/*
> > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > + * a maximum of 4M of data.
> > + */
> > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > +{
> > +	return laptop_mode ? 0 :
> > +			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > +}
> > +
> >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> >  						  struct scan_control *sc)
> >  {
> > @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >  					struct scan_control *sc,
> > +					int file,
> >  					unsigned long *nr_still_dirty)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> >  	int pgactivate = 0;
> >  	unsigned long nr_dirty = 0;
> > +	unsigned long nr_dirty_seen = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> >  	cond_resched();
> > @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		}
> >  
> >  		if (PageDirty(page)) {
> > +			nr_dirty_seen++;
> > +
> >  			/*
> >  			 * Only kswapd can writeback filesystem pages to
> >  			 * avoid risk of stack overflow
> > @@ -923,6 +939,18 @@ keep_lumpy:
> >  
> >  	list_splice(&ret_pages, page_list);
> >  
> > +	/*
> > +	 * If reclaim is encountering dirty pages, it may be because
> > +	 * dirty pages are reaching the end of the LRU even though the
> > +	 * dirty_ratio may be satisified. In this case, wake flusher
> > +	 * threads to pro-actively clean up to a maximum of
> > +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > +	 * !may_writepage indicates that this is a direct reclaimer in
> > +	 * laptop mode avoiding disk spin-ups
> > +	 */
> > +	if (file && nr_dirty_seen && sc->may_writepage)
> > +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > +
> 
> Thank you. Ok, I'll check what happens in memcg.
> 

Thanks

> Can I add
> 	if (sc->memcg) {
> 		memcg_check_flusher_wakeup()
> 	}
> or some here ?
> 

It seems reasonable.

> Hm, maybe memcg should wake up flusher at starting try_to_free_memory_cgroup_pages().
> 

I'm afraid I cannot make a judgement call on which is best, as I am not
very familiar with how cgroups behave in comparison to normal reclaim.
There could easily be a follow-on patch, though, that is cgroup-specific.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-08 21:23     ` Andrew Morton
@ 2010-09-09 10:43       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 10:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig

On Wed, Sep 08, 2010 at 02:23:30PM -0700, Andrew Morton wrote:
> On Mon,  6 Sep 2010 11:47:26 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> > 
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> > 
> 
> The patch series looks generally good.  Would like to see some testing
> results ;) 

They are all in the leader, based on a test suite that I'm bound to stick
a README on and release one of these days :/

> A few touchups are planned so I'll await v2.
> 

Good plan.

> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> >  	};
> > +static atomic_t nr_bdi_congested[2];
> 
> Let's remember that a queue can get congested because of reads as well
> as writes.  It's very rare for this to happen - it needs either a
> zillion read()ing threads or someone going berzerk with O_DIRECT aio,
> etc.  Probably it doesn't matter much, but for memory reclaim purposes
> read-congestion is somewhat irrelevant and a bit of thought is warranted.
> 

This is an interesting point and would be well worth digging into if
we got a new bug report about stalls under heavy reads.

> vmscan currently only looks at *write* congestion, but in this patch
> you secretly change that logic to newly look at write-or-read
> congestion.  Talk to me.
> 

vmscan currently only looks at write congestion because it checks
BLK_RW_ASYNC and all reads will be BLK_RW_SYNC. That is why we are only
looking at write congestion, even though it's approximate, right?

Remember, congestion_wait() used to be about READ and WRITE but is now
about SYNC and ASYNC.

In the patch, there are separate SYNC and ASYNC nr_bdi_congested counters.
wait_iff_congested() is only called with BLK_RW_ASYNC, so we are still
checking write congestion only.

What stupid thing did I miss?

> >  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> >  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> > -	clear_bit(bit, &bdi->state);
> > +	if (test_and_clear_bit(bit, &bdi->state))
> > +		atomic_dec(&nr_bdi_congested[sync]);
> >  	smp_mb__after_clear_bit();
> >  	if (waitqueue_active(wqh))
> >  		wake_up(wqh);
> 
> Worried.  Having a single slow disk getting itself gummed up will
> affect the entire machine!
> 

This can already happen today. In fact, I think it's one of the sources of
desktop stalls during IO from https://bugzilla.kernel.org/show_bug.cgi?id=12309
that you brought up a few weeks back. I was tempted to try to resolve it in
this patch but thought I was reaching far enough with this series as it was.

> There's potential for pathological corner-case problems here.  "When I
> do a big aio read from /dev/MySuckyUsbStick, all my CPUs get pegged in
> page reclaim!".
> 

I thought it might be enough to just do a huge backup to an external USB
drive. I guess I could make it worse by starting one copy per CPU thread,
ideally writing to more than one slow USB device.

> What to do?
> 
> Of course, we'd very much prefer to know whether a queue which we're
> interested in for writeback will block when we try to write to it. 
> Much better than looking at all queues.
> 

And somehow reconciling the queue being written to with the zone the pages
are coming from.

> Important question: which of teh current congestion_wait() call sites
> are causing appreciable stalls?
> 

This can potentially be found out from the tracepoints if they record the
stack trace as well. In this patch, I avoided converting every
congestion_wait() caller and changed only a few of them to
wait_iff_congested(), to limit the scope of what was being changed in this
cycle.

> I think a more accurate way of implementing this is to be smarter with
> the may_write_to_queue()->bdi_write_congested() result.  If a previous
> attempt to write off this LRU encountered congestion then fine, call
> congestion_wait().  But if writeback is not hitting
> may_write_to_queue()->bdi_write_congested() then that is the time to
> avoid calling congestion_wait().
> 

I see the logic. If we assume that there are large amounts of anon page
reclaim while writeback is happening to a USB device, for example, then we
would avoid a stall in that case. We would still encounter a problem if all
the reclaim is from the file LRU and there are a few pages being written to
a USB stick. We would still wait on congestion even though it might not be
necessary, which is why I was counting the number of writeback pages versus
the size of the inactive queue and making a decision based on that.


> In other words, save the bdi_write_congested() result in the zone
> struct in some fashion and inspect that before deciding to synchronize
> behind the underlying device's write rate.  Not hitting a congested
> device for this LRU?  Then don't wait for congested devices.
> 

I think the idea has potential. It will take a fair amount of time to work
out the details though. Testing tends to take a *long* time even with
automation.
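
If I have understood the suggestion, it would be something along these lines
(the zone field is invented here purely to illustrate):

	/* when writeback off this LRU hits a congested BDI, remember it */
	if (bdi_write_congested(bdi))
		zone->reclaim_congested = 1;	/* hypothetical field */

	/* later, only throttle if this zone recently saw congestion */
	if (zone->reclaim_congested)
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	else
		cond_resched();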

> > @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> >  }
> >  EXPORT_SYMBOL(congestion_wait);
> >  
> > +/**
> > + * congestion_wait - wait for a backing_dev to become uncongested
> > + * @zone: A zone to consider the number of being being written back from
> 
> That comments needs help.
> 

Indeed it does. It currently stands as

/**
 * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
 * @zone: A zone to consider the number of pages being written back from
 * @sync: SYNC or ASYNC IO
 * @timeout: timeout in jiffies
 *
 * In the event of a congested backing_dev (any backing_dev) or a given zone
 * having a large number of pages in writeback, this waits for up to @timeout
 * jiffies for either a BDI to exit congestion on the given @sync queue or a
 * write to complete.
 *
 * If there is no congestion and few pending writes, then cond_resched()
 * is called to yield the processor if necessary but otherwise does not
 * sleep.
 *
 * The return value is 0 if the sleep was for the full timeout. Otherwise,
 * it is the number of jiffies that were still remaining when the function
 * returned. return_value == timeout implies the function did not sleep.
 */

> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion.'
> 
> write or read congestion!!
> 

I just know I'm going to spot where we wait on read congestion the
second I push send and make a fool of myself :(

> >  If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no sigificant writeback or congestion, there is no point
> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
> 
> Document the return value?
> 

What's the fun in that? :)

I included a blurb on the return value in the updated comment above.

> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > +	long ret;
> > +	unsigned long start = jiffies;
> > +	DEFINE_WAIT(wait);
> > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > +	/*
> > +	 * If there is no congestion, check the amount of writeback. If there
> > +	 * is no significant writeback and no congestion, just cond_resched
> > +	 */
> > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > +		unsigned long inactive, writeback;
> > +
> > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > +		/*
> > +		 * If less than half the inactive list is being written back,
> > +		 * reclaim might as well continue
> > +		 */
> > +		if (writeback < inactive / 2) {
> 
> This is all getting seriously inaccurate :(
> 

We are already woefully inaccurate.

The intention here is to catch the case where we are not congested but there
is sufficient writeback in the zone to make it worthwhile waiting for some of
it to complete. Minimally, we have a reasonable expectation that if writeback
is happening, we'll be woken up if we go to sleep on the congestion queue.

i.e. it's not great, but it's better than what we have at the moment, as can
be seen from the micro-mapped-file-stream results in the leader. Time to
completion is reduced and sleep time is reduced, while the ratio of scans to
writes does not get worse.


> > +			cond_resched();
> > +
> > +			/* In case we scheduled, work out time remaining */
> > +			ret = timeout - (jiffies - start);
> > +			if (ret < 0)
> > +				ret = 0;
> > +
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	/* Sleep until uncongested or a write happens */
> > +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > +	ret = io_schedule_timeout(timeout);
> > +	finish_wait(wqh, &wait);
> > +
> > +out:
> > +	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> > +					jiffies_to_usecs(jiffies - start));
> 
> Does this tracepoint tell us how often wait_iff_congested() is sleeping
> versus how often it is returning immediately?
> 

Yes. Taking an example from the leader

FTrace Reclaim Statistics: congestion_wait
                                    traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5     nodirect-v1r5
Direct number congest     waited               499          0          0          0
Direct time   congest     waited           22700ms        0ms        0ms        0ms
Direct full   congest     waited               421          0          0          0
Direct number conditional waited                 0       1214       1242       1290
Direct time   conditional waited               0ms        4ms        0ms        0ms
Direct full   conditional waited               421          0          0          0
KSwapd number congest     waited               257        103         94        104
KSwapd time   congest     waited           22116ms     7344ms     7476ms     7528ms
KSwapd full   congest     waited               203         57         59         56
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited               203         57         59         56

A "full congest waited" is a count of the number of times we slept for
more than the timeout. The trace-only kernel reports that direct reclaimers
slept the full timeout 421 times and kswapd slept for the full timeout 203
times. The patch (nocongest-v1r5) reduces these counts significantly.

The report is from a script that reads ftrace information. It's similar in
operation to what's in Documentation/trace/postprocess/.

> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(wait_iff_congested);
> >
> > ...
> >
> > @@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >  			sc->may_writepage = 1;
> >  		}
> >  
> > -		/* Take a nap, wait for some writeback to complete */
> > +		/* Take a nap if congested, wait for some writeback */
> >  		if (!sc->hibernation_mode && sc->nr_scanned &&
> > -		    priority < DEF_PRIORITY - 2)
> > -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +		    priority < DEF_PRIORITY - 2) {
> > +			struct zone *active_zone = NULL;
> > +			unsigned long max_writeback = 0;
> > +			for_each_zone_zonelist(zone, z, zonelist,
> > +					gfp_zone(sc->gfp_mask)) {
> > +				unsigned long writeback;
> > +
> > +				/* Initialise for first zone */
> > +				if (active_zone == NULL)
> > +					active_zone = zone;
> > +
> > +				writeback = zone_page_state(zone, NR_WRITEBACK);
> > +				if (writeback > max_writeback) {
> > +					max_writeback = writeback;
> > +					active_zone = zone;
> > +				}
> > +			}
> > +
> > +			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> > +		}
> 
> Again, we would benefit from more accuracy here.  In my above
> suggestion I'm assuming that the (congestion) result of the most recent
> attempt to perform writeback is a predictor of the next attempt.
> 

I suspect you are on to something, but it will take me some time to work out
the details and to build a setup involving a few USB sticks to trigger that
test case. What are the chances of starting with this heuristic (in release
v2 or v3 of this series), given that it improves on what we have today, and
then trying out different ideas for how and when to call wait_iff_congested()
in the next cycle?

> Doing that on a kernel-wide basis would be rather inaccurate on large
> machines in some scenarios.  Storing the state info in the zone would
> help.
> 

We are already depending on kernel-wide inaccuracy. The series aims to chip
away at some of the obvious badness to start with.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
@ 2010-09-09 10:43       ` Mel Gorman
  0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 10:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig

On Wed, Sep 08, 2010 at 02:23:30PM -0700, Andrew Morton wrote:
> On Mon,  6 Sep 2010 11:47:26 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> > 
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> > 
> 
> The patch series looks generally good.  Would like to see some testing
> results ;) 

They are all in the leader. They are all based on a test-suite that I'm
bound to stick a README on and release one of these days :/

> A few touchups are planned so I'll await v2.
> 

Good plan.

> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> >  		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> >  	};
> > +static atomic_t nr_bdi_congested[2];
> 
> Let's remember that a queue can get congested because of reads as well
> as writes.  It's very rare for this to happen - it needs either a
> zillion read()ing threads or someone going berzerk with O_DIRECT aio,
> etc.  Probably it doesn't matter much, but for memory reclaim purposes
> read-congestion is somewhat irrelevant and a bit of thought is warranted.
> 

This is an interesting point and would be well worth digging into if
we got a new bug report about stalls under heavy reads.

> vmscan currently only looks at *write* congestion, but in this patch
> you secretly change that logic to newly look at write-or-read
> congestion.  Talk to me.
> 

vmscan currently only looks at write congestion because it's checking the
BLK_RW_ASYNC and all reads will be BLK_RW_SYNC. Currently, this is why we
are only looking at write congestion even though it's approximate, right?

Remember, congestion_wait used to be about READ and WRITE but now it's about
SYNC and ASYNC.

In the patch, there are separate SYNC and ASYNC nr_bdi_congested counters.
wait_iff_congested() is only called for BLK_RW_ASYNC so we are still checking
write congestion only.
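
As a rough illustration of the indexing (a self-contained userspace model, not
the kernel code; the set side is assumed to be symmetric to the
clear_bdi_congested() hunk below):

/* Toy model of the nr_bdi_congested[] accounting.  The BLK_RW_ASYNC/BLK_RW_SYNC
 * values and the per-BDI "congested" flags are stand-ins; the point is only
 * that set/clear keep a per-queue-type count that wait_iff_congested() can
 * read without scanning every BDI.
 */
#include <stdio.h>

enum { BLK_RW_ASYNC = 0, BLK_RW_SYNC = 1 };	/* model values only */

static int nr_bdi_congested[2];			/* atomic_t in the kernel */

struct bdi_model { int congested[2]; };		/* stands in for the BDI state bits */

static void set_congested(struct bdi_model *bdi, int sync)
{
	if (!bdi->congested[sync]) {		/* kernel: test_and_set_bit() */
		bdi->congested[sync] = 1;
		nr_bdi_congested[sync]++;
	}
}

static void clear_congested(struct bdi_model *bdi, int sync)
{
	if (bdi->congested[sync]) {		/* kernel: test_and_clear_bit() */
		bdi->congested[sync] = 0;
		nr_bdi_congested[sync]--;
	}
}

int main(void)
{
	struct bdi_model usb = { {0, 0} }, disk = { {0, 0} };

	set_congested(&usb, BLK_RW_ASYNC);	/* write congestion */
	set_congested(&disk, BLK_RW_SYNC);	/* sync congestion (reads), never consulted below */

	/* vmscan only ever asks about the ASYNC (write) queue */
	printf("async congested BDIs: %d\n", nr_bdi_congested[BLK_RW_ASYNC]);
	clear_congested(&usb, BLK_RW_ASYNC);
	printf("async congested BDIs: %d\n", nr_bdi_congested[BLK_RW_ASYNC]);
	return 0;
}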

What stupid thing did I miss?

> >  void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> >  	wait_queue_head_t *wqh = &congestion_wqh[sync];
> >  
> >  	bit = sync ? BDI_sync_congested : BDI_async_congested;
> > -	clear_bit(bit, &bdi->state);
> > +	if (test_and_clear_bit(bit, &bdi->state))
> > +		atomic_dec(&nr_bdi_congested[sync]);
> >  	smp_mb__after_clear_bit();
> >  	if (waitqueue_active(wqh))
> >  		wake_up(wqh);
> 
> Worried.  Having a single slow disk getting itself gummed up will
> affect the entire machine!
> 

This can already happen today. In fact, I think it's one of the sources of
desktop stalls during IO from https://bugzilla.kernel.org/show_bug.cgi?id=12309
that you brought up a few weeks back. I was tempted to try to resolve it in
this patch but thought I was reaching far enough with this series as it was.

> There's potential for pathological corner-case problems here.  "When I
> do a big aio read from /dev/MySuckyUsbStick, all my CPUs get pegged in
> page reclaim!".
> 

I thought it might be enough to just do a huge backup to an external USB
drive. I guess I could make it worse by starting up one copy per CPU
thread, preferably writing to more than one slow USB device.

> What to do?
> 
> Of course, we'd very much prefer to know whether a queue which we're
> interested in for writeback will block when we try to write to it. 
> Much better than looking at all queues.
> 

And somehow reconciling the queue being written to with the zone the pages
are coming from.

> Important question: which of the current congestion_wait() call sites
> are causing appreciable stalls?
> 

This potentially can be found out from the tracepoints if they record
the stack trace as well. In this patch, I avoided changing all callers of
congestion_wait() and converted only a few of them to wait_iff_congested(),
to limit the scope of what was being changed in this cycle.

> I think a more accurate way of implementing this is to be smarter with
> the may_write_to_queue()->bdi_write_congested() result.  If a previous
> attempt to write off this LRU encountered congestion then fine, call
> congestion_wait().  But if writeback is not hitting
> may_write_to_queue()->bdi_write_congested() then that is the time to
> avoid calling congestion_wait().
> 

I see the logic. If we assume that there are large amounts of anon page
reclaim while writeback is happening to a USB device, for example, we would
avoid a stall in this case. It would still encounter a problem if all the
reclaim is from the file LRU and there are a few pages being written to a
USB stick. We'll still wait on congestion even though it might not have been
necessary, which is why I was counting the number of writeback pages versus
the size of the inactive queue and making a decision based on that.


> In other words, save the bdi_write_congested() result in the zone
> struct in some fashion and inspect that before deciding to synchronize
> behind the underlying device's write rate.  Not hitting a congested
> device for this LRU?  Then don't wait for congested devices.
> 

I think the idea has potential. It will take a fair amount of time to work
out the details though. Testing tends to take a *long* time even with
automation.
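
One possible shape of that suggestion, as a toy userspace model (the field and
helper names here are invented for illustration, not a proposed API):

/* Remember, per zone, whether the most recent attempt to write back from it
 * hit a congested BDI, and only throttle when it did.  Purely illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

struct zone_model {
	bool recent_write_congested;	/* would live in struct zone */
};

/* called from the pageout path instead of throwing the result away */
static void note_pageout_result(struct zone_model *zone, bool bdi_congested)
{
	zone->recent_write_congested = bdi_congested;
}

/* consulted where do_try_to_free_pages() currently calls congestion_wait() */
static bool should_throttle(const struct zone_model *zone)
{
	return zone->recent_write_congested;
}

int main(void)
{
	struct zone_model z = { false };

	printf("throttle? %d\n", should_throttle(&z));	/* 0: keep reclaiming */
	note_pageout_result(&z, true);			/* writeback hit congestion */
	printf("throttle? %d\n", should_throttle(&z));	/* 1: wait for the device */
	return 0;
}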

> > @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> >  }
> >  EXPORT_SYMBOL(congestion_wait);
> >  
> > +/**
> > + * congestion_wait - wait for a backing_dev to become uncongested
> > + * @zone: A zone to consider the number of being being written back from
> 
> That comment needs help.
> 

Indeed it does. It currently stands as

/**
 * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
 * @zone: A zone to consider the number of pages being written back from
 * @sync: SYNC or ASYNC IO
 * @timeout: timeout in jiffies
 *
 * In the event of a congested backing_dev (any backing_dev) or a given zone
 * having a large number of pages in writeback, this waits for up to @timeout
 * jiffies for either a BDI to exit congestion of the given @sync queue or
 * a write to complete.
 *
 * If there is no congestion and few pending writes, then cond_resched()
 * is called to yield the processor if necessary but otherwise does not
 * sleep.
 *
 * The return value is 0 if the sleep is for the full timeout. Otherwise,
 * it is the number of jiffies that were still remaining when the function
 * returned. return_value == timeout implies the function did not sleep.
 */
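
For illustration, here is how a (hypothetical) caller could use that return
value to tell an immediate return from a full sleep; a userspace sketch of the
documented contract only, not code from the series:

/* classify() is invented for this example.  "timeout" is what was requested,
 * "ret" is what wait_iff_congested() returned.
 */
#include <stdio.h>

static const char *classify(long timeout, long ret)
{
	if (ret == timeout)
		return "did not sleep (no congestion, little writeback)";
	if (ret == 0)
		return "slept for the full timeout";
	return "woken early (congestion cleared or a write completed)";
}

int main(void)
{
	long timeout = 10;	/* e.g. HZ/10 with HZ=100, as in the do_try_to_free_pages() hunk further down */

	printf("%s\n", classify(timeout, timeout));
	printf("%s\n", classify(timeout, 0));
	printf("%s\n", classify(timeout, 3));
	return 0;
}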

> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion.'
> 
> write or read congestion!!
> 

I just know I'm going to spot where we wait on read congestion the
second I push send and make a fool of myself :(

> >  If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no significant writeback or congestion, there is no point
> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
> 
> Document the return value?
> 

What's the fun in that? :)

I included a blurb on the return value in the updated comment above.

> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > +	long ret;
> > +	unsigned long start = jiffies;
> > +	DEFINE_WAIT(wait);
> > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > +	/*
> > +	 * If there is no congestion, check the amount of writeback. If there
> > +	 * is no significant writeback and no congestion, just cond_resched
> > +	 */
> > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > +		unsigned long inactive, writeback;
> > +
> > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > +		/*
> > +		 * If less than half the inactive list is being written back,
> > +		 * reclaim might as well continue
> > +		 */
> > +		if (writeback < inactive / 2) {
> 
> This is all getting seriously inaccurate :(
> 

We are already woefully inaccurate.

The intention here is to catch the case where we are not congested but there
is sufficient writeback in the zone to make it worthwhile waiting for
some of it to complete. Minimally, we have a reasonable expectation that
if writeback is happening we'll be woken up if we go to sleep on
the congestion queue.

i.e. it's not great, but it's better than what we have at the moment, as
can be seen from the micro-mapped-file-stream results in the leader. Time to
completion is reduced and sleep time is reduced, while the ratio of scans/writes
does not get worse.
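
As a worked example of that check (the page counts are made-up numbers, and
this is a userspace model of the logic only):

/* Mirror of the "no congested BDIs && writeback < inactive / 2" decision. */
#include <stdio.h>

static int should_sleep(unsigned long inactive, unsigned long writeback,
			unsigned long congested_bdis)
{
	if (congested_bdis == 0 && writeback < inactive / 2)
		return 0;	/* just cond_resched() and keep reclaiming */
	return 1;		/* sleep on the congestion waitqueue */
}

int main(void)
{
	/* mostly clean page cache: 100000 inactive pages, 1000 in writeback */
	printf("%d\n", should_sleep(100000, 1000, 0));	/* 0: do not sleep */
	/* heavy writeback: 100000 inactive pages, 60000 in writeback */
	printf("%d\n", should_sleep(100000, 60000, 0));	/* 1: sleep */
	/* any congested BDI means we take the wait regardless of writeback */
	printf("%d\n", should_sleep(100000, 1000, 1));	/* 1: sleep */
	return 0;
}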


> > +			cond_resched();
> > +
> > +			/* In case we scheduled, work out time remaining */
> > +			ret = timeout - (jiffies - start);
> > +			if (ret < 0)
> > +				ret = 0;
> > +
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	/* Sleep until uncongested or a write happens */
> > +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > +	ret = io_schedule_timeout(timeout);
> > +	finish_wait(wqh, &wait);
> > +
> > +out:
> > +	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> > +					jiffies_to_usecs(jiffies - start));
> 
> Does this tracepoint tell us how often wait_iff_congested() is sleeping
> versus how often it is returning immediately?
> 

Yes. Taking an example from the leader

FTrace Reclaim Statistics: congestion_wait
                                    traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5     nodirect-v1r5
Direct number congest     waited               499          0          0          0
Direct time   congest     waited           22700ms        0ms        0ms        0ms
Direct full   congest     waited               421          0          0          0
Direct number conditional waited                 0       1214       1242       1290
Direct time   conditional waited               0ms        4ms        0ms        0ms
Direct full   conditional waited               421          0          0          0
KSwapd number congest     waited               257        103         94        104
KSwapd time   congest     waited           22116ms     7344ms     7476ms     7528ms
KSwapd full   congest     waited               203         57         59         56
KSwapd number conditional waited                 0          0          0          0
KSwapd time   conditional waited               0ms        0ms        0ms        0ms
KSwapd full   conditional waited               203         57         59         56

A "full congest waited" is a count of the number of times we slept for
more than the timeout. The trace-only kernel reports that direct reclaimers
slept the full timeout 421 times and kswapd slept for the full timeout 203
times. The patch (nocongest-v1r5) reduces these counts significantly.

The report is from a script that reads ftrace information. It's similar in
operation to what's in Documentation/trace/postprocess/.
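
As a minimal sketch of the tally the script keeps, assuming only that each
trace event carries the two usec values passed to the tracepoint above (the
sample numbers below are invented; the real script parses them from the
ftrace log):

/* Toy version of the "number/time/full ... waited" accounting. */
#include <stdio.h>

struct tally { unsigned long number, usecs, full; };

static void account(struct tally *t, unsigned long usec_timeout,
		    unsigned long usec_delayed)
{
	t->number++;			/* "number ... waited" */
	t->usecs += usec_delayed;	/* "time ... waited" */
	if (usec_delayed >= usec_timeout)
		t->full++;		/* "full ... waited" */
}

int main(void)
{
	struct tally cond = { 0, 0, 0 };

	account(&cond, 100000, 0);	/* returned via cond_resched() */
	account(&cond, 100000, 100000);	/* slept the full HZ/10 */
	account(&cond, 100000, 40000);	/* woken early */

	printf("number %lu time %lums full %lu\n",
	       cond.number, cond.usecs / 1000, cond.full);
	return 0;
}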

> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(wait_iff_congested);
> >
> > ...
> >
> > @@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >  			sc->may_writepage = 1;
> >  		}
> >  
> > -		/* Take a nap, wait for some writeback to complete */
> > +		/* Take a nap if congested, wait for some writeback */
> >  		if (!sc->hibernation_mode && sc->nr_scanned &&
> > -		    priority < DEF_PRIORITY - 2)
> > -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +		    priority < DEF_PRIORITY - 2) {
> > +			struct zone *active_zone = NULL;
> > +			unsigned long max_writeback = 0;
> > +			for_each_zone_zonelist(zone, z, zonelist,
> > +					gfp_zone(sc->gfp_mask)) {
> > +				unsigned long writeback;
> > +
> > +				/* Initialise for first zone */
> > +				if (active_zone == NULL)
> > +					active_zone = zone;
> > +
> > +				writeback = zone_page_state(zone, NR_WRITEBACK);
> > +				if (writeback > max_writeback) {
> > +					max_writeback = writeback;
> > +					active_zone = zone;
> > +				}
> > +			}
> > +
> > +			wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> > +		}
> 
> Again, we would benefit from more accuracy here.  In my above
> suggestion I'm assuming that the (congestion) result of the most recent
> attempt to perform writeback is a predictor of the next attempt.
> 

I suspect you are on to something, but it will take me some time to work out
the details and to build a setup involving a few USB sticks to trigger that
test case. What are the chances of starting with this heuristic (in
release v2 or v3 of this series), given that it improves on what we have today,
and then trying out different ideas for how and when to call wait_iff_congested()
in the next cycle?

> Doing that on a kernel-wide basis would be rather inaccurate on large
> machines in some scenarios.  Storing the state info in the zone would
> help.
> 

We are already depending on kernel-wide inaccuracy. The series aims to chip
away at some of the obvious badness to start with.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-09  9:22           ` Mel Gorman
@ 2010-09-10 10:25             ` KOSAKI Motohiro
  -1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 10:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel,
	Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

> On Thu, Sep 09, 2010 at 01:13:22PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, 9 Sep 2010 12:04:48 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Mon,  6 Sep 2010 11:47:28 +0100
> > > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > > 
> > > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > 
> > > > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > > > pages even if page is locked. This patch uses lock_page() instead of
> > > > > trylock_page() in this case.
> > > > > 
> > > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > 
> > > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > 
> > > Ah......but can't this change cause dead lock ??
> > 
> > Yes, this patch is purely crappy. please drop. I guess I was poisoned
> > by poisonous mushroom of Mario Bros.
> > 
> 
> Lets be clear on what the exact dead lock conditions are. The ones I had
> thought about when I felt this patch was ok were;
> 
> o We are not holding the LRU lock (or any lock, we just called cond_resched())
> o We do not have another page locked because we cannot lock multiple pages
> o Kswapd will never be in LUMPY_MODE_SYNC so it is not getting blocked
> o lock_page() itself is not allocating anything that we could recurse on

True, all.

> 
> One potential dead lock would be if the direct reclaimer held a page
> lock and ended up here but is that situation even allowed?

For example:

__do_fault()
{
(snip)
        if (unlikely(!(ret & VM_FAULT_LOCKED)))
                lock_page(vmf.page);
        else
                VM_BUG_ON(!PageLocked(vmf.page));

        /*
         * Should we do an early C-O-W break?
         */
        page = vmf.page;
        if (flags & FAULT_FLAG_WRITE) {
                if (!(vma->vm_flags & VM_SHARED)) {
                        anon = 1;
                        if (unlikely(anon_vma_prepare(vma))) {
                                ret = VM_FAULT_OOM;
                                goto out;
                        }
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
                                                vma, address);


Afaik, the detailed rules are:

o kswapd can call lock_page() because it never takes a page lock outside vmscan
o if trylock_page() succeeds, we can call lock_page_nosync() on that page after unlocking it,
  because the task is then guaranteed to hold no page lock.
o otherwise, a direct reclaimer can't call lock_page(); the task may already hold a page lock.

I think.
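
To make the deadlock concrete, here is a userspace sketch of the ordering with
pthread mutexes standing in for two page locks (purely illustrative; run it
and it never finishes):

/* Task A models a faulting task that holds page X locked, as in __do_fault()
 * above, and then enters direct reclaim where lock_page() on another page
 * would sleep.  Task B does the same with the pages swapped - classic ABBA.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t page_x = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t page_y = PTHREAD_MUTEX_INITIALIZER;

static void *task_a(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&page_x);	/* lock_page(vmf.page) in __do_fault() */
	sleep(1);			/* alloc_page_vma() -> direct reclaim */
	pthread_mutex_lock(&page_y);	/* lock_page() in sync lumpy reclaim */
	pthread_mutex_unlock(&page_y);
	pthread_mutex_unlock(&page_x);
	return NULL;
}

static void *task_b(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&page_y);	/* another fault, a different page */
	sleep(1);			/* its allocation also enters reclaim */
	pthread_mutex_lock(&page_x);	/* ... and happens to pick page X */
	pthread_mutex_unlock(&page_x);
	pthread_mutex_unlock(&page_y);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, task_a, NULL);
	pthread_create(&b, NULL, task_b, NULL);
	pthread_join(a, NULL);		/* never returns */
	pthread_join(b, NULL);
	puts("not reached");
	return 0;
}

kswapd is not at risk of this because, as the first rule says, it never holds
a page lock when it reaches shrink_page_list().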


>  I did not
> think of an obvious example of when this would happen. Similarly,
> deadlock situations with mmap_sem shouldn't happen unless multiple page
> locks are being taken.
> 
> (prepares to feel foolish)
> 
> What did I miss?








* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-10 10:25             ` KOSAKI Motohiro
  (?)
@ 2010-09-10 10:33               ` KOSAKI Motohiro
  -1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 10:33 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Mel Gorman, KAMEZAWA Hiroyuki, linux-mm,
	linux-fsdevel, Linux Kernel List, Rik van Riel, Johannes Weiner,
	Minchan Kim, Wu Fengguang, Andrea Arcangeli, Dave Chinner,
	Chris Mason, Christoph Hellwig, Andrew Morton

> Afaik, the detailed rules are:
> 
> o kswapd can call lock_page() because it never takes a page lock outside vmscan

s/lock_page()/lock_page_nosync()/



> o if trylock_page() succeeds, we can call lock_page_nosync() on that page after unlocking it,
>   because the task is then guaranteed to hold no page lock.
> o otherwise, a direct reclaimer can't call lock_page(); the task may already hold a page lock.
> 
> I think.
> 
> 
> >  I did not
> > think of an obvious example of when this would happen. Similarly,
> > deadlock situations with mmap_sem shouldn't happen unless multiple page
> > locks are being taken.
> > 
> > (prepares to feel foolish)
> > 
> > What did I miss?
> 
> 
> 
> 
> 







* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-09  8:54           ` Mel Gorman
@ 2010-09-12 15:37             ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-12 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Thu, Sep 09, 2010 at 09:54:36AM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 11:52:45PM +0900, Minchan Kim wrote:
> > On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> > > On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > > > + * @zone: A zone to consider the number of being being written back from
> > > > > + * @sync: SYNC or ASYNC IO
> > > > > + * @timeout: timeout in jiffies
> > > > > + *
> > > > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > > > + * write congestion.  If no backing_devs are congested then the number of
> > > > > + * writeback pages in the zone are checked and compared to the inactive
> > > > > + * list. If there is no sigificant writeback or congestion, there is no point
> > > >                                                 and 
> > > > 
> > > 
> > > Why and? "or" makes sense because we avoid sleeping on either condition.
> > 
> > if (nr_bdi_congested[sync]) == 0) {
> >         if (writeback < inactive / 2) {
> >                 cond_resched();
> >                 ..
> >                 goto out
> >         }
> > }
> > 
> > for avoiding sleeping, above two condition should meet. 
> 
> This is a terrible comment that is badly written. Is this any clearer?
> 
> /**
>  * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
>  * @zone: A zone to consider the number of pages being written back from
>  * @sync: SYNC or ASYNC IO
>  * @timeout: timeout in jiffies
>  *
>  * In the event of a congested backing_dev (any backing_dev) or a given @zone
>  * having a large number of pages in writeback, this waits for up to @timeout
>  * jiffies for either a BDI to exit congestion or a write to complete.
>  *
>  * If there is no congestion and few pending writes, then cond_resched()
>  * is called to yield the processor if necessary but otherwise does not
>  * sleep.
>  */

Looks good.

> 
> > > 
> > > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > > + * consumed its CPU quota.
> > > > > + */
> > > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > > +{
> > > > > +	long ret;
> > > > > +	unsigned long start = jiffies;
> > > > > +	DEFINE_WAIT(wait);
> > > > > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > > +
> > > > > +	/*
> > > > > +	 * If there is no congestion, check the amount of writeback. If there
> > > > > +	 * is no significant writeback and no congestion, just cond_resched
> > > > > +	 */
> > > > > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > > +		unsigned long inactive, writeback;
> > > > > +
> > > > > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > > > > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > > +
> > > > > +		/*
> > > > > +		 * If less than half the inactive list is being written back,
> > > > > +		 * reclaim might as well continue
> > > > > +		 */
> > > > > +		if (writeback < inactive / 2) {
> > > > 
> > > > I am not sure this is best.
> > > > 
> > > 
> > > I'm not saying it is. The objective is to identify a situation where
> > > sleeping until the next write or congestion clears is pointless. We have
> > > already identified that we are not congested so the question is "are we
> > > writing a lot at the moment?". The assumption is that if there is a lot
> > > of writing going on, we might as well sleep until one completes rather
> > > than reclaiming more.
> > > 
> > > This is the first effort at identifying pointless sleeps. Better ones
> > > might be identified in the future but that shouldn't stop us making a
> > > semi-sensible decision now.
> > 
> > nr_bdi_congested is no problem since we have used it for a long time.
> > But you added new rule about writeback. 
> > 
> 
> Yes, I'm trying to add a new rule about throttling in the page allocator
> and from vmscan. As you can see from the results in the leader, we are
> currently sleeping more than we need to.

I can see the results for avoiding congestion_wait but can't find anything about the
(writeback < inactive / 2) heuristic result.

> 
> > Why I pointed out is that you added new rule and I hope let others know
> > this change since they have a good idea or any opinions. 
> > I think it's a one of roles as reviewer.
> > 
> 
> Of course.
> 
> > > 
> > > > 1. Without considering various speed class storage, could we fix it as half of inactive?
> > > 
> > > We don't really have a good means of identifying speed classes of
> > > storage. Worse, we are considering on a zone-basis here, not a BDI
> > > basis. The pages being written back in the zone could be backed by
> > > anything so we cannot make decisions based on BDI speed.
> > 
> > True. So it's why I have below question.
> > As you said, we don't have enough information in vmscan.
> > So I am not sure how effective such semi-sensible decision is. 
> > 
> 
> What additional metrics would you apply than the ones I used in the
> leader mail?

The effectiveness of the (writeback < inactive / 2) heuristic.

> 
> > I think best is to throttle in page-writeback well. 
> 
> I do not think there is a problem as such in page writeback throttling.
> The problem is that we are going to sleep without any congestion or without
> writes in progress. We sleep for a full timeout in this case for no reason
> and this is what I'm trying to avoid.

Yes. I agree.
Just my concern is the heuristic accuracy I mentioned.
In your previous version, you didn't add the heuristic.
But you suddenly added it in this version.
So I think you have some reason for adding it in this version.
Please write down the rationale and data if you have them.

-- 
Kind regards,
Minchan Kim



* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-09  9:32       ` Mel Gorman
@ 2010-09-13  0:53         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-13  0:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Thu, 9 Sep 2010 10:32:11 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> On Thu, Sep 09, 2010 at 12:22:28PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon,  6 Sep 2010 11:47:33 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> > 
> > > There are a number of cases where pages get cleaned but two of concern
> > > to this patch are;
> > >   o When dirtying pages, processes may be throttled to clean pages if
> > >     dirty_ratio is not met.
> > >   o Pages belonging to inodes dirtied longer than
> > >     dirty_writeback_centisecs get cleaned.
> > > 
> > > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > > pages are being dirtied slowly so that neither the throttling or a flusher
> > > thread waking periodically cleans them.
> > > 
> > > Background flush is already cleaning old or expired inodes first but the
> > > expire time is too far in the future at the time of page reclaim. To mitigate
> > > future problems, this patch wakes flusher threads to clean 4M of data -
> > > an amount that should be manageable without causing congestion in many cases.
> > > 
> > > Ideally, the background flushers would only be cleaning pages belonging
> > > to the zone being scanned but it's not clear if this would be of benefit
> > > (less IO) or not (potentially less efficient IO if an inode is scattered
> > > across multiple zones).
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > >  mm/vmscan.c |   32 ++++++++++++++++++++++++++++++--
> > >  1 files changed, 30 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 408c101..33d27a4 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > >  /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > >  #define MAX_SWAP_CLEAN_WAIT 50
> > >  
> > > +/*
> > > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > > + * a maximum of 4M of data.
> > > + */
> > > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > +{
> > > +	return laptop_mode ? 0 :
> > > +			min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > +}
> > > +
> > >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > >  						  struct scan_control *sc)
> > >  {
> > > @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > >   */
> > >  static unsigned long shrink_page_list(struct list_head *page_list,
> > >  					struct scan_control *sc,
> > > +					int file,
> > >  					unsigned long *nr_still_dirty)
> > >  {
> > >  	LIST_HEAD(ret_pages);
> > >  	LIST_HEAD(free_pages);
> > >  	int pgactivate = 0;
> > >  	unsigned long nr_dirty = 0;
> > > +	unsigned long nr_dirty_seen = 0;
> > >  	unsigned long nr_reclaimed = 0;
> > >  
> > >  	cond_resched();
> > > @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  		}
> > >  
> > >  		if (PageDirty(page)) {
> > > +			nr_dirty_seen++;
> > > +
> > >  			/*
> > >  			 * Only kswapd can writeback filesystem pages to
> > >  			 * avoid risk of stack overflow
> > > @@ -923,6 +939,18 @@ keep_lumpy:
> > >  
> > >  	list_splice(&ret_pages, page_list);
> > >  
> > > +	/*
> > > +	 * If reclaim is encountering dirty pages, it may be because
> > > +	 * dirty pages are reaching the end of the LRU even though the
> > > +	 * dirty_ratio may be satisified. In this case, wake flusher
> > > +	 * threads to pro-actively clean up to a maximum of
> > > +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > > +	 * !may_writepage indicates that this is a direct reclaimer in
> > > +	 * laptop mode avoiding disk spin-ups
> > > +	 */
> > > +	if (file && nr_dirty_seen && sc->may_writepage)
> > > +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > > +
> > 
> > Thank you. Ok, I'll check what happens in memcg.
> > 
> 
> Thanks
> 
> > Can I add
> > 	if (sc->memcg) {
> > 		memcg_check_flusher_wakeup()
> > 	}
> > or some here ?
> > 
> 
> It seems reasonable.
> 
> > Hm, maybe memcg should wake up flusher at starting try_to_free_memory_cgroup_pages().
> > 
> 
> I'm afraid I cannot make a judgement call on which is the best as I am
> not very familiar with how cgroups behave in comparison to normal
> reclaim. There could easily be a follow-on patch though that was cgroup
> specific?
> 

Yes, I'd like to make the patches when this series is merged. It's not difficult,
and it will make clear how memcg and the flusher work together, which should help
in getting good reviews.

Thanks,
-Kame
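
For reference, a worked example of the nr_writeback_pages() cap quoted above,
assuming 4K pages (PAGE_SHIFT 12) and the usual SWAP_CLUSTER_MAX of 32; the
macros are copied from the hunk, min_ul() is a stand-in for the kernel's min():

#include <stdio.h>

#define PAGE_SHIFT	12				/* assumption: 4K pages */
#define SWAP_CLUSTER_MAX 32UL				/* assumption: default value */
#define MAX_WRITEBACK	(4194304UL >> PAGE_SHIFT)	/* 1024 pages == 4M */
#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)	/* 32 */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long nr_writeback_pages(unsigned long nr_dirty, int laptop_mode)
{
	return laptop_mode ? 0 :
			min_ul(MAX_WRITEBACK, nr_dirty * WRITEBACK_FACTOR);
}

int main(void)
{
	/* a few dirty pages at the tail of the LRU ask for 32 times that much */
	printf("%lu\n", nr_writeback_pages(10, 0));	/* 320 pages (~1.25M) */
	/* many dirty pages still only ever ask for the 4M cap */
	printf("%lu\n", nr_writeback_pages(5000, 0));	/* 1024 pages (4M) */
	/* laptop_mode makes the helper return 0 */
	printf("%lu\n", nr_writeback_pages(10, 1));	/* 0 */
	return 0;
}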




* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-12 15:37             ` Minchan Kim
@ 2010-09-13  8:55               ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13  8:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 12:37:44AM +0900, Minchan Kim wrote:
> > > > > > <SNIP>
> > > > > >
> > > > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > > > + * consumed its CPU quota.
> > > > > > + */
> > > > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > > > +{
> > > > > > +	long ret;
> > > > > > +	unsigned long start = jiffies;
> > > > > > +	DEFINE_WAIT(wait);
> > > > > > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If there is no congestion, check the amount of writeback. If there
> > > > > > +	 * is no significant writeback and no congestion, just cond_resched
> > > > > > +	 */
> > > > > > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > > > +		unsigned long inactive, writeback;
> > > > > > +
> > > > > > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > > +				zone_page_state(zone, NR_INACTIVE_ANON);
> > > > > > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * If less than half the inactive list is being written back,
> > > > > > +		 * reclaim might as well continue
> > > > > > +		 */
> > > > > > +		if (writeback < inactive / 2) {
> > > > > 
> > > > > I am not sure this is best.
> > > > > 
> > > > 
> > > > I'm not saying it is. The objective is to identify a situation where
> > > > sleeping until the next write or congestion clears is pointless. We have
> > > > already identified that we are not congested so the question is "are we
> > > > writing a lot at the moment?". The assumption is that if there is a lot
> > > > of writing going on, we might as well sleep until one completes rather
> > > > than reclaiming more.
> > > > 
> > > > This is the first effort at identifying pointless sleeps. Better ones
> > > > might be identified in the future but that shouldn't stop us making a
> > > > semi-sensible decision now.
> > > 
> > > nr_bdi_congested is no problem since we have used it for a long time.
> > > But you added new rule about writeback. 
> > > 
> > 
> > Yes, I'm trying to add a new rule about throttling in the page allocator
> > and from vmscan. As you can see from the results in the leader, we are
> > currently sleeping more than we need to.
> 
> I can see the results for avoiding congestion_wait but can't find anything about the
> (writeback < inactive / 2) heuristic result.
> 

See the leader and each of the report sections entitled
"FTrace Reclaim Statistics: congestion_wait". They provide a measure of
how sleep times are affected.

"congest waited" are waits due to calling congestion_wait. "conditional waited"
are those related to wait_iff_congested(). As you will see from the reports,
sleep times are reduced overall while callers of wait_iff_congested() still
go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
how reclaim is behaving and indicators so far are that reclaim is not hurt
by introducing wait_iff_congested().

> > 
> > > Why I pointed out is that you added new rule and I hope let others know
> > > this change since they have a good idea or any opinions. 
> > > I think it's a one of roles as reviewer.
> > > 
> > 
> > Of course.
> > 
> > > > 
> > > > > 1. Without considering various speed class storage, could we fix it as half of inactive?
> > > > 
> > > > We don't really have a good means of identifying speed classes of
> > > > storage. Worse, we are considering on a zone-basis here, not a BDI
> > > > basis. The pages being written back in the zone could be backed by
> > > > anything so we cannot make decisions based on BDI speed.
> > > 
> > > True. So it's why I have below question.
> > > As you said, we don't have enough information in vmscan.
> > > So I am not sure how effective such semi-sensible decision is. 
> > > 
> > 
> > What additional metrics would you apply than the ones I used in the
> > leader mail?
> 
> The effectiveness of the (writeback < inactive / 2) heuristic.
> 

Define effectiveness.

In the reports I gave, I reported on the sleep times and whether the full
timeout was slept or not. Sleep times are reduced while not negatively
impacting reclaim.

> > 
> > > I think best is to throttle in page-writeback well. 
> > 
> > I do not think there is a problem as such in page writeback throttling.
> > The problem is that we are going to sleep without any congestion or without
> > writes in progress. We sleep for a full timeout in this case for no reason
> > and this is what I'm trying to avoid.
> 
> Yes. I agree.
> Just my concern is the heuristic accuracy I mentioned.
> In your previous version, you didn't add the heuristic.

In the previous version, I also changed all callers of congestion_wait(). V1
simply was not that great a patch and Johannes pointed out that I wasn't
measuring the scanning/reclaim ratios to see how reclaim was impacted. The
reports now include this data and things are looking better.

> But you suddenly added it in this version.
> So I think you have some reason for adding it in this version.
> Please write down the rationale and data if you have them.
> 

The leader has a large amount of data on how this and the other patches
affected results for a good variety of workloads.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-10 10:25             ` KOSAKI Motohiro
@ 2010-09-13  9:14               ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13  9:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel, Linux Kernel List,
	Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
	Andrea Arcangeli, Dave Chinner, Chris Mason, Christoph Hellwig,
	Andrew Morton

On Fri, Sep 10, 2010 at 07:25:43PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Sep 09, 2010 at 01:13:22PM +0900, KOSAKI Motohiro wrote:
> > > > On Thu, 9 Sep 2010 12:04:48 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > 
> > > > > On Mon,  6 Sep 2010 11:47:28 +0100
> > > > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > > > 
> > > > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > > 
> > > > > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > > > > pages even if page is locked. This patch uses lock_page() instead of
> > > > > > trylock_page() in this case.
> > > > > > 
> > > > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > 
> > > > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > > 
> > > > Ah......but can't this change cause dead lock ??
> > > 
> > > Yes, this patch is purely crappy. please drop. I guess I was poisoned
> > > by poisonous mushroom of Mario Bros.
> > > 
> > 
> > Lets be clear on what the exact dead lock conditions are. The ones I had
> > thought about when I felt this patch was ok were;
> > 
> > o We are not holding the LRU lock (or any lock, we just called cond_resched())
> > o We do not have another page locked because we cannot lock multiple pages
> > o Kswapd will never be in LUMPY_MODE_SYNC so it is not getting blocked
> > o lock_page() itself is not allocating anything that we could recurse on
> 
> True, all.
> 
> > 
> > One potential dead lock would be if the direct reclaimer held a page
> > lock and ended up here but is that situation even allowed?
> 
> example, 
> 
> __do_fault()
> {
> (snip)
>         if (unlikely(!(ret & VM_FAULT_LOCKED)))
>                 lock_page(vmf.page);
>         else
>                 VM_BUG_ON(!PageLocked(vmf.page));
> 
>         /*
>          * Should we do an early C-O-W break?
>          */
>         page = vmf.page;
>         if (flags & FAULT_FLAG_WRITE) {
>                 if (!(vma->vm_flags & VM_SHARED)) {
>                         anon = 1;
>                         if (unlikely(anon_vma_prepare(vma))) {
>                                 ret = VM_FAULT_OOM;
>                                 goto out;
>                         }
>                         page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
>                                                 vma, address);
> 

Correct, this is a problem. I had already dropped the patch, but thanks for
pointing out the deadlock as I had missed this case. Nothing stops the page
being faulted from being sent to shrink_page_list() when alloc_page_vma()
is called. The deadlock might be hard to hit, but it's there.
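
Spelling out the chain for clarity (an illustration based on the __do_fault()
excerpt above, not a trace from a real system):

	__do_fault()
	    lock_page(vmf.page)          <- task now holds the lock on page P
	    alloc_page_vma()
	        ... enters direct reclaim under memory pressure ...
	            shrink_page_list()
	                lock_page(P)     <- P can be on the isolated list, so
	                                    the task sleeps forever on a lock
	                                    it already holds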

> 
> Afaik, detailed rule is,
> 
> o kswapd can call lock_page() because they never take page lock outside vmscan

lock_page_nosync(), as you point out in your next mail. While kswapd can call
it, it shouldn't, because kswapd normally avoids stalls; it would not deadlock
as a result of calling it though.

> o if try_lock() is successed, we can call lock_page_nosync() against its page after unlock.
>   because the task have gurantee of no lock taken.
> o otherwise, direct reclaimer can't call lock_page(). the task may have a lock already.
> 

I think the safer bet is simply to say "direct reclaimers should not
call lock_page() because the fault path could be holding a lock on that
page already".

Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-13  8:55               ` Mel Gorman
@ 2010-09-13  9:48                 ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-13  9:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 5:55 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Sep 13, 2010 at 12:37:44AM +0900, Minchan Kim wrote:
>> > > > > > <SNIP>
>> > > > > >
>> > > > > > + * in sleeping but cond_resched() is called in case the current process has
>> > > > > > + * consumed its CPU quota.
>> > > > > > + */
>> > > > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
>> > > > > > +{
>> > > > > > +   long ret;
>> > > > > > +   unsigned long start = jiffies;
>> > > > > > +   DEFINE_WAIT(wait);
>> > > > > > +   wait_queue_head_t *wqh = &congestion_wqh[sync];
>> > > > > > +
>> > > > > > +   /*
>> > > > > > +    * If there is no congestion, check the amount of writeback. If there
>> > > > > > +    * is no significant writeback and no congestion, just cond_resched
>> > > > > > +    */
>> > > > > > +   if (atomic_read(&nr_bdi_congested[sync]) == 0) {
>> > > > > > +           unsigned long inactive, writeback;
>> > > > > > +
>> > > > > > +           inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
>> > > > > > +                           zone_page_state(zone, NR_INACTIVE_ANON);
>> > > > > > +           writeback = zone_page_state(zone, NR_WRITEBACK);
>> > > > > > +
>> > > > > > +           /*
>> > > > > > +            * If less than half the inactive list is being written back,
>> > > > > > +            * reclaim might as well continue
>> > > > > > +            */
>> > > > > > +           if (writeback < inactive / 2) {
>> > > > >
>> > > > > I am not sure this is best.
>> > > > >
>> > > >
>> > > > I'm not saying it is. The objective is to identify a situation where
>> > > > sleeping until the next write or congestion clears is pointless. We have
>> > > > already identified that we are not congested so the question is "are we
>> > > > writing a lot at the moment?". The assumption is that if there is a lot
>> > > > of writing going on, we might as well sleep until one completes rather
>> > > > than reclaiming more.
>> > > >
>> > > > This is the first effort at identifying pointless sleeps. Better ones
>> > > > might be identified in the future but that shouldn't stop us making a
>> > > > semi-sensible decision now.
>> > >
>> > > nr_bdi_congested is no problem since we have used it for a long time.
>> > > But you added new rule about writeback.
>> > >
>> >
>> > Yes, I'm trying to add a new rule about throttling in the page allocator
>> > and from vmscan. As you can see from the results in the leader, we are
>> > currently sleeping more than we need to.
>>
>> I can see the about avoiding congestion_wait but can't find about
>> (writeback < incative / 2) hueristic result.
>>
>
> See the leader and each of the report sections entitled
> "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
> how sleep times are affected.
>
> "congest waited" are waits due to calling congestion_wait. "conditional waited"
> are those related to wait_iff_congested(). As you will see from the reports,
> sleep times are reduced overall while callers of wait_iff_congested() still
> go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
> how reclaim is behaving and indicators so far are that reclaim is not hurt
> by introducing wait_iff_congested().

I saw the result.
It shows the effectiveness of _both_ nr_bdi_congested and
(writeback < inactive/2) together.
What I am asking about is the effectiveness of (writeback < inactive/2) _alone_.
If we removed the (writeback < inactive / 2) check and unconditionally
returned, how would the behaviour change?

Am I misunderstanding your report in the leader?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-13  9:48                 ` Minchan Kim
@ 2010-09-13 10:07                   ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 10:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 06:48:10PM +0900, Minchan Kim wrote:
> >> > > > <SNIP>
> >> > > > I'm not saying it is. The objective is to identify a situation where
> >> > > > sleeping until the next write or congestion clears is pointless. We have
> >> > > > already identified that we are not congested so the question is "are we
> >> > > > writing a lot at the moment?". The assumption is that if there is a lot
> >> > > > of writing going on, we might as well sleep until one completes rather
> >> > > > than reclaiming more.
> >> > > >
> >> > > > This is the first effort at identifying pointless sleeps. Better ones
> >> > > > might be identified in the future but that shouldn't stop us making a
> >> > > > semi-sensible decision now.
> >> > >
> >> > > nr_bdi_congested is no problem since we have used it for a long time.
> >> > > But you added new rule about writeback.
> >> > >
> >> >
> >> > Yes, I'm trying to add a new rule about throttling in the page allocator
> >> > and from vmscan. As you can see from the results in the leader, we are
> >> > currently sleeping more than we need to.
> >>
> >> I can see the about avoiding congestion_wait but can't find about
> >> (writeback < incative / 2) hueristic result.
> >>
> >
> > See the leader and each of the report sections entitled
> > "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
> > how sleep times are affected.
> >
> > "congest waited" are waits due to calling congestion_wait. "conditional waited"
> > are those related to wait_iff_congested(). As you will see from the reports,
> > sleep times are reduced overall while callers of wait_iff_congested() still
> > go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
> > how reclaim is behaving and indicators so far are that reclaim is not hurt
> > by introducing wait_iff_congested().
> 
> I saw  the result.
> It was a result about effectiveness _both_ nr_bdi_congested and
> (writeback < inactive/2).
> What I mean is just effectiveness (writeback < inactive/2) _alone_.

I didn't measure it because such a change would mean that wait_iff_congested()
ignored BDI congestion. If we were reclaiming on a NUMA machine, for example,
a BDI could get flooded with requests if we only checked the ratio of one zone
while little writeback was happening in that zone at the time. It did not seem
like a good idea to ignore congestion.

> If we remove (writeback < inactive / 2) check and unconditionally
> return, how does the behavior changed?
> 

Based on just the workload Johannes sent, scanning and completion times both
increased without any improvement in the scanning/reclaim ratio (a bad result),
which is why this logic was introduced: back off when some writeback is taking
place even if the BDI is not congested.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-13 10:07                   ` Mel Gorman
@ 2010-09-13 10:20                     ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-13 10:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 7:07 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Sep 13, 2010 at 06:48:10PM +0900, Minchan Kim wrote:
>> >> > > > <SNIP>
>> >> > > > I'm not saying it is. The objective is to identify a situation where
>> >> > > > sleeping until the next write or congestion clears is pointless. We have
>> >> > > > already identified that we are not congested so the question is "are we
>> >> > > > writing a lot at the moment?". The assumption is that if there is a lot
>> >> > > > of writing going on, we might as well sleep until one completes rather
>> >> > > > than reclaiming more.
>> >> > > >
>> >> > > > This is the first effort at identifying pointless sleeps. Better ones
>> >> > > > might be identified in the future but that shouldn't stop us making a
>> >> > > > semi-sensible decision now.
>> >> > >
>> >> > > nr_bdi_congested is no problem since we have used it for a long time.
>> >> > > But you added new rule about writeback.
>> >> > >
>> >> >
>> >> > Yes, I'm trying to add a new rule about throttling in the page allocator
>> >> > and from vmscan. As you can see from the results in the leader, we are
>> >> > currently sleeping more than we need to.
>> >>
>> >> I can see the about avoiding congestion_wait but can't find about
>> >> (writeback < incative / 2) hueristic result.
>> >>
>> >
>> > See the leader and each of the report sections entitled
>> > "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
>> > how sleep times are affected.
>> >
>> > "congest waited" are waits due to calling congestion_wait. "conditional waited"
>> > are those related to wait_iff_congested(). As you will see from the reports,
>> > sleep times are reduced overall while callers of wait_iff_congested() still
>> > go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
>> > how reclaim is behaving and indicators so far are that reclaim is not hurt
>> > by introducing wait_iff_congested().
>>
>> I saw  the result.
>> It was a result about effectiveness _both_ nr_bdi_congested and
>> (writeback < inactive/2).
>> What I mean is just effectiveness (writeback < inactive/2) _alone_.
>
> I didn't measured it because such a change means that wait_iff_congested()
> ignored BDI congestion. If we were reclaiming on a NUMA machine for example,
> it could mean that a BDI gets flooded with requests if we only checked the
> ratios of one zone if little writeback was happening in that zone at the
> time. It did not seem like a good idea to ignore congestion.

You seem to have misunderstood my point.
Sorry for the unclear sentence.

I don't mean we should ignore congestion.
First of all, we should consider BDI congestion.
What I meant is whether we need the (nr_writeback < nr_inactive / 2)
heuristic on top of the BDI congestion check.
It wasn't in the previous version of your patch but it showed up in this one,
so I assume you have some evidence for why we should add such a heuristic.

>
>> If we remove (writeback < inactive / 2) check and unconditionally
>> return, how does the behavior changed?
>>
>
> Based on just the workload Johannes sent, scanning and completion times both
> increased without any improvement in the scanning/reclaim ratio (a bad result)
> hence why this logic was introduced to back off where there is some
> writeback taking place even if the BDI is not congested.

Yes. That's what I want. At the least, the function's comment should explain
it so the logic can be understood. In addition, it would be better to add the
numbers showing how well it backs off.
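
For example, something along these lines above the check (a sketch of the kind
of comment being asked for; the exact wording and numbers would come from the
test reports):

	/*
	 * If less than half of the zone's inactive pages are under
	 * writeback, sleeping is pointless and reclaim should continue.
	 * Above that, enough writeback is in flight that backing off is
	 * worthwhile: dropping this check increased scan and completion
	 * times on the workload Johannes posted without improving the
	 * scanning/reclaim ratio.
	 */
	if (writeback < inactive / 2) {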


>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
  2010-09-13 10:20                     ` Minchan Kim
@ 2010-09-13 10:30                       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 10:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 07:20:37PM +0900, Minchan Kim wrote:
> On Mon, Sep 13, 2010 at 7:07 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Mon, Sep 13, 2010 at 06:48:10PM +0900, Minchan Kim wrote:
> >> >> > > > <SNIP>
> >> >> > > > I'm not saying it is. The objective is to identify a situation where
> >> >> > > > sleeping until the next write or congestion clears is pointless. We have
> >> >> > > > already identified that we are not congested so the question is "are we
> >> >> > > > writing a lot at the moment?". The assumption is that if there is a lot
> >> >> > > > of writing going on, we might as well sleep until one completes rather
> >> >> > > > than reclaiming more.
> >> >> > > >
> >> >> > > > This is the first effort at identifying pointless sleeps. Better ones
> >> >> > > > might be identified in the future but that shouldn't stop us making a
> >> >> > > > semi-sensible decision now.
> >> >> > >
> >> >> > > nr_bdi_congested is no problem since we have used it for a long time.
> >> >> > > But you added new rule about writeback.
> >> >> > >
> >> >> >
> >> >> > Yes, I'm trying to add a new rule about throttling in the page allocator
> >> >> > and from vmscan. As you can see from the results in the leader, we are
> >> >> > currently sleeping more than we need to.
> >> >>
> >> >> I can see the about avoiding congestion_wait but can't find about
> >> >> (writeback < incative / 2) hueristic result.
> >> >>
> >> >
> >> > See the leader and each of the report sections entitled
> >> > "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
> >> > how sleep times are affected.
> >> >
> >> > "congest waited" are waits due to calling congestion_wait. "conditional waited"
> >> > are those related to wait_iff_congested(). As you will see from the reports,
> >> > sleep times are reduced overall while callers of wait_iff_congested() still
> >> > go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
> >> > how reclaim is behaving and indicators so far are that reclaim is not hurt
> >> > by introducing wait_iff_congested().
> >>
> >> I saw  the result.
> >> It was a result about effectiveness _both_ nr_bdi_congested and
> >> (writeback < inactive/2).
> >> What I mean is just effectiveness (writeback < inactive/2) _alone_.
> >
> > I didn't measured it because such a change means that wait_iff_congested()
> > ignored BDI congestion. If we were reclaiming on a NUMA machine for example,
> > it could mean that a BDI gets flooded with requests if we only checked the
> > ratios of one zone if little writeback was happening in that zone at the
> > time. It did not seem like a good idea to ignore congestion.
> 
> You seem to misunderstand my word.
> Sorry for not clear sentence.
> 
> I don't mean ignore congestion.
> First of all, we should consider congestion of bdi.
> My meant is whether we need adding up (nr_writeback < nr_inacive /2)
> heuristic plus congestion bdi.

Early tests indicated "yes".

> It wasn't previous version in your patch but it showed up in this version.
> So I thought apparently you have any evidence why we should add such heuristic.
> 

Only the feedback from the first patch where Johannes posted a workload that
did exhibit a problem. Isolated tests on just that workload led to the 
(nr_writeback < inactive / 2) change.

> >
> >> If we remove (writeback < inactive / 2) check and unconditionally
> >> return, how does the behavior changed?
> >>
> >
> > Based on just the workload Johannes sent, scanning and completion times both
> > increased without any improvement in the scanning/reclaim ratio (a bad result)
> > hence why this logic was introduced to back off where there is some
> > writeback taking place even if the BDI is not congested.
> 
> Yes. That's what I want. At least, comment of function should have it
> to understand the logic.  In addition, It would be better to add the
> number to show how it back off well.
> 

Very well. I'll hold off posting v2 of the series for now then, because
producing such results takes many hours and my machines are currently busy.
Hopefully I'll have something by Wednesday.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-13 13:31     ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 13:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

Mel,

Sorry for being late, I'm doing quite a lot of prep work these days ;)

On Mon, Sep 06, 2010 at 06:47:32PM +0800, Mel Gorman wrote:
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
> 
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back.  If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
> 
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
>  1 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ff52b46..408c101 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #define scanning_global_lru(sc)	(1)
>  #endif
>  
> +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
>  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
>  						  struct scan_control *sc)
>  {
> @@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>   * shrink_page_list() returns the number of reclaimed pages
>   */
>  static unsigned long shrink_page_list(struct list_head *page_list,
> -				      struct scan_control *sc)
> +					struct scan_control *sc,
> +					unsigned long *nr_still_dirty)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
>  	int pgactivate = 0;
> +	unsigned long nr_dirty = 0;
>  	unsigned long nr_reclaimed = 0;
>  
>  	cond_resched();
> @@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		}
>  
>  		if (PageDirty(page)) {
> +			/*
> +			 * Only kswapd can writeback filesystem pages to
> +			 * avoid risk of stack overflow
> +			 */
> +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> +				nr_dirty++;
> +				goto keep_locked;
> +			}
> +
>  			if (references == PAGEREF_RECLAIM_CLEAN)
>  				goto keep_locked;
>  			if (!may_enter_fs)
> @@ -908,6 +922,8 @@ keep_lumpy:
>  	free_page_list(&free_pages);
>  
>  	list_splice(&ret_pages, page_list);
> +
> +	*nr_still_dirty = nr_dirty;
>  	count_vm_events(PGACTIVATE, pgactivate);
>  	return nr_reclaimed;
>  }
> @@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
>  	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
>  		return false;
>  
> +	/* If we cannot writeback, there is no point stalling */
> +	if (!sc->may_writepage)
> +		return false;
> +
>  	/* If we have relaimed everything on the isolated list, no stall */
>  	if (nr_freed == nr_taken)
>  		return false;
> @@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  			struct scan_control *sc, int priority, int file)
>  {
>  	LIST_HEAD(page_list);
> +	LIST_HEAD(putback_list);
>  	unsigned long nr_scanned;
>  	unsigned long nr_reclaimed = 0;
>  	unsigned long nr_taken;
>  	unsigned long nr_anon;
>  	unsigned long nr_file;
> +	unsigned long nr_dirty;
>  
>  	while (unlikely(too_many_isolated(zone, file, sc))) {
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>  
>  	spin_unlock_irq(&zone->lru_lock);
>  
> -	nr_reclaimed = shrink_page_list(&page_list, sc);
> +	nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
>  
>  	/* Check if we should syncronously wait for writeback */
>  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {

It is possible to OOM if the LRU list is small and/or the storage is slow, so
that the flusher cannot clean enough pages before the LRU is fully scanned.

So we may need to wait on dirty/writeback pages for *order-0* direct reclaims
as well, when the priority goes rather low (such as < 3).
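
Something along these lines, perhaps (an untested sketch of the idea only; the
sc->order/priority/nr_dirty plumbing is assumed from the surrounding patch):

	/* also throttle order-0 direct reclaim once priority drops low */
	if (!current_is_kswapd() && sc->order == 0 &&
	    priority < 3 && nr_dirty)
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);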

> +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
>  		set_lumpy_reclaim_mode(priority, sc, true);
> -		nr_reclaimed += shrink_page_list(&page_list, sc);
> +
> +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> +			struct page *page, *tmp;
> +

> +			/* Take off the clean pages marked for activation */
> +			list_for_each_entry_safe(page, tmp, &page_list, lru) {
> +				if (PageDirty(page) || PageWriteback(page))
> +					continue;
> +
> +				list_del(&page->lru);
> +				list_add(&page->lru, &putback_list);
> +			}

nitpick: I guess the above loop is optional code to avoid the overhead of
shrink_page_list() repeatedly going through some unfreeable pages?
Considering this is the slow code path, I'd prefer to keep the code simple
rather than do such optimizations.

> +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);

how about 
                        if (!laptop_mode)
                                wakeup_flusher_threads(nr_dirty);

> +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +
> +			nr_reclaimed = shrink_page_list(&page_list, sc,
> +							&nr_dirty);
> +		}
>  	}
>  
> +	list_splice(&putback_list, &page_list);
> +
>  	local_irq_disable();
>  	if (current_is_kswapd())
>  		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-06 10:47   ` Mel Gorman
@ 2010-09-13 13:48     ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 13:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

> +	/*
> +	 * If reclaim is encountering dirty pages, it may be because
> +	 * dirty pages are reaching the end of the LRU even though the
> +	 * dirty_ratio may be satisfied. In this case, wake flusher
> +	 * threads to pro-actively clean up to a maximum of
> +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> +	 * !may_writepage indicates that this is a direct reclaimer in
> +	 * laptop mode avoiding disk spin-ups
> +	 */
> +	if (file && nr_dirty_seen && sc->may_writepage)
> +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));

wakeup_flusher_threads() works, but seems not the pertinent one.

- locally, it needs some luck to clean the pages that direct reclaim is waiting on
- globally, it cleans up some dirty pages, however some heavy dirtier
  may quickly create new ones..

So how about taking the approaches in these patches?

- "[PATCH 4/4] vmscan: transfer async file writeback to the flusher"
- "[PATCH 15/17] mm: lower soft dirty limits on memory pressure"

In particular the first patch should work very nicely with memcg, as
all pages of an inode typically belong to the same memcg. So doing
write-around helps clean lots of dirty pages in the target LRU list in
one shot.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-09-13 13:31     ` Wu Fengguang
@ 2010-09-13 13:55       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 13:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 09:31:56PM +0800, Wu Fengguang wrote:
> Mel,
> 
> Sorry for being late, I'm doing pretty much prework these days ;)
> 

No worries, I'm all over the place at the moment so cannot lecture on
response times :)

> On Mon, Sep 06, 2010 at 06:47:32PM +0800, Mel Gorman wrote:
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back.  If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> > 
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/vmscan.c |   49 ++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 files changed, 46 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ff52b46..408c101 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> >  #define scanning_global_lru(sc)	(1)
> >  #endif
> >  
> > +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> >  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> >  						  struct scan_control *sc)
> >  {
> > @@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >   * shrink_page_list() returns the number of reclaimed pages
> >   */
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> > -				      struct scan_control *sc)
> > +					struct scan_control *sc,
> > +					unsigned long *nr_still_dirty)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> >  	int pgactivate = 0;
> > +	unsigned long nr_dirty = 0;
> >  	unsigned long nr_reclaimed = 0;
> >  
> >  	cond_resched();
> > @@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		}
> >  
> >  		if (PageDirty(page)) {
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd()) {
> > +				nr_dirty++;
> > +				goto keep_locked;
> > +			}
> > +
> >  			if (references == PAGEREF_RECLAIM_CLEAN)
> >  				goto keep_locked;
> >  			if (!may_enter_fs)
> > @@ -908,6 +922,8 @@ keep_lumpy:
> >  	free_page_list(&free_pages);
> >  
> >  	list_splice(&ret_pages, page_list);
> > +
> > +	*nr_still_dirty = nr_dirty;
> >  	count_vm_events(PGACTIVATE, pgactivate);
> >  	return nr_reclaimed;
> >  }
> > @@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
> >  	if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> >  		return false;
> >  
> > +	/* If we cannot writeback, there is no point stalling */
> > +	if (!sc->may_writepage)
> > +		return false;
> > +
> >  	/* If we have reclaimed everything on the isolated list, no stall */
> >  	if (nr_freed == nr_taken)
> >  		return false;
> > @@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  			struct scan_control *sc, int priority, int file)
> >  {
> >  	LIST_HEAD(page_list);
> > +	LIST_HEAD(putback_list);
> >  	unsigned long nr_scanned;
> >  	unsigned long nr_reclaimed = 0;
> >  	unsigned long nr_taken;
> >  	unsigned long nr_anon;
> >  	unsigned long nr_file;
> > +	unsigned long nr_dirty;
> >  
> >  	while (unlikely(too_many_isolated(zone, file, sc))) {
> >  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >  
> >  	spin_unlock_irq(&zone->lru_lock);
> >  
> > -	nr_reclaimed = shrink_page_list(&page_list, sc);
> > +	nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
> >  
> >  	/* Check if we should synchronously wait for writeback */
> >  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> 
> It is possible to OOM if the LRU list is small and/or the storage is slow, so
> that the flusher cannot clean enough pages before the LRU is fully scanned.
> 

To go OOM, nr_reclaimed would have to be 0 and for that, the entire list
would have to be dirty or unreclaimable. If that situation happens, is
the dirty throttling not also broken?

> So we may need do waits on dirty/writeback pages on *order-0*
> direct reclaims, when priority goes rather low (such as < 3).
> 

In case this is really necessary, the necessary stalling could be done by
removing the check for lumpy reclaim in should_reclaim_stall().  What do
you think of the following replacement?

/*
 * Returns true if the caller should wait to clean dirty/writeback pages.
 *
 * If we are direct reclaiming for contiguous pages and we do not reclaim
 * everything in the list, try again and wait for writeback IO to complete.
 * This will stall high-order allocations noticeably. Only do that when we really
 * need to free the pages under high memory pressure.
 *
 * Alternatively, if priority is getting high, it may be because there are
 * too many dirty pages on the LRU. Rather than returning nr_reclaimed == 0
 * and potentially causing an OOM, we stall on writeback.
 */
static inline bool should_reclaim_stall(unsigned long nr_taken,
                                        unsigned long nr_freed,
                                        int priority,
                                        struct scan_control *sc)
{
        int stall_priority;

        /* kswapd should not stall on sync IO */
        if (current_is_kswapd())
                return false;

        /* If we cannot writeback, there is no point stalling */
        if (!sc->may_writepage)
                return false;

        /* If we have reclaimed everything on the isolated list, no stall */
        if (nr_freed == nr_taken)
                return false;

        /*
         * For high-order allocations, there are two stall thresholds.
         * High-cost allocations stall immediately whereas lower
         * order allocations such as stacks require the scanning
         * priority to be much higher before stalling.
         */
        if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
                stall_priority = DEF_PRIORITY;
        else
                stall_priority = DEF_PRIORITY / 3;

        return priority <= stall_priority;
}


> > +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> >  		set_lumpy_reclaim_mode(priority, sc, true);
> > -		nr_reclaimed += shrink_page_list(&page_list, sc);
> > +
> > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > +			struct page *page, *tmp;
> > +
> 
> > +			/* Take off the clean pages marked for activation */
> > +			list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > +				if (PageDirty(page) || PageWriteback(page))
> > +					continue;
> > +
> > +				list_del(&page->lru);
> > +				list_add(&page->lru, &putback_list);
> > +			}
> 
> nitpick: I guess the above loop is optional code to avoid overheads
> of shrink_page_list() repeatedly going through some unfreeable pages?

Pretty much, if they are to be activated, there is no point trying to reclaim
them again. It's unnecessary overhead. A strong motivation for this
series is to reduce overheads in the reclaim paths and unnecessary
retrying of unfreeable pages.

> Considering this is the slow code path, I'd prefer to keep the code
> simple than to do such optimizations.
> 
> > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> 
> how about 
>                         if (!laptop_mode)
>                                 wakeup_flusher_threads(nr_dirty);
> 

It's not the same thing. wakeup_flusher_threads(0) in laptop_mode is to
clean all pages if some need cleaning. laptop_mode cleans all pages to
minimise disk spinups.
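
To spell out the difference (an illustrative sketch reusing the
identifiers from the quoted hunk, not a proposed change):

	/* As posted: in laptop_mode the flusher is still woken, but with
	 * nr_pages == 0, i.e. "write back everything", so the disk spins
	 * up once and all dirty pages are cleaned in one go. */
	wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);

	/* Suggested form: in laptop_mode the flusher is not woken at all,
	 * which is a behavioural change, not just a cleanup. */
	if (!laptop_mode)
		wakeup_flusher_threads(nr_dirty);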

> > +			wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > +
> > +			nr_reclaimed = shrink_page_list(&page_list, sc,
> > +							&nr_dirty);
> > +		}
> >  	}
> >  
> > +	list_splice(&putback_list, &page_list);
> > +
> >  	local_irq_disable();
> >  	if (current_is_kswapd())
> >  		__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> > -- 
> > 1.7.1
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-13 13:48     ` Wu Fengguang
@ 2010-09-13 14:10       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 14:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 09:48:45PM +0800, Wu Fengguang wrote:
> > +	/*
> > +	 * If reclaim is encountering dirty pages, it may be because
> > +	 * dirty pages are reaching the end of the LRU even though the
> > +	 * dirty_ratio may be satisfied. In this case, wake flusher
> > +	 * threads to pro-actively clean up to a maximum of
> > +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > +	 * !may_writepage indicates that this is a direct reclaimer in
> > +	 * laptop mode avoiding disk spin-ups
> > +	 */
> > +	if (file && nr_dirty_seen && sc->may_writepage)
> > +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> 
> wakeup_flusher_threads() works, but seems not the pertinent one.
> 
> - locally, it needs some luck to clean the pages that direct reclaim is waiting on

There is a certain amount of luck involved, but it depends on there being a
correlation between old inodes and old pages on the LRU list. As long as that
correlation is accurate, some relevant pages will get cleaned.  Testing on
previously released versions of this patch did show that the percentage of
dirty pages encountered during reclaim was reduced as a result of this patch.

> - globally, it cleans up some dirty pages, however some heavy dirtier
>   may quickly create new ones..
> 
> So how about taking the approaches in these patches?
> 
> - "[PATCH 4/4] vmscan: transfer async file writeback to the flusher"
> - "[PATCH 15/17] mm: lower soft dirty limits on memory pressure"
> 

There is a lot going on in those patches. It's going to take me a while to
figure them out and formulate an opinion.

> In particular the first patch should work very nicely with memcg, as
> all pages of an inode typically belong to the same memcg. So doing
> write-around helps clean lots of dirty pages in the target LRU list in
> one shot.
> 

It might but as there is also a correlation between old dirty inodes and
the location of dirty pages, it is tricky to predict if it is better and
if so, by how much.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-09-13 13:55       ` Mel Gorman
@ 2010-09-13 14:33         ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 14:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

> > >  	/* Check if we should synchronously wait for writeback */
> > >  	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> > 
> > It is possible to OOM if the LRU list is small and/or the storage is slow, so
> > that the flusher cannot clean enough pages before the LRU is fully scanned.
> > 
> 
> To go OOM, nr_reclaimed would have to be 0 and for that, the entire list
> would have to be dirty or unreclaimable. If that situation happens, is
> the dirty throttling not also broken?

My worry is that even if the dirty throttling limit were instantly set
to 0, it may still take time to knock down the number of dirty pages.
Think of 500MB of dirty pages waiting to be flushed to a slow USB stick.

> > So we may need do waits on dirty/writeback pages on *order-0*
> > direct reclaims, when priority goes rather low (such as < 3).
> > 
> 
> In case this is really necessary, the necessary stalling could be done by
> removing the check for lumpy reclaim in should_reclaim_stall().  What do
> you think of the following replacement?

I merely want to provide a guarantee, so it may be enough to add this:

        if (nr_freed == nr_taken)
                return false;

+       if (!priority)
+               return true;

This ensures the last full LRU scan will do necessary waits to prevent
the OOM.
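
For illustration only, the guarantee would slot in roughly like this (an
untested sketch built from the checks already quoted in this thread; the
placement of the !priority test and the final return are assumptions):

static inline bool should_reclaim_stall(unsigned long nr_taken,
					unsigned long nr_freed,
					int priority,
					struct scan_control *sc)
{
	/* kswapd should not stall on sync IO */
	if (current_is_kswapd())
		return false;

	/* If we cannot writeback, there is no point stalling */
	if (!sc->may_writepage)
		return false;

	/* If we have reclaimed everything on the isolated list, no stall */
	if (nr_freed == nr_taken)
		return false;

	/* Last full LRU scan: wait rather than risk a premature OOM */
	if (!priority)
		return true;

	/* Otherwise only lumpy (high-order) reclaim stalls */
	return sc->lumpy_reclaim_mode != LUMPY_MODE_NONE;
}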

> /*
>  * Returns true if the caller should wait to clean dirty/writeback pages.
>  *
>  * If we are direct reclaiming for contiguous pages and we do not reclaim
>  * everything in the list, try again and wait for writeback IO to complete.
>  * This will stall high-order allocations noticeably. Only do that when we really
>  * need to free the pages under high memory pressure.
>  *
>  * Alternatively, if priority is getting high, it may be because there are
>  * too many dirty pages on the LRU. Rather than returning nr_reclaimed == 0
>  * and potentially causing an OOM, we stall on writeback.
>  */
> static inline bool should_reclaim_stall(unsigned long nr_taken,
>                                         unsigned long nr_freed,
>                                         int priority,
>                                         struct scan_control *sc)
> {
>         int stall_priority;
> 
>         /* kswapd should not stall on sync IO */
>         if (current_is_kswapd())
>                 return false;
> 
>         /* If we cannot writeback, there is no point stalling */
>         if (!sc->may_writepage)
>                 return false;
> 
>         /* If we have reclaimed everything on the isolated list, no stall */
>         if (nr_freed == nr_taken)
>                 return false;
> 
>         /*
>          * For high-order allocations, there are two stall thresholds.
>          * High-cost allocations stall immediately whereas lower
>          * order allocations such as stacks require the scanning
>          * priority to be much higher before stalling.
>          */
>         if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>                 stall_priority = DEF_PRIORITY;
>         else
>                 stall_priority = DEF_PRIORITY / 3;
> 
>         return priority <= stall_priority;
> }
> 
> 
> > > +		int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > >  		set_lumpy_reclaim_mode(priority, sc, true);
> > > -		nr_reclaimed += shrink_page_list(&page_list, sc);
> > > +
> > > +		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > +			struct page *page, *tmp;
> > > +
> > 
> > > +			/* Take off the clean pages marked for activation */
> > > +			list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > > +				if (PageDirty(page) || PageWriteback(page))
> > > +					continue;
> > > +
> > > +				list_del(&page->lru);
> > > +				list_add(&page->lru, &putback_list);
> > > +			}
> > 
> > nitpick: I guess the above loop is optional code to avoid overheads
> > of shrink_page_list() repeatedly going through some unfreeable pages?
> 
> Pretty much, if they are to be activated, there is no point trying to reclaim
> them again. It's unnecessary overhead. A strong motivation for this
> series is to reduce overheads in the reclaim paths and unnecessary
> retrying of unfreeable pages.

We do so many waits in this loop that users will get upset by the
iowait stalls far more than by the CPU overhead. The best option is
always to avoid entering this loop in the first place, and if we
succeed at that, these lines of optimization will be nothing but
mind-benders for newbie developers.

> > Considering this is the slow code path, I'd prefer to keep the code
> > simple than to do such optimizations.
> > 
> > > +			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> > 
> > how about 
> >                         if (!laptop_mode)
> >                                 wakeup_flusher_threads(nr_dirty);
> > 
> 
> It's not the same thing. wakeup_flusher_threads(0) in laptop_mode is to
> clean all pages if some need cleaning. laptop_mode cleans all pages to
> minimise disk spinups.

Ah, that's fine then. I wonder if the flusher could be smart enough to
automatically extend the number of pages to write in laptop mode. That
could simplify some callers.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
  2010-09-13 14:10       ` Mel Gorman
@ 2010-09-13 14:41         ` Wu Fengguang
  -1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 14:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 13, 2010 at 10:10:46PM +0800, Mel Gorman wrote:
> On Mon, Sep 13, 2010 at 09:48:45PM +0800, Wu Fengguang wrote:
> > > +	/*
> > > +	 * If reclaim is encountering dirty pages, it may be because
> > > +	 * dirty pages are reaching the end of the LRU even though the
> > > +	 * dirty_ratio may be satisfied. In this case, wake flusher
> > > +	 * threads to pro-actively clean up to a maximum of
> > > +	 * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > > +	 * !may_writepage indicates that this is a direct reclaimer in
> > > +	 * laptop mode avoiding disk spin-ups
> > > +	 */
> > > +	if (file && nr_dirty_seen && sc->may_writepage)
> > > +		wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > 
> > wakeup_flusher_threads() works, but seems not the pertinent one.
> > 
> > - locally, it needs some luck to clean the pages that direct reclaim is waiting on
> 
> There is a certain amount of luck involved, but it depends on there being a
> correlation between old inodes and old pages on the LRU list. As long as that
> correlation is accurate, some relevant pages will get cleaned.  Testing on
> previously released versions of this patch did show that the percentage of
> dirty pages encountered during reclaim was reduced as a result of this patch.

Yup.

> > - globally, it cleans up some dirty pages, however some heavy dirtier
> >   may quickly create new ones..
> > 
> > So how about taking the approaches in these patches?
> > 
> > - "[PATCH 4/4] vmscan: transfer async file writeback to the flusher"
> > - "[PATCH 15/17] mm: lower soft dirty limits on memory pressure"
> > 
> 
> There is a lot going on in those patches. It's going to take me a while to
> figure them out and formulate an opinion.

OK. I also need some time to do other work :)

> > In particular the first patch should work very nicely with memcg, as
> > all pages of an inode typically belong to the same memcg. So doing
> > write-around helps clean lots of dirty pages in the target LRU list in
> > one shot.
> > 
> 
> It might but as there is also a correlation between old dirty inodes and
> the location of dirty pages, it is tricky to predict if it is better and
> if so, by how much.

It at least guarantees that the one page pageout() is running into gets cleaned :)
Others will depend on the locality/sequentiality of the workload. But
as the write-around pages are in the same LRU lists, the vmscan code
will hit them sooner or later.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
  2010-09-06 10:47 ` Mel Gorman
@ 2010-09-13 23:10   ` Minchan Kim
  -1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-13 23:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Mon, Sep 6, 2010 at 7:47 PM, Mel Gorman <mel@csn.ul.ie> wrote:

<snip>

>
> These are just the raw figures taken from /proc/vmstat. It's a rough measure
> of reclaim activity. Note that allocstall counts are higher because we
> are entering direct reclaim more often as a result of not sleeping in
> congestion. In itself, it's not necessarily a bad thing. It's easier to
> get a view of what happened from the vmscan tracepoint report.
>
> FTrace Reclaim Statistics: vmscan
>            micro-traceonly-v1r5-micro  micro-nocongest-v1r5-micro  micro-lowlumpy-v1r5-micro  micro-nodirect-v1r5-micro
>                traceonly-v1r5    nocongest-v1r5     lowlumpy-v1r5     nodirect-v1r5
> Direct reclaims                                152        941        967        729
> Direct reclaim pages scanned                507377    1404350    1332420    1450213
> Direct reclaim pages reclaimed               10968      72042      77186      41097
> Direct reclaim write file async I/O              0          0          0          0
> Direct reclaim write anon async I/O              0          0          0          0
> Direct reclaim write file sync I/O               0          0          0          0
> Direct reclaim write anon sync I/O               0          0          0          0
> Wake kswapd requests                        127195     241025     254825     188846
> Kswapd wakeups                                   6          1          1          1
> Kswapd pages scanned                       4210101    3345122    3427915    3306356
> Kswapd pages reclaimed                     2228073    2165721    2143876    2194611
> Kswapd reclaim write file async I/O              0          0          0          0
> Kswapd reclaim write anon async I/O              0          0          0          0
> Kswapd reclaim write file sync I/O               0          0          0          0
> Kswapd reclaim write anon sync I/O               0          0          0          0
> Time stalled direct reclaim (seconds)         7.60       3.03       3.24       3.43
> Time kswapd awake (seconds)                  12.46       9.46       9.56       9.40
>
> Total pages scanned                        4717478   4749472   4760335   4756569
> Total pages reclaimed                      2239041   2237763   2221062   2235708
> %age total pages scanned/reclaimed          47.46%    47.12%    46.66%    47.00%
> %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
> %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim        43.80%    21.38%    22.34%    23.46%
> Percentage Time kswapd Awake                79.92%    79.56%    79.20%    80.48%


There is a nitpick about the stalled reclaim time.
For example, in direct reclaim:

===
       trace_mm_vmscan_direct_reclaim_begin(order,
                               sc.may_writepage,
                               gfp_mask);

       nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

       trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
===

In this case, isn't this time an accumulated value?
My point is as follows.

Process A                     Process B
direct reclaim begin
do_try_to_free_pages
cond_resched
                              direct reclaim begin
                              do_try_to_free_pages
                              direct reclaim end
direct reclaim end


So A's result includes B's time, and the total stall time reported would be bigger than the real value.
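
To make the over-counting concrete, here is a hypothetical user-space
illustration (nothing to do with the real ftrace post-processing): two
stalls of 5s and 4s that overlap for 2s sum to 9s per-process, while the
wall-clock stall is only 7s.

#include <stdio.h>

struct interval { double begin, end; };

int main(void)
{
	/* Hypothetical stalls: A from t=0 to t=5, B from t=3 to t=7 */
	struct interval a = { 0.0, 5.0 }, b = { 3.0, 7.0 };

	/* What summing the per-process begin/end deltas reports */
	double summed = (a.end - a.begin) + (b.end - b.begin);	/* 9s */

	/* Wall-clock union of the two (overlapping) intervals */
	double lo = a.begin < b.begin ? a.begin : b.begin;
	double hi = a.end > b.end ? a.end : b.end;
	double wallclock = hi - lo;				/* 7s */

	printf("summed per-process stall: %.1fs\n", summed);
	printf("wall-clock stall:         %.1fs\n", wallclock);
	return 0;
}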


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
  2010-09-13  9:14               ` Mel Gorman
@ 2010-09-14 10:14                 ` KOSAKI Motohiro
  -1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14 10:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel,
	Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
	Wu Fengguang, Andrea Arcangeli, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

> > example, 
> > 
> > __do_fault()
> > {
> > (snip)
> >         if (unlikely(!(ret & VM_FAULT_LOCKED)))
> >                 lock_page(vmf.page);
> >         else
> >                 VM_BUG_ON(!PageLocked(vmf.page));
> > 
> >         /*
> >          * Should we do an early C-O-W break?
> >          */
> >         page = vmf.page;
> >         if (flags & FAULT_FLAG_WRITE) {
> >                 if (!(vma->vm_flags & VM_SHARED)) {
> >                         anon = 1;
> >                         if (unlikely(anon_vma_prepare(vma))) {
> >                                 ret = VM_FAULT_OOM;
> >                                 goto out;
> >                         }
> >                         page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> >                                                 vma, address);
> > 
> 
> Correct, this is a problem. I had already dropped the patch, but thanks for
> pointing out the deadlock; I was missing this case. Nothing stops the page
> being faulted from being sent to shrink_page_list() when alloc_page_vma()
> is called. The deadlock might be hard to hit, but it's there.

Yup, unfortunately.
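
As a minimal illustration of the self-deadlock (a userspace analogy only;
the pthread mutex and the fault_like()/reclaim_like() names are made up and
this is not kernel code): the faulting task holds the page lock across
alloc_page_vma(), the allocation enters direct reclaim, and reclaim then
sleeps on the very lock its own caller holds.

===
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t page_lock;

/* stands in for shrink_page_list() calling lock_page() on the faulted page */
static void reclaim_like(void)
{
        pthread_mutex_lock(&page_lock);         /* blocks forever: caller holds it */
        pthread_mutex_unlock(&page_lock);
}

/* stands in for __do_fault(): the page lock is held across the allocation */
static void fault_like(void)
{
        pthread_mutex_lock(&page_lock);
        reclaim_like();                         /* the allocation enters direct reclaim */
        pthread_mutex_unlock(&page_lock);
}

int main(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL); /* relocking blocks */
        pthread_mutex_init(&page_lock, &attr);

        fault_like();                           /* never returns */
        puts("unreachable");
        return 0;
}
===

Built with -pthread, this hangs in reclaim_like(), which is the same shape
as lock_page() in shrink_page_list() sleeping on a page the fault path
already locked.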



> > Afaik, the detailed rules are:
> > 
> > o kswapd can call lock_page() because it never takes a page lock outside vmscan
> 
> lock_page_nosync(), as you point out in your next mail. While kswapd can
> call it, it shouldn't, because kswapd normally avoids stalls; it would not
> deadlock as a result of calling it, though.

Agreed.


> > o if try_lock() succeeds, we can call lock_page_nosync() against that page after
> >   unlocking, because the task is then guaranteed to hold no page lock.
> > o otherwise, a direct reclaimer can't call lock_page(); the task may hold a lock already.
> > 
> 
> I think the safer bet is simply to say "direct reclaimers should not
> call lock_page() because the fault path could be holding a lock on that
> page already".

Yup, agreed.




^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-09-06 10:47   ` Mel Gorman
@ 2010-10-28 21:50     ` Christoph Hellwig
  -1 siblings, 0 replies; 133+ messages in thread
From: Christoph Hellwig @ 2010-10-28 21:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

Looks like this once again didn't get merged for 2.6.37.  Any reason
for that?


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
  2010-10-28 21:50     ` Christoph Hellwig
@ 2010-10-29 10:26       ` Mel Gorman
  -1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-10-29 10:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
	Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
	Christoph Hellwig, Andrew Morton

On Thu, Oct 28, 2010 at 05:50:46PM -0400, Christoph Hellwig wrote:
> Looks like this once again didn't get merged for 2.6.37.  Any reason
> for that?
> 

There are still concerns as to whether this is a good idea or whether we
are papering over the fact that there are too many dirty pages at the end
of the LRU. The tracepoints necessary to track the dirty pages encountered
went in this cycle, as did some writeback and congestion-waiting changes.
I was waiting for some of the writeback churn to die down before
revisiting this. The ideal point to reach is "we hardly ever encounter
dirty pages, so disabling direct writeback has no impact".

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 133+ messages in thread

end of thread, other threads:[~2010-10-29 10:27 UTC | newest]

Thread overview: 133+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-06 10:47 [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1 Mel Gorman
2010-09-06 10:47 ` Mel Gorman
2010-09-06 10:47 ` [PATCH 01/10] tracing, vmscan: Add trace events for LRU list shrinking Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-06 10:47 ` [PATCH 02/10] writeback: Account for time spent congestion_waited Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-06 10:47 ` [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-07 15:25   ` Minchan Kim
2010-09-07 15:25     ` Minchan Kim
2010-09-08 11:04     ` Mel Gorman
2010-09-08 11:04       ` Mel Gorman
2010-09-08 14:52       ` Minchan Kim
2010-09-08 14:52         ` Minchan Kim
2010-09-09  8:54         ` Mel Gorman
2010-09-09  8:54           ` Mel Gorman
2010-09-12 15:37           ` Minchan Kim
2010-09-12 15:37             ` Minchan Kim
2010-09-13  8:55             ` Mel Gorman
2010-09-13  8:55               ` Mel Gorman
2010-09-13  9:48               ` Minchan Kim
2010-09-13  9:48                 ` Minchan Kim
2010-09-13 10:07                 ` Mel Gorman
2010-09-13 10:07                   ` Mel Gorman
2010-09-13 10:20                   ` Minchan Kim
2010-09-13 10:20                     ` Minchan Kim
2010-09-13 10:30                     ` Mel Gorman
2010-09-13 10:30                       ` Mel Gorman
2010-09-08 21:23   ` Andrew Morton
2010-09-08 21:23     ` Andrew Morton
2010-09-09 10:43     ` Mel Gorman
2010-09-09 10:43       ` Mel Gorman
2010-09-09  3:02   ` KAMEZAWA Hiroyuki
2010-09-09  3:02     ` KAMEZAWA Hiroyuki
2010-09-09  8:58     ` Mel Gorman
2010-09-09  8:58       ` Mel Gorman
2010-09-06 10:47 ` [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait() Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-07 15:26   ` Minchan Kim
2010-09-07 15:26     ` Minchan Kim
2010-09-08  6:15   ` Johannes Weiner
2010-09-08  6:15     ` Johannes Weiner
2010-09-08 11:25   ` Wu Fengguang
2010-09-08 11:25     ` Wu Fengguang
2010-09-09  3:03   ` KAMEZAWA Hiroyuki
2010-09-09  3:03     ` KAMEZAWA Hiroyuki
2010-09-06 10:47 ` [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page() Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-07 15:28   ` Minchan Kim
2010-09-07 15:28     ` Minchan Kim
2010-09-08  6:16   ` Johannes Weiner
2010-09-08  6:16     ` Johannes Weiner
2010-09-08 11:28   ` Wu Fengguang
2010-09-08 11:28     ` Wu Fengguang
2010-09-09  3:04   ` KAMEZAWA Hiroyuki
2010-09-09  3:04     ` KAMEZAWA Hiroyuki
2010-09-09  3:15     ` KAMEZAWA Hiroyuki
2010-09-09  3:15       ` KAMEZAWA Hiroyuki
2010-09-09  3:25       ` Wu Fengguang
2010-09-09  3:25         ` Wu Fengguang
2010-09-09  4:13       ` KOSAKI Motohiro
2010-09-09  4:13         ` KOSAKI Motohiro
2010-09-09  9:22         ` Mel Gorman
2010-09-09  9:22           ` Mel Gorman
2010-09-10 10:25           ` KOSAKI Motohiro
2010-09-10 10:25             ` KOSAKI Motohiro
2010-09-10 10:33             ` KOSAKI Motohiro
2010-09-10 10:33               ` KOSAKI Motohiro
2010-09-10 10:33               ` KOSAKI Motohiro
2010-09-13  9:14             ` Mel Gorman
2010-09-13  9:14               ` Mel Gorman
2010-09-14 10:14               ` KOSAKI Motohiro
2010-09-14 10:14                 ` KOSAKI Motohiro
2010-09-06 10:47 ` [PATCH 06/10] vmscan: Narrow the scenarios lumpy reclaim uses synchrounous reclaim Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-09  3:14   ` KAMEZAWA Hiroyuki
2010-09-09  3:14     ` KAMEZAWA Hiroyuki
2010-09-06 10:47 ` [PATCH 07/10] vmscan: Remove dead code in shrink_inactive_list() Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-07 15:33   ` Minchan Kim
2010-09-07 15:33     ` Minchan Kim
2010-09-06 10:47 ` [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-07 15:37   ` Minchan Kim
2010-09-07 15:37     ` Minchan Kim
2010-09-08 11:12     ` Mel Gorman
2010-09-08 11:12       ` Mel Gorman
2010-09-08 14:58       ` Minchan Kim
2010-09-08 14:58         ` Minchan Kim
2010-09-08 11:37   ` Wu Fengguang
2010-09-08 11:37     ` Wu Fengguang
2010-09-08 12:50     ` Mel Gorman
2010-09-08 12:50       ` Mel Gorman
2010-09-08 13:14       ` Wu Fengguang
2010-09-08 13:14         ` Wu Fengguang
2010-09-08 13:27         ` Mel Gorman
2010-09-08 13:27           ` Mel Gorman
2010-09-09  3:17   ` KAMEZAWA Hiroyuki
2010-09-09  3:17     ` KAMEZAWA Hiroyuki
2010-09-06 10:47 ` [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-13 13:31   ` Wu Fengguang
2010-09-13 13:31     ` Wu Fengguang
2010-09-13 13:55     ` Mel Gorman
2010-09-13 13:55       ` Mel Gorman
2010-09-13 14:33       ` Wu Fengguang
2010-09-13 14:33         ` Wu Fengguang
2010-10-28 21:50   ` Christoph Hellwig
2010-10-28 21:50     ` Christoph Hellwig
2010-10-29 10:26     ` Mel Gorman
2010-10-29 10:26       ` Mel Gorman
2010-09-06 10:47 ` [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-09-06 10:47   ` Mel Gorman
2010-09-09  3:22   ` KAMEZAWA Hiroyuki
2010-09-09  3:22     ` KAMEZAWA Hiroyuki
2010-09-09  9:32     ` Mel Gorman
2010-09-09  9:32       ` Mel Gorman
2010-09-13  0:53       ` KAMEZAWA Hiroyuki
2010-09-13  0:53         ` KAMEZAWA Hiroyuki
2010-09-13 13:48   ` Wu Fengguang
2010-09-13 13:48     ` Wu Fengguang
2010-09-13 14:10     ` Mel Gorman
2010-09-13 14:10       ` Mel Gorman
2010-09-13 14:41       ` Wu Fengguang
2010-09-13 14:41         ` Wu Fengguang
2010-09-06 10:49 ` [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1 Mel Gorman
2010-09-06 10:49   ` Mel Gorman
2010-09-08  3:14 ` KOSAKI Motohiro
2010-09-08  3:14   ` KOSAKI Motohiro
2010-09-08  8:38   ` Mel Gorman
2010-09-08  8:38     ` Mel Gorman
2010-09-13 23:10 ` Minchan Kim
2010-09-13 23:10   ` Minchan Kim
