* [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
@ 2010-09-06 10:47 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
There have been numerous reports of stalls that pointed at the problem being
somewhere in the VM. The problems have multiple root causes, which makes it
tricky to justify dealing with any one of them in isolation, and fixes would
still need integration testing together. This patch series gathers together
three different patch sets which in combination should tackle some of the
root causes of the latency problems being reported.
The first patch improves vmscan latency by tracking when pages get reclaimed
by shrink_inactive_list. For this series, the most important result is
being able to calculate the scanning/reclaim ratio as a measure of the
amount of work being done by page reclaim.
Patches 2 and 3 account for the time spent in congestion_wait() and avoid
going to sleep on congestion when it is unnecessary. This is expected
to reduce stalls in situations where the system is under memory pressure
but not due to congestion.
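To illustrate the intended behavioural change, here is a toy model in Python
(not the kernel C implementation; `congested` is a hypothetical stand-in for
the real BDI congestion state):

```python
import time

def congestion_wait(timeout):
    """Old behaviour: sleep unconditionally for the full timeout,
    whether or not the backing device is actually congested."""
    time.sleep(timeout)
    return timeout

def wait_iff_congested(congested, timeout):
    """New behaviour: only sleep if the backing device is actually
    congested; otherwise return immediately with no time waited."""
    if not congested:
        return 0
    time.sleep(timeout)
    return timeout
```

Under memory pressure without IO congestion, the new call returns
immediately instead of stalling the reclaimer for the full timeout.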
Patches 4-8 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive and
thrashes the system somewhat. As SLUB uses high-order allocations, a large
cost incurred by lumpy reclaim will be noticeable. It was also reported
during transparent hugepage support testing that lumpy reclaim was thrashing
the system, and these patches should mitigate that problem without disabling
lumpy reclaim.
Patches 9-10 revisit avoiding filesystem writeback from direct reclaim.
This has been reported as a potential cause of stack overflow, but it can
also result in poor IO patterns that increase reclaim latencies.
There are patches similar to 9-10 already in mmotm but Andrew had concerns
about their impact. Hence, I revisited them as the last part of this series
for re-evaluation.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to
keep this report short. Four kernels were tested, each based on v2.6.36-rc3:
traceonly-v1r5: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
nocongest-v1r5: Patches 1-3 for testing wait_iff_congested
lowlumpy-v1r5: Patches 1-8 to test if lumpy reclaim is better
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
iozone
Smoke test performance, isn't putting the system under major stress
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
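As an aside, the access pattern of micro-mapped-file-stream can be sketched
in a few lines of Python. This is an illustrative stand-in for Johannes'
actual benchmark, not the real test; the sizes are placeholders:

```python
import mmap
import os
import tempfile

def walk_sparse_file(size=16 * 1024 * 1024, step=4096):
    """Create a sparse file and touch one byte per page through mmap(),
    forcing the kernel to fault in (and eventually reclaim) page cache.
    Returns the number of pages touched."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # sparse: no blocks allocated yet
        with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as mm:
            touched = 0
            for off in range(0, size, step):
                _ = mm[off]     # read fault on each page
                touched += 1
        return touched
    finally:
        os.close(fd)
        os.unlink(path)
```

With a file much larger than RAM, a walk like this keeps page reclaim busy
identifying use-once pages, which is the behaviour the benchmark probes.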
kernbench, iozone and sysbench did not report any performance regression
on any machine and as they did not put the machine under memory pressure
the main paths this series deals with were not exercised. sysbench will be
re-run in the future with read-write testing as it is sensitive to writeback
performance under memory pressure. It is an oversight that it didn't happen
for this test.
X86-64 micro-mapped-file-stream
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
pgalloc_dma 2631.00 ( 0.00%) 2483.00 ( -5.96%) 2375.00 ( -10.78%) 2467.00 ( -6.65%)
pgalloc_dma32 2840528.00 ( 0.00%) 2841510.00 ( 0.03%) 2841391.00 ( 0.03%) 2842308.00 ( 0.06%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 1383.00 ( 0.00%) 1182.00 ( -17.01%) 1177.00 ( -17.50%) 1181.00 ( -17.10%)
pgsteal_dma32 2237658.00 ( 0.00%) 2236581.00 ( -0.05%) 2219885.00 ( -0.80%) 2234527.00 ( -0.14%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 3006.00 ( 0.00%) 1400.00 (-114.71%) 1547.00 ( -94.31%) 1347.00 (-123.16%)
pgscan_kswapd_dma32 4206487.00 ( 0.00%) 3343082.00 ( -25.83%) 3425728.00 ( -22.79%) 3304369.00 ( -27.30%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629.00 ( 0.00%) 1793.00 ( 64.92%) 1643.00 ( 61.72%) 1868.00 ( 66.33%)
pgscan_direct_dma32 506741.00 ( 0.00%) 1402557.00 ( 63.87%) 1330777.00 ( 61.92%) 1448345.00 ( 65.01%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 15449.00 ( 0.00%) 15555.00 ( 0.68%) 15319.00 ( -0.85%) 15963.00 ( 3.22%)
allocstall 152.00 ( 0.00%) 941.00 ( 83.85%) 967.00 ( 84.28%) 729.00 ( 79.15%)
These are just the raw figures taken from /proc/vmstat. It's a rough measure
of reclaim activity. Note that the allocstall counts are higher because we
are entering direct reclaim more often as a result of not sleeping on
congestion. In itself, that's not necessarily a bad thing. It's easier to
get a view of what happened from the vmscan tracepoint report.
FTrace Reclaim Statistics: vmscan
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 152 941 967 729
Direct reclaim pages scanned 507377 1404350 1332420 1450213
Direct reclaim pages reclaimed 10968 72042 77186 41097
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 127195 241025 254825 188846
Kswapd wakeups 6 1 1 1
Kswapd pages scanned 4210101 3345122 3427915 3306356
Kswapd pages reclaimed 2228073 2165721 2143876 2194611
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 7.60 3.03 3.24 3.43
Time kswapd awake (seconds) 12.46 9.46 9.56 9.40
Total pages scanned 4717478 4749472 4760335 4756569
Total pages reclaimed 2239041 2237763 2221062 2235708
%age total pages scanned/reclaimed 47.46% 47.12% 46.66% 47.00%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 43.80% 21.38% 22.34% 23.46%
Percentage Time kswapd Awake 79.92% 79.56% 79.20% 80.48%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains roughly
the same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing more
work and in fact, direct reclaim and kswapd are awake for less time. Overall,
the series reduces reclaim work.
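The scanning/reclaim ratio quoted in the table is simply pages reclaimed as a
percentage of pages scanned. Recomputing it from the traceonly and nocongest
columns above as a sanity check:

```python
def scan_reclaim_ratio(scanned, reclaimed):
    """Percentage of scanned pages that were actually reclaimed;
    higher means less wasted scanning work per reclaimed page."""
    return 100.0 * reclaimed / scanned

# traceonly-v1r5: 4717478 total scanned, 2239041 total reclaimed -> ~47.46%
# nocongest-v1r5: 4749472 total scanned, 2237763 total reclaimed -> ~47.12%
traceonly = scan_reclaim_ratio(4717478, 2239041)
nocongest = scan_reclaim_ratio(4749472, 2237763)
```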
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 148 0 0 0
Direct time congest waited 8376ms 0ms 0ms 0ms
Direct full congest waited 127 0 0 0
Direct number conditional waited 0 711 693 627
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 127 0 0 0
KSwapd number congest waited 38 11 12 14
KSwapd time congest waited 3236ms 548ms 576ms 576ms
KSwapd full congest waited 31 3 3 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 31 3 3 2
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 9.75 11.14 11.26 11.19
Total Elapsed Time (seconds) 15.59 11.89 12.07 11.68
And overall, the tests complete significantly faster. Indicators are that
reclaim did less work and the test completed faster with fewer stalls. Seems
good.
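For reference, the percentage-time figures appear to be derived as: time
stalled in direct reclaim over (stall time + User/Sys test time), and
kswapd-awake time over total elapsed time. A quick check against the columns
above, assuming those derivations:

```python
def pct_direct_reclaim(stalled, user_sys):
    """Fraction of CPU-visible test time spent stalled in direct reclaim."""
    return 100.0 * stalled / (stalled + user_sys)

def pct_kswapd_awake(awake, elapsed):
    """Fraction of wall-clock test time that kswapd was awake."""
    return 100.0 * awake / elapsed

# traceonly-v1r5: 7.60s stalled, 9.75s User/Sys -> ~43.80%
direct = pct_direct_reclaim(7.60, 9.75)
# traceonly-v1r5: kswapd awake 12.46s of 15.59s elapsed -> ~79.92%
kswapd = pct_kswapd_awake(12.46, 15.59)
```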
PPC64 micro-mapped-file-stream
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
pgalloc_dma 3027144.00 ( 0.00%) 3025080.00 ( -0.07%) 3025463.00 ( -0.06%) 3026037.00 ( -0.04%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2399696.00 ( 0.00%) 2399540.00 ( -0.01%) 2399592.00 ( -0.00%) 2399570.00 ( -0.01%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 3690319.00 ( 0.00%) 2883661.00 ( -27.97%) 2852314.00 ( -29.38%) 3008323.00 ( -22.67%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 1224036.00 ( 0.00%) 1975664.00 ( 38.04%) 2012185.00 ( 39.17%) 1907869.00 ( 35.84%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 15170.00 ( 0.00%) 14636.00 ( -3.65%) 14664.00 ( -3.45%) 16027.00 ( 5.35%)
allocstall 712.00 ( 0.00%) 1906.00 ( 62.64%) 1912.00 ( 62.76%) 2027.00 ( 64.87%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 712 1906 1904 2021
Direct reclaim pages scanned 1224100 1975664 2010015 1906767
Direct reclaim pages reclaimed 79215 218292 202719 209388
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 1154724 805852 767944 848063
Kswapd wakeups 3 2 2 2
Kswapd pages scanned 3690799 2884173 2852026 3008835
Kswapd pages reclaimed 2320481 2181248 2195908 2189076
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 21.02 7.19 7.72 6.76
Time kswapd awake (seconds) 39.55 25.31 24.88 24.83
Total pages scanned 4914899 4859837 4862041 4915602
Total pages reclaimed 2399696 2399540 2398627 2398464
%age total pages scanned/reclaimed 48.82% 49.37% 49.33% 48.79%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 43.44% 19.64% 20.77% 18.43%
Percentage Time kswapd Awake 87.36% 81.94% 81.84% 81.28%
Again, a similar trend that the congestion_wait changes mean that direct reclaim
scans more pages but the overall number of pages scanned is very similar and
the ratio of scanning/reclaimed remains roughly similar. Once again, reclaim is
not doing more work, but spends less time in direct reclaim and with kswapd awake.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 499 0 0 0
Direct time congest waited 22700ms 0ms 0ms 0ms
Direct full congest waited 421 0 0 0
Direct number conditional waited 0 1214 1242 1290
Direct time conditional waited 0ms 4ms 0ms 0ms
Direct full conditional waited 421 0 0 0
KSwapd number congest waited 257 103 94 104
KSwapd time congest waited 22116ms 7344ms 7476ms 7528ms
KSwapd full congest waited 203 57 59 56
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 203 57 59 56
The vanilla kernel spent 22 seconds asleep in direct reclaim and no time at
all asleep with the patches, which is a big improvement. The time kswapd
spent congestion-waiting was also reduced by a large factor.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 27.37 29.42 29.45 29.91
Total Elapsed Time (seconds) 45.27 30.89 30.40 30.55
And the test again completed far faster.
X86-64 STRESS-HIGHALLOC
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Pass 1 84.00 ( 0.00%) 84.00 ( 0.00%) 80.00 (-4.00%) 72.00 (-12.00%)
Pass 2 94.00 ( 0.00%) 94.00 ( 0.00%) 89.00 (-5.00%) 88.00 (-6.00%)
At Rest 95.00 ( 0.00%) 95.00 ( 0.00%) 95.00 ( 0.00%) 92.00 (-3.00%)
Success figures start dropping off for lowlumpy and nodirect. This ordinarily
would be a concern but the rest of the report paints a better picture.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 838 1189 1323 1197
Direct reclaim pages scanned 182207 168696 146310 133117
Direct reclaim pages reclaimed 84208 81706 80442 54879
Direct reclaim write file async I/O 538 619 839 0
Direct reclaim write anon async I/O 36403 32892 44126 22085
Direct reclaim write file sync I/O 88 108 1 0
Direct reclaim write anon sync I/O 19107 15514 871 0
Wake kswapd requests 7761 827 865 6502
Kswapd wakeups 749 733 658 614
Kswapd pages scanned 6400676 6871918 6875056 3126591
Kswapd pages reclaimed 3122126 3376919 3001799 1669300
Kswapd reclaim write file async I/O 58199 67175 28483 925
Kswapd reclaim write anon async I/O 1740452 1851455 1680964 186578
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 3864.84 4426.77 3108.85 254.08
Time kswapd awake (seconds) 1792.00 2130.10 1890.76 343.37
Total pages scanned 6582883 7040614 7021366 3259708
Total pages reclaimed 3206334 3458625 3082241 1724179
%age total pages scanned/reclaimed 48.71% 49.12% 43.90% 52.89%
%age total pages scanned/written 28.18% 27.95% 25.00% 6.43%
%age file pages scanned/written 0.89% 0.96% 0.42% 0.03%
Percentage Time Spent Direct Reclaim 53.38% 56.75% 47.80% 8.44%
Percentage Time kswapd Awake 35.35% 37.88% 43.97% 23.01%
Scanned/reclaimed ratios again look good. The Scanned/written ratios look
very good for the nodirect patches showing that the writeback is happening
more in the flusher threads and less from direct reclaim. The expectation
is that the IO should be more efficient and indeed the time spent in direct
reclaim is massively reduced by the full series and kswapd spends a little
less time awake.
Overall, indications here are that things are moving much faster.
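The scanned/written ratios can be recomputed from the write counters in the
table: sum the direct and kswapd write counts (file + anon, sync + async) and
divide by total pages scanned. Checking the traceonly column, assuming total
pages scanned is also the denominator for the file-pages ratio:

```python
def pct_scanned_written(writes, scanned):
    """Pages written back during reclaim as a percentage of pages scanned;
    lower means less IO is being issued from the reclaim path."""
    return 100.0 * sum(writes) / scanned

# traceonly-v1r5: direct file/anon async, direct file/anon sync,
# kswapd file/anon async (kswapd sync writes were zero)
all_writes = [538, 36403, 88, 19107, 58199, 1740452]
file_writes = [538, 88, 58199]

total_pct = pct_scanned_written(all_writes, 6582883)   # ~28.18%
file_pct = pct_scanned_written(file_writes, 6582883)   # ~0.89%
```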
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1060 1 0 0
Direct time congest waited 63664ms 100ms 0ms 0ms
Direct full congest waited 617 1 0 0
Direct number conditional waited 0 1650 866 838
Direct time conditional waited 0ms 20296ms 1916ms 17652ms
Direct full conditional waited 617 1 0 0
KSwapd number congest waited 399 0 466 12
KSwapd time congest waited 33376ms 0ms 33048ms 968ms
KSwapd full congest waited 318 0 312 9
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 318 0 312 9
The sleep times for congest wait get interesting here. congestion_wait()
times are dropped to almost zero but wait_iff_congested() is detecting
when there is in fact congestion or too much writeback and still going to
sleep. Overall the times are reduced though - from 63ish seconds to about 20.
We are still backing off but less aggressively.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3375.95 3374.04 3395.56 2756.97
Total Elapsed Time (seconds) 5068.80 5623.06 4300.45 1492.09
Oddly, the nocongest patches took longer to complete the test, but the
full series reduces the test time by almost an hour, completing in about
one third of the time. I also looked at the latency figures when allocating
huge pages and got this
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-hydra-mean.ps
So it looks like the latencies in general are reduced. The full series
reduces latency by massive amounts but there is also a hint as to why
nocongest was slower overall. Its latencies were lower up to the point where
72% of memory was allocated with huge pages. After that, the latencies were
higher, but this problem is resolved later in the series.
PPC64 STRESS-HIGHALLOC
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Pass 1 27.00 ( 0.00%) 38.00 (11.00%) 31.00 ( 4.00%) 43.00 (16.00%)
Pass 2 41.00 ( 0.00%) 43.00 ( 2.00%) 33.00 (-8.00%) 55.00 (14.00%)
At Rest 84.00 ( 0.00%) 83.00 (-1.00%) 84.00 ( 0.00%) 85.00 ( 1.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 461 426 547 915
Direct reclaim pages scanned 193118 171811 143647 138334
Direct reclaim pages reclaimed 130100 108863 65954 63043
Direct reclaim write file async I/O 442 293 748 0
Direct reclaim write anon async I/O 52948 45149 29910 9949
Direct reclaim write file sync I/O 34 154 0 0
Direct reclaim write anon sync I/O 33128 27267 119 0
Wake kswapd requests 302 282 306 233
Kswapd wakeups 154 146 123 132
Kswapd pages scanned 13019861 12506267 3409775 3072689
Kswapd pages reclaimed 4839299 4782393 1908499 1723469
Kswapd reclaim write file async I/O 77348 77785 14580 214
Kswapd reclaim write anon async I/O 2878272 2840643 428083 142755
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 7692.01 7473.31 1044.76 217.31
Time kswapd awake (seconds) 7332.64 7171.23 1059.70 357.02
Total pages scanned 13212979 12678078 3553422 3211023
Total pages reclaimed 4969399 4891256 1974453 1786512
%age total pages scanned/reclaimed 37.61% 38.58% 55.56% 55.64%
%age total pages scanned/written 23.02% 23.59% 13.32% 4.76%
%age file pages scanned/written 0.59% 0.62% 0.43% 0.01%
Percentage Time Spent Direct Reclaim 42.66% 43.22% 26.30% 6.59%
Percentage Time kswapd Awake 82.06% 82.08% 45.82% 21.87%
Initially, it looks like the scanned/reclaimed ratios are much higher
and that's a bad thing. However, the number of pages scanned is reduced
by around 75% and the times spent in direct reclaim and with kswapd are
*massively* reduced. Overall the VM seems to be doing a lot less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 811 23 38 0
Direct time congest waited 40272ms 512ms 1496ms 0ms
Direct full congest waited 484 4 14 0
Direct number conditional waited 0 703 345 1281
Direct time conditional waited 0ms 22776ms 1312ms 10428ms
Direct full conditional waited 484 4 14 0
KSwapd number congest waited 1 0 6 6
KSwapd time congest waited 100ms 0ms 124ms 404ms
KSwapd full congest waited 1 0 1 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 1 0 1 2
Not as dramatic a story here but the time spent asleep is reduced and we can
still see that wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10340.18 9818.41 2927.13 3078.91
Total Elapsed Time (seconds) 8936.19 8736.59 2312.71 1632.74
The time to complete this test goes way down. Take the allocation success
rates - we are allocating 16% more memory as huge pages in less than a
fifth of the time and this is reflected in the allocation latency data
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-powyah-mean.ps
I recognise that this is a weighty series but the desktop latency and other
stall issues are a tricky topic. There are multiple root causes as to what
might be causing them but I believe this series kicks a number of them.
I think the congestion_wait changes will also impact Dave Chinner's fs-mark
test that showed up in the minute-long livelock report but I'm hoping the
filesystem people that were complaining about latencies in the VM could
test this series with their respective workloads.
.../trace/postprocess/trace-vmscan-postprocess.pl | 39 +++-
include/linux/backing-dev.h | 2 +-
include/trace/events/vmscan.h | 44 ++++-
include/trace/events/writeback.h | 35 +++
mm/backing-dev.c | 71 ++++++-
mm/page_alloc.c | 4 +-
mm/vmscan.c | 253 +++++++++++++++-----
7 files changed, 368 insertions(+), 80 deletions(-)
* [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
@ 2010-09-06 10:47 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
There have been numerous reports of stalls that pointed at the problem being
somewhere in the VM. There are multiple roots to the problems which means
dealing with any of the root problems in isolation is tricky to justify on
their own and they would still need integration testing. This patch series
gathers together three different patch sets which in combination should
tackle some of the root causes of latency problems being reported.
The first patch improves vmscan latency by tracking when pages get reclaimed
by shrink_inactive_list. For this series, the most important results is
being able to calculate the scanning/reclaim ratio as a measure of the
amount of work being done by page reclaim.
Patches 2 and 3 account for the time spent in congestion_wait() and avoids
calling going to sleep on congestion when it is unnecessary. This is expected
to reduce stalls in situations where the system is under memory pressure
but not due to congestion.
Patches 4-8 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive and
trashes the system somewhat. As SLUB uses high-order allocations, a large
cost incurred by lumpy reclaim will be noticeable. It was also reported
during transparent hugepage support testing that lumpy reclaim was trashing
the system and these patches should mitigate that problem without disabling
lumpy reclaim.
Patches 9-10 revisit avoiding filesystem writeback from direct reclaim. This has been
reported as being a potential cause of stack overflow but it can also result in poor IO
patterns increasing reclaim latencies.
There are patches similar to 9-10 already in mmotm but Andrew had concerns
about their impact. Hence, I revisisted them as the last part of this series
for re-evaluation.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. I'm just going to report for X86-64 and PPC64 in a vague attempt to
keep this report short. Four kernels were tested each based on v2.6.36-rc3
traceonly-v1r5: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
lowlumpy-v1r5: Patches 1-8 to test if lumpy reclaim is better
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
iozone
Smoke test performance, isn't putting the system under major stress
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine and as they did not put the machine under memory pressure
the main paths this series deals with were not exercised. sysbench will be
re-run in the future with read-write testing as it is sensitive to writeback
performance under memory pressure. It is an oversight that it didn't happen
for this test.
X86-64 micro-mapped-file-stream
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
pgalloc_dma 2631.00 ( 0.00%) 2483.00 ( -5.96%) 2375.00 ( -10.78%) 2467.00 ( -6.65%)
pgalloc_dma32 2840528.00 ( 0.00%) 2841510.00 ( 0.03%) 2841391.00 ( 0.03%) 2842308.00 ( 0.06%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 1383.00 ( 0.00%) 1182.00 ( -17.01%) 1177.00 ( -17.50%) 1181.00 ( -17.10%)
pgsteal_dma32 2237658.00 ( 0.00%) 2236581.00 ( -0.05%) 2219885.00 ( -0.80%) 2234527.00 ( -0.14%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 3006.00 ( 0.00%) 1400.00 (-114.71%) 1547.00 ( -94.31%) 1347.00 (-123.16%)
pgscan_kswapd_dma32 4206487.00 ( 0.00%) 3343082.00 ( -25.83%) 3425728.00 ( -22.79%) 3304369.00 ( -27.30%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629.00 ( 0.00%) 1793.00 ( 64.92%) 1643.00 ( 61.72%) 1868.00 ( 66.33%)
pgscan_direct_dma32 506741.00 ( 0.00%) 1402557.00 ( 63.87%) 1330777.00 ( 61.92%) 1448345.00 ( 65.01%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 15449.00 ( 0.00%) 15555.00 ( 0.68%) 15319.00 ( -0.85%) 15963.00 ( 3.22%)
allocstall 152.00 ( 0.00%) 941.00 ( 83.85%) 967.00 ( 84.28%) 729.00 ( 79.15%)
These are just the raw figures taken from /proc/vmstat. It's a rough measure
of reclaim activity. Note that allocstall counts are higher because we
are entering direct reclaim more often as a result of not sleeping in
congestion. In itself, it's not necessarily a bad thing. It's easier to
get a view of what happened from the vmscan tracepoint report.
FTrace Reclaim Statistics: vmscan
micro-traceonly-v1r5-micromicro-nocongest-v1r5-micromicro-lowlumpy-v1r5-micromicro-nodirect-v1r5-micro
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 152 941 967 729
Direct reclaim pages scanned 507377 1404350 1332420 1450213
Direct reclaim pages reclaimed 10968 72042 77186 41097
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 127195 241025 254825 188846
Kswapd wakeups 6 1 1 1
Kswapd pages scanned 4210101 3345122 3427915 3306356
Kswapd pages reclaimed 2228073 2165721 2143876 2194611
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 7.60 3.03 3.24 3.43
Time kswapd awake (seconds) 12.46 9.46 9.56 9.40
Total pages scanned 4717478 4749472 4760335 4756569
Total pages reclaimed 2239041 2237763 2221062 2235708
%age total pages scanned/reclaimed 47.46% 47.12% 46.66% 47.00%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 43.80% 21.38% 22.34% 23.46%
Percentage Time kswapd Awake 79.92% 79.56% 79.20% 80.48%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the same
and the ratio of pages scanned to pages reclaimed is more or less the same. In
other words, while we are sleeping less, reclaim is not doing more work and
in fact, direct reclaim and kswapd is awake for less time. Overall, the series
reduces reclaim work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 148 0 0 0
Direct time congest waited 8376ms 0ms 0ms 0ms
Direct full congest waited 127 0 0 0
Direct number conditional waited 0 711 693 627
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 127 0 0 0
KSwapd number congest waited 38 11 12 14
KSwapd time congest waited 3236ms 548ms 576ms 576ms
KSwapd full congest waited 31 3 3 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 31 3 3 2
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 9.75 11.14 11.26 11.19
Total Elapsed Time (seconds) 15.59 11.89 12.07 11.68
And overall, the tests complete significantly faster. Indicators are that
reclaim did less work and the test completed faster with fewer stalls. Seems
good.
PPC64 micro-mapped-file-stream
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
pgalloc_dma 3027144.00 ( 0.00%) 3025080.00 ( -0.07%) 3025463.00 ( -0.06%) 3026037.00 ( -0.04%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2399696.00 ( 0.00%) 2399540.00 ( -0.01%) 2399592.00 ( -0.00%) 2399570.00 ( -0.01%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 3690319.00 ( 0.00%) 2883661.00 ( -27.97%) 2852314.00 ( -29.38%) 3008323.00 ( -22.67%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 1224036.00 ( 0.00%) 1975664.00 ( 38.04%) 2012185.00 ( 39.17%) 1907869.00 ( 35.84%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 15170.00 ( 0.00%) 14636.00 ( -3.65%) 14664.00 ( -3.45%) 16027.00 ( 5.35%)
allocstall 712.00 ( 0.00%) 1906.00 ( 62.64%) 1912.00 ( 62.76%) 2027.00 ( 64.87%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
micro-traceonly-v1r5-micromicro-nocongest-v1r5-micromicro-lowlumpy-v1r5-micromicro-nodirect-v1r5-micro
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 712 1906 1904 2021
Direct reclaim pages scanned 1224100 1975664 2010015 1906767
Direct reclaim pages reclaimed 79215 218292 202719 209388
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 1154724 805852 767944 848063
Kswapd wakeups 3 2 2 2
Kswapd pages scanned 3690799 2884173 2852026 3008835
Kswapd pages reclaimed 2320481 2181248 2195908 2189076
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 21.02 7.19 7.72 6.76
Time kswapd awake (seconds) 39.55 25.31 24.88 24.83
Total pages scanned 4914899 4859837 4862041 4915602
Total pages reclaimed 2399696 2399540 2398627 2398464
%age total pages scanned/reclaimed 48.82% 49.37% 49.33% 48.79%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 43.44% 19.64% 20.77% 18.43%
Percentage Time kswapd Awake 87.36% 81.94% 81.84% 81.28%
Again we see the trend where the congestion_wait changes mean that direct
reclaim scans more pages, but the overall number of pages scanned remains very
similar and the scanning/reclaim ratio is more or less the same. In other
words, while we are sleeping less, reclaim is not doing more work; in fact,
direct reclaim stalls for less time and kswapd is awake for less time.
Overall, the series reduces reclaim work.
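As a quick sanity check, the scanned/reclaimed percentages in the table can be
reproduced from the totals (the figures below are the traceonly-v1r5 and
nodirect-v1r5 columns; this is verification arithmetic only, not part of the
series):

```python
# Reproduce the "%age total pages scanned/reclaimed" rows: pages
# reclaimed expressed as a percentage of pages scanned.
def scan_reclaim_ratio(scanned, reclaimed):
    return 100.0 * reclaimed / scanned

# traceonly-v1r5: 4914899 scanned, 2399696 reclaimed
print("%.2f%%" % scan_reclaim_ratio(4914899, 2399696))  # 48.82%
# nodirect-v1r5: 4915602 scanned, 2398464 reclaimed
print("%.2f%%" % scan_reclaim_ratio(4915602, 2398464))  # 48.79%
```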
FTrace Reclaim Statistics: congestion_wait
                              traceonly-v1r5 nocongest-v1r5  lowlumpy-v1r5  nodirect-v1r5
Direct number congest waited 499 0 0 0
Direct time congest waited 22700ms 0ms 0ms 0ms
Direct full congest waited 421 0 0 0
Direct number conditional waited 0 1214 1242 1290
Direct time conditional waited 0ms 4ms 0ms 0ms
Direct full conditional waited 421 0 0 0
KSwapd number congest waited 257 103 94 104
KSwapd time congest waited 22116ms 7344ms 7476ms 7528ms
KSwapd full congest waited 203 57 59 56
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 203 57 59 56
The vanilla kernel spent 22 seconds asleep in direct reclaim while the patched
kernels spent no time at all asleep, which is a big improvement. The time
kswapd spent congestion-waiting was also reduced by a large factor.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 27.37 29.42 29.45 29.91
Total Elapsed Time (seconds) 45.27 30.89 30.40 30.55
And the test again completed far faster.
X86-64 STRESS-HIGHALLOC
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Pass 1 84.00 ( 0.00%) 84.00 ( 0.00%) 80.00 (-4.00%) 72.00 (-12.00%)
Pass 2 94.00 ( 0.00%) 94.00 ( 0.00%) 89.00 (-5.00%) 88.00 (-6.00%)
At Rest 95.00 ( 0.00%) 95.00 ( 0.00%) 95.00 ( 0.00%) 92.00 (-3.00%)
Success figures start dropping off for lowlumpy and nodirect. This ordinarily
would be a concern but the rest of the report paints a better picture.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 838 1189 1323 1197
Direct reclaim pages scanned 182207 168696 146310 133117
Direct reclaim pages reclaimed 84208 81706 80442 54879
Direct reclaim write file async I/O 538 619 839 0
Direct reclaim write anon async I/O 36403 32892 44126 22085
Direct reclaim write file sync I/O 88 108 1 0
Direct reclaim write anon sync I/O 19107 15514 871 0
Wake kswapd requests 7761 827 865 6502
Kswapd wakeups 749 733 658 614
Kswapd pages scanned 6400676 6871918 6875056 3126591
Kswapd pages reclaimed 3122126 3376919 3001799 1669300
Kswapd reclaim write file async I/O 58199 67175 28483 925
Kswapd reclaim write anon async I/O 1740452 1851455 1680964 186578
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 3864.84 4426.77 3108.85 254.08
Time kswapd awake (seconds) 1792.00 2130.10 1890.76 343.37
Total pages scanned 6582883 7040614 7021366 3259708
Total pages reclaimed 3206334 3458625 3082241 1724179
%age total pages scanned/reclaimed 48.71% 49.12% 43.90% 52.89%
%age total pages scanned/written 28.18% 27.95% 25.00% 6.43%
%age file pages scanned/written 0.89% 0.96% 0.42% 0.03%
Percentage Time Spent Direct Reclaim 53.38% 56.75% 47.80% 8.44%
Percentage Time kswapd Awake 35.35% 37.88% 43.97% 23.01%
Scanned/reclaimed ratios again look good. The Scanned/written ratios look
very good for the nodirect patches showing that the writeback is happening
more in the flusher threads and less from direct reclaim. The expectation
is that the IO should be more efficient and indeed the time spent in direct
reclaim is massively reduced by the full series and kswapd spends a little
less time awake.
Overall, indications here are that things are moving much faster.
FTrace Reclaim Statistics: congestion_wait
                              traceonly-v1r5 nocongest-v1r5  lowlumpy-v1r5  nodirect-v1r5
Direct number congest waited 1060 1 0 0
Direct time congest waited 63664ms 100ms 0ms 0ms
Direct full congest waited 617 1 0 0
Direct number conditional waited 0 1650 866 838
Direct time conditional waited 0ms 20296ms 1916ms 17652ms
Direct full conditional waited 617 1 0 0
KSwapd number congest waited 399 0 466 12
KSwapd time congest waited 33376ms 0ms 33048ms 968ms
KSwapd full congest waited 318 0 312 9
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 318 0 312 9
The sleep times for congest wait get interesting here. congestion_wait()
times are dropped to almost zero but wait_iff_congested() is detecting
when there is in fact congestion or too much writeback and still going to
sleep. Overall the times are reduced though - from 63ish seconds to about 20.
We are still backing off but less aggressively.
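To illustrate the behaviour being measured, here is a rough Python model of
the difference between the two sleep primitives (a hypothetical sketch only;
the real congestion_wait() and wait_iff_congested() live in mm/backing-dev.c
and deal with BDIs and jiffies rather than plain booleans and milliseconds):

```python
def congestion_wait(timeout_ms):
    """Old behaviour: sleep for the full timeout regardless of
    whether the backing device is actually congested."""
    return timeout_ms  # time spent asleep

def wait_iff_congested(congested, timeout_ms):
    """Patched behaviour (rough model): only sleep when congestion
    (or too much writeback) is really detected; otherwise return
    immediately so reclaim can make progress."""
    return timeout_ms if congested else 0

# Under memory pressure without IO congestion, the old primitive
# stalls for the full timeout while the new one does not sleep.
print(congestion_wait(100))            # 100
print(wait_iff_congested(False, 100))  # 0
print(wait_iff_congested(True, 100))   # 100
```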
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3375.95 3374.04 3395.56 2756.97
Total Elapsed Time (seconds) 5068.80 5623.06 4300.45 1492.09
Oddly, the nocongest patches took longer to complete the test, but the full
series reduces the test time by almost an hour, completing in under a third
of the time. I also looked at the latency figures when allocating huge pages
and got this
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-hydra-mean.ps
So it looks like the latencies in general are reduced. The full series
reduces latency by massive amounts, but there is also a hint as to why
nocongest was slower overall. Its latencies were lower up to the point where
72% of memory was allocated with huge pages; after that, the latencies were
higher, but this problem is resolved later in the series.
PPC64 STRESS-HIGHALLOC
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Pass 1 27.00 ( 0.00%) 38.00 (11.00%) 31.00 ( 4.00%) 43.00 (16.00%)
Pass 2 41.00 ( 0.00%) 43.00 ( 2.00%) 33.00 (-8.00%) 55.00 (14.00%)
At Rest 84.00 ( 0.00%) 83.00 (-1.00%) 84.00 ( 0.00%) 85.00 ( 1.00%)
Success rates there are *way* up, particularly considering that the 16MB
huge pages on PPC64 are always much harder to allocate.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
Direct reclaims 461 426 547 915
Direct reclaim pages scanned 193118 171811 143647 138334
Direct reclaim pages reclaimed 130100 108863 65954 63043
Direct reclaim write file async I/O 442 293 748 0
Direct reclaim write anon async I/O 52948 45149 29910 9949
Direct reclaim write file sync I/O 34 154 0 0
Direct reclaim write anon sync I/O 33128 27267 119 0
Wake kswapd requests 302 282 306 233
Kswapd wakeups 154 146 123 132
Kswapd pages scanned 13019861 12506267 3409775 3072689
Kswapd pages reclaimed 4839299 4782393 1908499 1723469
Kswapd reclaim write file async I/O 77348 77785 14580 214
Kswapd reclaim write anon async I/O 2878272 2840643 428083 142755
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 7692.01 7473.31 1044.76 217.31
Time kswapd awake (seconds) 7332.64 7171.23 1059.70 357.02
Total pages scanned 13212979 12678078 3553422 3211023
Total pages reclaimed 4969399 4891256 1974453 1786512
%age total pages scanned/reclaimed 37.61% 38.58% 55.56% 55.64%
%age total pages scanned/written 23.02% 23.59% 13.32% 4.76%
%age file pages scanned/written 0.59% 0.62% 0.43% 0.01%
Percentage Time Spent Direct Reclaim 42.66% 43.22% 26.30% 6.59%
Percentage Time kswapd Awake 82.06% 82.08% 45.82% 21.87%
Initially, it looks like the scanned/reclaimed ratios are much higher
and that's a bad thing. However, the number of pages scanned is reduced
by around 75% and the times spent in direct reclaim and with kswapd are
*massively* reduced. Overall the VM seems to be doing a lot less work.
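The "around 75%" claim can be checked against the totals in the table
(traceonly-v1r5 versus nodirect-v1r5; verification arithmetic only):

```python
def reduction_pct(before, after):
    return 100.0 * (before - after) / before

# Total pages scanned on PPC64: 13212979 (traceonly) vs 3211023 (nodirect)
print("%.1f%%" % reduction_pct(13212979, 3211023))  # 75.7%
```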
FTrace Reclaim Statistics: congestion_wait
                              traceonly-v1r5 nocongest-v1r5  lowlumpy-v1r5  nodirect-v1r5
Direct number congest waited 811 23 38 0
Direct time congest waited 40272ms 512ms 1496ms 0ms
Direct full congest waited 484 4 14 0
Direct number conditional waited 0 703 345 1281
Direct time conditional waited 0ms 22776ms 1312ms 10428ms
Direct full conditional waited 484 4 14 0
KSwapd number congest waited 1 0 6 6
KSwapd time congest waited 100ms 0ms 124ms 404ms
KSwapd full congest waited 1 0 1 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 1 0 1 2
Not as dramatic a story here, but the time spent asleep is reduced and we can
still see that wait_iff_congested() goes to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10340.18 9818.41 2927.13 3078.91
Total Elapsed Time (seconds) 8936.19 8736.59 2312.71 1632.74
The time to complete this test goes way down. Take the allocation success
rates - we are allocating 16% more memory as huge pages in less than a
fifth of the time and this is reflected in the allocation latency data
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20100609/highalloc-interlatency-powyah-mean.ps
I recognise that this is a weighty series but the desktop latency and other
stall issues are a tricky topic. There are multiple root causes as to what
might be causing them, but I believe this series addresses a number of them.
I think the congestion_wait changes will also impact Dave Chinner's fs-mark
test that showed up in the minute-long livelock report, but I'm hoping the
filesystem people who were complaining about latencies in the VM can test
this series with their respective workloads.
.../trace/postprocess/trace-vmscan-postprocess.pl | 39 +++-
include/linux/backing-dev.h | 2 +-
include/trace/events/vmscan.h | 44 ++++-
include/trace/events/writeback.h | 35 +++
mm/backing-dev.c | 71 ++++++-
mm/page_alloc.c | 4 +-
mm/vmscan.c | 253 +++++++++++++++-----
7 files changed, 368 insertions(+), 80 deletions(-)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* [PATCH 01/10] tracing, vmscan: Add trace events for LRU list shrinking
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
This patch adds a trace event for shrink_inactive_list(). It can be used
to determine how many pages were reclaimed and, for non-lumpy reclaim,
exactly where the pages were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
.../trace/postprocess/trace-vmscan-postprocess.pl | 39 +++++++++++++-----
include/trace/events/vmscan.h | 42 ++++++++++++++++++++
mm/vmscan.c | 6 +++
3 files changed, 77 insertions(+), 10 deletions(-)
diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 1b55146..b3e73dd 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -46,7 +46,7 @@ use constant HIGH_KSWAPD_LATENCY => 20;
use constant HIGH_KSWAPD_REWAKEUP => 21;
use constant HIGH_NR_SCANNED => 22;
use constant HIGH_NR_TAKEN => 23;
-use constant HIGH_NR_RECLAIM => 24;
+use constant HIGH_NR_RECLAIMED => 24;
use constant HIGH_NR_CONTIG_DIRTY => 25;
my %perprocesspid;
@@ -58,11 +58,13 @@ my $opt_read_procstat;
my $total_wakeup_kswapd;
my ($total_direct_reclaim, $total_direct_nr_scanned);
my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_nr_reclaimed);
my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async);
my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async);
my ($total_kswapd_nr_scanned, $total_kswapd_wake);
my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async);
my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async);
+my ($total_kswapd_nr_reclaimed);
# Catch sigint and exit on request
my $sigint_report = 0;
@@ -104,7 +106,7 @@ my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
-my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
@@ -203,8 +205,8 @@ $regex_lru_shrink_inactive = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_shrink_inactive",
$regex_lru_shrink_inactive_default,
"nid", "zid",
- "lru",
- "nr_scanned", "nr_reclaimed", "priority");
+ "nr_scanned", "nr_reclaimed", "priority",
+ "flags");
$regex_lru_shrink_active = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_shrink_active",
$regex_lru_shrink_active_default,
@@ -375,6 +377,16 @@ EVENT_PROCESS:
my $nr_contig_dirty = $7;
$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+ } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") {
+ $details = $5;
+ if ($details !~ /$regex_lru_shrink_inactive/o) {
+ print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n";
+ print " $details\n";
+ print " $regex_lru_shrink_inactive/o\n";
+ next;
+ }
+ my $nr_reclaimed = $4;
+ $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed;
} elsif ($tracepoint eq "mm_vmscan_writepage") {
$details = $5;
if ($details !~ /$regex_writepage/o) {
@@ -464,8 +476,8 @@ sub dump_stats {
# Print out process activity
printf("\n");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Time");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Sync-IO", "ASync-IO", "Stalled");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled");
foreach $process_pid (keys %stats) {
if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
@@ -475,6 +487,7 @@ sub dump_stats {
$total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
$total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
$total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+ $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
$total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
$total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
$total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
@@ -489,11 +502,12 @@ sub dump_stats {
$index++;
}
- printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8.3f",
+ printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f",
$process_pid,
$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
$stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
$stats{$process_pid}->{HIGH_NR_SCANNED},
+ $stats{$process_pid}->{HIGH_NR_RECLAIMED},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC},
$this_reclaim_delay / 1000);
@@ -529,8 +543,8 @@ sub dump_stats {
# Print out kswapd activity
printf("\n");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages");
- printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages");
+ printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO");
foreach $process_pid (keys %stats) {
if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
@@ -539,16 +553,18 @@ sub dump_stats {
$total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
$total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+ $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED};
$total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
$total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
$total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
$total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC};
- printf("%-" . $max_strlen . "s %8d %10d %8u %8i %8u",
+ printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u",
$process_pid,
$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
$stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
$stats{$process_pid}->{HIGH_NR_SCANNED},
+ $stats{$process_pid}->{HIGH_NR_RECLAIMED},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC},
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC});
@@ -579,6 +595,7 @@ sub dump_stats {
print "\nSummary\n";
print "Direct reclaims: $total_direct_reclaim\n";
print "Direct reclaim pages scanned: $total_direct_nr_scanned\n";
+ print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n";
print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n";
print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n";
print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n";
@@ -588,6 +605,7 @@ sub dump_stats {
print "\n";
print "Kswapd wakeups: $total_kswapd_wake\n";
print "Kswapd pages scanned: $total_kswapd_nr_scanned\n";
+ print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n";
print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n";
print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n";
print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n";
@@ -612,6 +630,7 @@ sub aggregate_perprocesspid() {
$perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
$perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
$perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+ $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC};
$perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC};
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 370aa5a..14c1586 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -10,6 +10,7 @@
#define RECLAIM_WB_ANON 0x0001u
#define RECLAIM_WB_FILE 0x0002u
+#define RECLAIM_WB_MIXED 0x0010u
#define RECLAIM_WB_SYNC 0x0004u
#define RECLAIM_WB_ASYNC 0x0008u
@@ -17,6 +18,7 @@
(flags) ? __print_flags(flags, "|", \
{RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \
{RECLAIM_WB_FILE, "RECLAIM_WB_FILE"}, \
+ {RECLAIM_WB_MIXED, "RECLAIM_WB_MIXED"}, \
{RECLAIM_WB_SYNC, "RECLAIM_WB_SYNC"}, \
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
) : "RECLAIM_WB_NONE"
@@ -26,6 +28,12 @@
(sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
)
+#define trace_shrink_flags(file, sync) ( \
+ (sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+ (file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) | \
+ (sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+ )
+
TRACE_EVENT(mm_vmscan_kswapd_sleep,
TP_PROTO(int nid),
@@ -269,6 +277,40 @@ TRACE_EVENT(mm_vmscan_writepage,
show_reclaim_flags(__entry->reclaim_flags))
);
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+
+ TP_PROTO(int nid, int zid,
+ unsigned long nr_scanned, unsigned long nr_reclaimed,
+ int priority, int reclaim_flags),
+
+ TP_ARGS(nid, zid, nr_scanned, nr_reclaimed, priority, reclaim_flags),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(int, zid)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_reclaimed)
+ __field(int, priority)
+ __field(int, reclaim_flags)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->zid = zid;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_reclaimed = nr_reclaimed;
+ __entry->priority = priority;
+ __entry->reclaim_flags = reclaim_flags;
+ ),
+
+ TP_printk("nid=%d zid=%d nr_scanned=%ld nr_reclaimed=%ld priority=%d flags=%s",
+ __entry->nid, __entry->zid,
+ __entry->nr_scanned, __entry->nr_reclaimed,
+ __entry->priority,
+ show_reclaim_flags(__entry->reclaim_flags))
+);
+
+
#endif /* _TRACE_VMSCAN_H */
/* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..652650f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1359,6 +1359,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+
+ trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
+ zone_idx(zone),
+ nr_scanned, nr_reclaimed,
+ priority,
+ trace_shrink_flags(file, sc->lumpy_reclaim_mode));
return nr_reclaimed;
}
--
1.7.1
__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+
+ trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
+ zone_idx(zone),
+ nr_scanned, nr_reclaimed,
+ priority,
+ trace_shrink_flags(file, sc->lumpy_reclaim_mode));
return nr_reclaimed;
}
--
1.7.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 02/10] writeback: Account for time spent congestion_waited
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
There is strong evidence that a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a tracepoint
for congestion_wait() that records when it was called, how long the timeout
was and how long the caller actually slept.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/trace/events/writeback.h | 28 ++++++++++++++++++++++++++++
mm/backing-dev.c | 5 +++++
2 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..275d477 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,34 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+DECLARE_EVENT_CLASS(writeback_congest_waited_template,
+
+ TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+ TP_ARGS(usec_timeout, usec_delayed),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, usec_timeout )
+ __field( unsigned int, usec_delayed )
+ ),
+
+ TP_fast_assign(
+ __entry->usec_timeout = usec_timeout;
+ __entry->usec_delayed = usec_delayed;
+ ),
+
+ TP_printk("usec_timeout=%u usec_delayed=%u",
+ __entry->usec_timeout,
+ __entry->usec_delayed)
+);
+
+DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
+
+ TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+ TP_ARGS(usec_timeout, usec_delayed)
+);
+
#endif /* _TRACE_WRITEBACK_H */
/* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eaa4a5b..298975a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -759,12 +759,17 @@ EXPORT_SYMBOL(set_bdi_congested);
long congestion_wait(int sync, long timeout)
{
long ret;
+ unsigned long start = jiffies;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
+
+ trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
+ jiffies_to_usecs(jiffies - start));
+
return ret;
}
EXPORT_SYMBOL(congestion_wait);
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 03/10] writeback: Do not sleep on congestion if there are no congested BDIs or significant writeback
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
If congestion_wait() is called with no BDIs congested, the caller will sleep
for the full timeout and this may be an unnecessary sleep. This patch adds
a wait_iff_congested() that checks congestion and only sleeps if a BDI is
congested or if there is a significant amount of writeback going on in an
interesting zone. Otherwise, it calls cond_resched() to ensure the caller
is not hogging the CPU longer than its quota, but it does not sleep.
This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but after
this patch, it'll just call schedule() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/backing-dev.h | 2 +-
include/trace/events/writeback.h | 7 ++++
mm/backing-dev.c | 66 ++++++++++++++++++++++++++++++++++++-
mm/page_alloc.c | 4 +-
mm/vmscan.c | 26 ++++++++++++--
5 files changed, 96 insertions(+), 9 deletions(-)
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 35b0074..f1b402a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -285,7 +285,7 @@ enum {
void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
void set_bdi_congested(struct backing_dev_info *bdi, int sync);
long congestion_wait(int sync, long timeout);
-
+long wait_iff_congested(struct zone *zone, int sync, long timeout);
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
{
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 275d477..eeaf1f5 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
TP_ARGS(usec_timeout, usec_delayed)
);
+DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
+
+ TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
+
+ TP_ARGS(usec_timeout, usec_delayed)
+);
+
#endif /* _TRACE_WRITEBACK_H */
/* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 298975a..94b5433 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+static atomic_t nr_bdi_congested[2];
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
wait_queue_head_t *wqh = &congestion_wqh[sync];
bit = sync ? BDI_sync_congested : BDI_async_congested;
- clear_bit(bit, &bdi->state);
+ if (test_and_clear_bit(bit, &bdi->state))
+ atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_clear_bit();
if (waitqueue_active(wqh))
wake_up(wqh);
@@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
enum bdi_state bit;
bit = sync ? BDI_sync_congested : BDI_async_congested;
- set_bit(bit, &bdi->state);
+ if (!test_and_set_bit(bit, &bdi->state))
+ atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
@@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
}
EXPORT_SYMBOL(congestion_wait);
+/**
+ * wait_iff_congested - conditionally wait for a backing_dev to become uncongested
+ * @zone: A zone to consider the number of pages being written back from
+ * @sync: SYNC or ASYNC IO
+ * @timeout: timeout in jiffies
+ *
+ * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
+ * write congestion. If no backing_devs are congested then the number of
+ * writeback pages in the zone are checked and compared to the inactive
+ * list. If there is no significant writeback or congestion, there is no point
+ * in sleeping but cond_resched() is called in case the current process has
+ * consumed its CPU quota.
+ */
+long wait_iff_congested(struct zone *zone, int sync, long timeout)
+{
+ long ret;
+ unsigned long start = jiffies;
+ DEFINE_WAIT(wait);
+ wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+ /*
+ * If there is no congestion, check the amount of writeback. If there
+ * is no significant writeback and no congestion, just cond_resched
+ */
+ if (atomic_read(&nr_bdi_congested[sync]) == 0) {
+ unsigned long inactive, writeback;
+
+ inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_ANON);
+ writeback = zone_page_state(zone, NR_WRITEBACK);
+
+ /*
+ * If less than half the inactive list is being written back,
+ * reclaim might as well continue
+ */
+ if (writeback < inactive / 2) {
+ cond_resched();
+
+ /* In case we scheduled, work out time remaining */
+ ret = timeout - (jiffies - start);
+ if (ret < 0)
+ ret = 0;
+
+ goto out;
+ }
+ }
+
+ /* Sleep until uncongested or a write happens */
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = io_schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+
+out:
+ trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
+ jiffies_to_usecs(jiffies - start));
+
+ return ret;
+}
+EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9649f4..641900a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1893,7 +1893,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
return page;
@@ -2081,7 +2081,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
goto rebalance;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 652650f..eabe987 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,7 +1341,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
/*
* The attempt at page out may have made some
@@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
sc->may_writepage = 1;
}
- /* Take a nap, wait for some writeback to complete */
+ /* Take a nap if congested, wait for some writeback */
if (!sc->hibernation_mode && sc->nr_scanned &&
- priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ priority < DEF_PRIORITY - 2) {
+ struct zone *active_zone = NULL;
+ unsigned long max_writeback = 0;
+ for_each_zone_zonelist(zone, z, zonelist,
+ gfp_zone(sc->gfp_mask)) {
+ unsigned long writeback;
+
+ /* Initialise for first zone */
+ if (active_zone == NULL)
+ active_zone = zone;
+
+ writeback = zone_page_state(zone, NR_WRITEBACK);
+ if (writeback > max_writeback) {
+ max_writeback = writeback;
+ active_zone = zone;
+ }
+ }
+
+ wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
+ }
}
out:
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
congestion_wait() means "wait until queue congestion is cleared". However,
synchronous lumpy reclaim does not need this congestion_wait() because
shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback(), which
provides the necessary waiting.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eabe987..5979850 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,8 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
-
/*
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 05/10] vmscan: Synchronous lumpy reclaim should use lock_page() instead of trylock_page()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
With synchronous lumpy reclaim, there is no reason to give up on reclaiming
a page just because it is locked. This patch uses lock_page() instead of
trylock_page() in this case.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5979850..79bd812 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
page = lru_to_page(page_list);
list_del(&page->lru);
- if (!trylock_page(page))
+ if (sync_writeback == PAGEOUT_IO_SYNC)
+ lock_page(page);
+ else if (!trylock_page(page))
goto keep;
VM_BUG_ON(PageActive(page));
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 06/10] vmscan: Narrow the scenarios lumpy reclaim uses synchronous reclaim
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
1. trylock_page() failure
2. page is unevictable
3. zone reclaim and page is mapped
4. PageWriteback() is true
5. page is swapbacked and swap is full
6. add_to_swap() failure
7. page is dirty and the gfp_mask does not allow GFP_IO or GFP_FS
8. page is pinned
9. IO queue is congested
10. pageout() started IO, but it has not finished
During lumpy reclaim, any one of these failures results in entering
synchronous lumpy reclaim, but this can be unnecessary. In cases (2), (3),
(5), (6), (7) and (8), there is no point retrying. This patch causes lumpy
reclaim to abort when it is known that it will fail.
Case (9) is more interesting. The current behavior is:
1. start shrink_page_list(async)
2. found queue_congested()
3. skip pageout write
4. still start shrink_page_list(sync)
5. wait on a lot of pages
6. again, found queue_congested()
7. give up pageout write again
So it is a meaningless waste of time. However, simply skipping the page write
is not good either: allocating a huge page on x86, for example, needs 512
pages, which can be more dirty pages than the queue congestion threshold
(~=128).
After this patch, pageout() behaves as follows:
- If order > PAGE_ALLOC_COSTLY_ORDER
Always ignore queue congestion.
- If order <= PAGE_ALLOC_COSTLY_ORDER
Skip the page write and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/vmscan.h | 6 +-
mm/vmscan.c | 122 +++++++++++++++++++++++++---------------
2 files changed, 79 insertions(+), 49 deletions(-)
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 14c1586..6f07c44 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -25,13 +25,13 @@
#define trace_reclaim_flags(page, sync) ( \
(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+ (sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
)
#define trace_shrink_flags(file, sync) ( \
- (sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_MIXED : \
+ (sync == LUMPY_MODE_SYNC ? RECLAIM_WB_MIXED : \
(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON)) | \
- (sync == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
+ (sync == LUMPY_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC) \
)
TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79bd812..21d1153 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>
+enum lumpy_mode {
+ LUMPY_MODE_NONE,
+ LUMPY_MODE_ASYNC,
+ LUMPY_MODE_SYNC,
+};
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -82,7 +88,7 @@ struct scan_control {
* Intend to reclaim enough contiguous memory rather than to reclaim
* enough amount of memory. I.e., it is the mode for high-order allocation.
*/
- bool lumpy_reclaim_mode;
+ enum lumpy_mode lumpy_reclaim_mode;
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
@@ -265,6 +271,36 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
return ret;
}
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+ bool sync)
+{
+ enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+
+ /*
+ * Some reclaim attempts have already failed. It is not worth trying
+ * synchronous lumpy reclaim.
+ */
+ if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+ return;
+
+ /*
+ * If we need a large contiguous chunk of memory, or have
+ * trouble getting a small set of contiguous pages, we
+ * will reclaim both active and inactive pages.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ sc->lumpy_reclaim_mode = mode;
+ else if (sc->order && priority < DEF_PRIORITY - 2)
+ sc->lumpy_reclaim_mode = mode;
+ else
+ sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
+static void disable_lumpy_reclaim_mode(struct scan_control *sc)
+{
+ sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
static inline int is_page_cache_freeable(struct page *page)
{
/*
@@ -275,7 +311,8 @@ static inline int is_page_cache_freeable(struct page *page)
return page_count(page) - page_has_private(page) == 2;
}
-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi,
+ struct scan_control *sc)
{
if (current->flags & PF_SWAPWRITE)
return 1;
@@ -283,6 +320,10 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
return 1;
if (bdi == current->backing_dev_info)
return 1;
+
+ /* lumpy reclaim for hugepage often needs a lot of writes */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ return 1;
return 0;
}
@@ -307,12 +348,6 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}
-/* Request for sync pageout. */
-enum pageout_io {
- PAGEOUT_IO_ASYNC,
- PAGEOUT_IO_SYNC,
-};
-
/* possible outcome of pageout() */
typedef enum {
/* failed to write page out, page is locked */
@@ -330,7 +365,7 @@ typedef enum {
* Calls ->writepage().
*/
static pageout_t pageout(struct page *page, struct address_space *mapping,
- enum pageout_io sync_writeback)
+ struct scan_control *sc)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -366,8 +401,10 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
if (mapping->a_ops->writepage == NULL)
return PAGE_ACTIVATE;
- if (!may_write_to_queue(mapping->backing_dev_info))
+ if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
+ disable_lumpy_reclaim_mode(sc);
return PAGE_KEEP;
+ }
if (clear_page_dirty_for_io(page)) {
int res;
@@ -394,7 +431,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
*/
- if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+ if (PageWriteback(page) &&
+ sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
wait_on_page_writeback(page);
if (!PageWriteback(page)) {
@@ -402,7 +440,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
ClearPageReclaim(page);
}
trace_mm_vmscan_writepage(page,
- trace_reclaim_flags(page, sync_writeback));
+ trace_reclaim_flags(page, sc->lumpy_reclaim_mode));
inc_zone_page_state(page, NR_VMSCAN_WRITE);
return PAGE_SUCCESS;
}
@@ -580,7 +618,7 @@ static enum page_references page_check_references(struct page *page,
referenced_page = TestClearPageReferenced(page);
/* Lumpy reclaim - ignore references */
- if (sc->lumpy_reclaim_mode)
+ if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
return PAGEREF_RECLAIM;
/*
@@ -644,8 +682,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
- struct scan_control *sc,
- enum pageout_io sync_writeback)
+ struct scan_control *sc)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -665,7 +702,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
page = lru_to_page(page_list);
list_del(&page->lru);
- if (sync_writeback == PAGEOUT_IO_SYNC)
+ if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
lock_page(page);
else if (!trylock_page(page))
goto keep;
@@ -696,10 +733,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* for any page for which writeback has already
* started.
*/
- if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+ if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
+ may_enter_fs)
wait_on_page_writeback(page);
- else
- goto keep_locked;
+ else {
+ unlock_page(page);
+ goto keep_lumpy;
+ }
}
references = page_check_references(page, sc);
@@ -753,14 +793,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
/* Page is dirty, try to write it out here */
- switch (pageout(page, mapping, sync_writeback)) {
+ switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
- if (PageWriteback(page) || PageDirty(page))
+ if (PageWriteback(page))
+ goto keep_lumpy;
+ if (PageDirty(page))
goto keep;
+
/*
* A synchronous write - probably a ramdisk. Go
* ahead and try to reclaim the page.
@@ -843,6 +886,7 @@ cull_mlocked:
try_to_free_swap(page);
unlock_page(page);
putback_lru_page(page);
+ disable_lumpy_reclaim_mode(sc);
continue;
activate_locked:
@@ -855,6 +899,8 @@ activate_locked:
keep_locked:
unlock_page(page);
keep:
+ disable_lumpy_reclaim_mode(sc);
+keep_lumpy:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
@@ -1255,7 +1301,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
return false;
/* Only stall on lumpy reclaim */
- if (!sc->lumpy_reclaim_mode)
+ if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
return false;
/* If we have reclaimed everything on the isolated list, no stall */
@@ -1300,15 +1346,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
return SWAP_CLUSTER_MAX;
}
-
+ set_lumpy_reclaim_mode(priority, sc, false);
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
if (scanning_global_lru(sc)) {
nr_taken = isolate_pages_global(nr_to_scan,
&page_list, &nr_scanned, sc->order,
- sc->lumpy_reclaim_mode ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
+ sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+ ISOLATE_INACTIVE : ISOLATE_BOTH,
zone, 0, file);
zone->pages_scanned += nr_scanned;
if (current_is_kswapd())
@@ -1320,8 +1366,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
} else {
nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
&page_list, &nr_scanned, sc->order,
- sc->lumpy_reclaim_mode ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
+ sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+ ISOLATE_INACTIVE : ISOLATE_BOTH,
zone, sc->mem_cgroup,
0, file);
/*
@@ -1339,7 +1385,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_reclaimed = shrink_page_list(&page_list, sc);
/* Check if we should synchronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1350,7 +1396,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
nr_active = clear_active_flags(&page_list, NULL);
count_vm_events(PGDEACTIVATE, nr_active);
- nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+ set_lumpy_reclaim_mode(priority, sc, true);
+ nr_reclaimed += shrink_page_list(&page_list, sc);
}
local_irq_disable();
@@ -1727,21 +1774,6 @@ out:
}
}
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
-{
- /*
- * If we need a large contiguous chunk of memory, or have
- * trouble getting a small set of contiguous pages, we
- * will reclaim both active and inactive pages.
- */
- if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
- sc->lumpy_reclaim_mode = 1;
- else if (sc->order && priority < DEF_PRIORITY - 2)
- sc->lumpy_reclaim_mode = 1;
- else
- sc->lumpy_reclaim_mode = 0;
-}
-
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -1756,8 +1788,6 @@ static void shrink_zone(int priority, struct zone *zone,
get_scan_count(zone, sc, nr, priority);
- set_lumpy_reclaim_mode(priority, sc);
-
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 07/10] vmscan: Remove dead code in shrink_inactive_list()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
After synchronous lumpy reclaim, the page_list is guaranteed not to
contain active pages because page activation in shrink_page_list() disables
lumpy reclaim. Remove the dead code.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 21d1153..64f9ca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1334,7 +1334,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
- unsigned long nr_active;
unsigned long nr_anon;
unsigned long nr_file;
@@ -1389,13 +1388,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
/* Check if we should synchronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- /*
- * The attempt at page out may have made some
- * of the pages active, mark them inactive again.
- */
- nr_active = clear_active_flags(&page_list, NULL);
- count_vm_events(PGDEACTIVATE, nr_active);
-
set_lumpy_reclaim_mode(priority, sc, true);
nr_reclaimed += shrink_page_list(&page_list, sc);
}
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 08/10] vmscan: isolate_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
neighbouring pages of the eviction page. The neighbour search does not stop
even when a neighbour cannot be isolated, which is excessive because lumpy
reclaim can then no longer result in a successful higher-order allocation.
This patch stops the PFN neighbour search when an isolation fails and moves
on to the next block.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64f9ca5..ff52b46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
continue;
/* Avoid holes within the zone. */
- if (unlikely(!pfn_valid_within(pfn)))
+ if (unlikely(!pfn_valid_within(pfn))) {
+ nr_lumpy_failed++;
break;
+ }
cursor_page = pfn_to_page(pfn);
/* Check that we have not crossed a zone boundary. */
- if (unlikely(page_zone_id(cursor_page) != zone_id))
- continue;
+ if (unlikely(page_zone_id(cursor_page) != zone_id)) {
+ nr_lumpy_failed++;
+ break;
+ }
/*
* If we don't have enough swap space, reclaiming of
@@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
* pointless.
*/
if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
- !PageSwapCache(cursor_page))
- continue;
+ !PageSwapCache(cursor_page)) {
+ nr_lumpy_failed++;
+ break;
+ }
if (__isolate_lru_page(cursor_page, mode, file) == 0) {
list_move(&cursor_page->lru, dst);
@@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
nr_lumpy_dirty++;
scan++;
} else {
- if (mode == ISOLATE_BOTH &&
- page_count(cursor_page))
- nr_lumpy_failed++;
+ /* the page is freed already. */
+ if (!page_count(cursor_page))
+ continue;
+ nr_lumpy_failed++;
+ break;
}
}
}
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
* [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
When memory is under enough pressure, a process may enter direct
reclaim to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. Stack overflows have already
been observed on XFS, but the problem is not XFS-specific.
This patch prevents direct reclaim from writing back filesystem pages by
checking whether current is kswapd or the page is anonymous before writing
back. If the dirty pages cannot be written back, they are placed back on
the LRU lists for either background writing by the BDI threads or kswapd. If
dirty pages are encountered during direct lumpy reclaim, the process stalls,
waiting for the background flusher, before trying to reclaim the pages again.
As the call-chain for writing anonymous pages is not expected to be deep
and they are not cleaned by flusher threads, anonymous pages are still
written back in direct reclaim.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 46 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ff52b46..408c101 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
#define scanning_global_lru(sc) (1)
#endif
+/* Direct lumpy reclaim waits up to five seconds for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 50
+
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
@@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
- struct scan_control *sc)
+ struct scan_control *sc,
+ unsigned long *nr_still_dirty)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
+ unsigned long nr_dirty = 0;
unsigned long nr_reclaimed = 0;
cond_resched();
@@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
if (PageDirty(page)) {
+ /*
+ * Only kswapd can writeback filesystem pages to
+ * avoid risk of stack overflow
+ */
+ if (page_is_file_cache(page) && !current_is_kswapd()) {
+ nr_dirty++;
+ goto keep_locked;
+ }
+
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
@@ -908,6 +922,8 @@ keep_lumpy:
free_page_list(&free_pages);
list_splice(&ret_pages, page_list);
+
+ *nr_still_dirty = nr_dirty;
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
return false;
+ /* If we cannot writeback, there is no point stalling */
+ if (!sc->may_writepage)
+ return false;
+
/* If we have reclaimed everything on the isolated list, no stall */
if (nr_freed == nr_taken)
return false;
@@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
struct scan_control *sc, int priority, int file)
{
LIST_HEAD(page_list);
+ LIST_HEAD(putback_list);
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
unsigned long nr_anon;
unsigned long nr_file;
+ unsigned long nr_dirty;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, sc);
+ nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
/* Check if we should synchronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
+ int dirty_retry = MAX_SWAP_CLEAN_WAIT;
set_lumpy_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, sc);
+
+ while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
+ struct page *page, *tmp;
+
+ /* Take off the clean pages marked for activation */
+ list_for_each_entry_safe(page, tmp, &page_list, lru) {
+ if (PageDirty(page) || PageWriteback(page))
+ continue;
+
+ list_del(&page->lru);
+ list_add(&page->lru, &putback_list);
+ }
+
+ wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+
+ nr_reclaimed = shrink_page_list(&page_list, sc,
+ &nr_dirty);
+ }
}
+ list_splice(&putback_list, &page_list);
+
local_irq_disable();
if (current_is_kswapd())
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
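As a rough model of the retry loop added above: MAX_SWAP_CLEAN_WAIT bounds the number of passes, and each pass sleeps in wait_iff_congested() for up to HZ/10, i.e. a tenth of a second, so 50 retries give the five seconds the comment mentions. This is a hedged userspace sketch, not kernel code; `dirty_retry_passes` and `cleaned_per_pass` are illustrative names modelling how many passes the loop makes before the dirty pages are cleaned or the budget runs out:

```c
#include <assert.h>

#define MAX_SWAP_CLEAN_WAIT 50   /* retry budget, from the patch */
#define SLEEP_MS_PER_RETRY  100  /* wait_iff_congested(..., HZ/10) ~ 100ms */

/*
 * Each pass wakes the flushers, sleeps briefly, and retries reclaim;
 * here we pretend the flushers clean 'cleaned_per_pass' pages per pass
 * and count how many passes happen before nr_dirty reaches zero or the
 * retry budget is exhausted.
 */
static int dirty_retry_passes(unsigned long nr_dirty,
                              unsigned long cleaned_per_pass)
{
    int passes = 0;
    int budget = MAX_SWAP_CLEAN_WAIT;

    while (nr_dirty > 0 && budget-- > 0) {
        passes++;
        nr_dirty -= (cleaned_per_pass < nr_dirty) ? cleaned_per_pass
                                                  : nr_dirty;
    }
    return passes;
}
```

Note the worst case: if the flushers make no progress at all, the loop still terminates after MAX_SWAP_CLEAN_WAIT passes, roughly 50 * 100ms = 5 seconds of stalling.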
* [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:47 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:47 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton, Mel Gorman
There are a number of cases where pages get cleaned, but two of concern
to this patch are:
o When dirtying pages, processes may be throttled to clean pages if
dirty_ratio is not met.
o Pages belonging to inodes dirtied longer than
dirty_writeback_centisecs get cleaned.
The problem for reclaim is that dirty pages can reach the end of the LRU if
pages are being dirtied slowly enough that neither the throttling nor a
periodically waking flusher thread cleans them.
Background flush is already cleaning old or expired inodes first but the
expire time is too far in the future at the time of page reclaim. To mitigate
future problems, this patch wakes flusher threads to clean 4M of data -
an amount that should be manageable without causing congestion in many cases.
Ideally, the background flushers would only be cleaning pages belonging
to the zone being scanned but it's not clear if this would be of benefit
(less IO) or not (potentially less efficient IO if an inode is scattered
across multiple zones).
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 32 ++++++++++++++++++++++++++++++--
1 files changed, 30 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 408c101..33d27a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
/* Direct lumpy reclaim waits up to five seconds for background cleaning */
#define MAX_SWAP_CLEAN_WAIT 50
+/*
+ * When reclaim encounters dirty data, wakeup flusher threads to clean
+ * a maximum of 4M of data.
+ */
+#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
+#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
+static inline long nr_writeback_pages(unsigned long nr_dirty)
+{
+ return laptop_mode ? 0 :
+ min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
+}
+
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
@@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
+ int file,
unsigned long *nr_still_dirty)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
int pgactivate = 0;
unsigned long nr_dirty = 0;
+ unsigned long nr_dirty_seen = 0;
unsigned long nr_reclaimed = 0;
cond_resched();
@@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
if (PageDirty(page)) {
+ nr_dirty_seen++;
+
/*
* Only kswapd can writeback filesystem pages to
* avoid risk of stack overflow
@@ -923,6 +939,18 @@ keep_lumpy:
list_splice(&ret_pages, page_list);
+ /*
+ * If reclaim is encountering dirty pages, it may be because
+ * dirty pages are reaching the end of the LRU even though the
+ * dirty_ratio may be satisfied. In this case, wake flusher
+ * threads to pro-actively clean up to a maximum of
+ * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
+ * !may_writepage indicates that this is a direct reclaimer in
+ * laptop mode avoiding disk spin-ups
+ */
+ if (file && nr_dirty_seen && sc->may_writepage)
+ wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
+
*nr_still_dirty = nr_dirty;
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
@@ -1414,7 +1442,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
+ nr_reclaimed = shrink_page_list(&page_list, sc, file, &nr_dirty);
/* Check if we should synchronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1437,7 +1465,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
nr_reclaimed = shrink_page_list(&page_list, sc,
- &nr_dirty);
+ file, &nr_dirty);
}
}
--
1.7.1
^ permalink raw reply related [flat|nested] 133+ messages in thread
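The arithmetic in nr_writeback_pages() above can be checked in isolation. Assuming 4KB pages (PAGE_SHIFT of 12), MAX_WRITEBACK works out to 1024 pages and WRITEBACK_FACTOR to 32. This sketch passes laptop_mode as an explicit parameter instead of the kernel global and is only a model of the clamping logic, not the kernel implementation:

```c
#include <assert.h>

/* Constants mirroring the patch, assuming PAGE_SHIFT == 12 (4KB pages) */
#define PAGE_SHIFT        12
#define SWAP_CLUSTER_MAX  32UL
#define MAX_WRITEBACK     (4194304UL >> PAGE_SHIFT)          /* 4M -> 1024 pages */
#define WRITEBACK_FACTOR  (MAX_WRITEBACK / SWAP_CLUSTER_MAX) /* 32 */

/*
 * Ask the flusher threads to clean a multiple of the dirty pages seen,
 * capped at 4M of data; in laptop mode request nothing, to avoid
 * spinning up the disk.
 */
static long nr_writeback_pages(unsigned long nr_dirty, int laptop_mode)
{
    unsigned long want = nr_dirty * WRITEBACK_FACTOR;

    if (laptop_mode)
        return 0;
    return (long)(want < MAX_WRITEBACK ? want : MAX_WRITEBACK);
}
```

So seeing just 32 dirty pages (one SWAP_CLUSTER_MAX batch) is already enough to request the full 4M of background writeback.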
* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-06 10:49 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-06 10:49 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
*sigh*
The subject should have been [PATCH 0/10] of course.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-07 15:25 ` Minchan Kim
0 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:25 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 11:47:26AM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/backing-dev.h | 2 +-
> include/trace/events/writeback.h | 7 ++++
> mm/backing-dev.c | 66 ++++++++++++++++++++++++++++++++++++-
> mm/page_alloc.c | 4 +-
> mm/vmscan.c | 26 ++++++++++++--
> 5 files changed, 96 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 35b0074..f1b402a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -285,7 +285,7 @@ enum {
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> void set_bdi_congested(struct backing_dev_info *bdi, int sync);
> long congestion_wait(int sync, long timeout);
> -
> +long wait_iff_congested(struct zone *zone, int sync, long timeout);
>
> static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 275d477..eeaf1f5 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
> TP_ARGS(usec_timeout, usec_delayed)
> );
>
> +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> +
> + TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> +
> + TP_ARGS(usec_timeout, usec_delayed)
> +);
> +
> #endif /* _TRACE_WRITEBACK_H */
>
> /* This part must be outside protection */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 298975a..94b5433 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> +static atomic_t nr_bdi_congested[2];
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - clear_bit(bit, &bdi->state);
> + if (test_and_clear_bit(bit, &bdi->state))
> + atomic_dec(&nr_bdi_congested[sync]);
> smp_mb__after_clear_bit();
> if (waitqueue_active(wqh))
> wake_up(wqh);
> @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> enum bdi_state bit;
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - set_bit(bit, &bdi->state);
> + if (!test_and_set_bit(bit, &bdi->state))
> + atomic_inc(&nr_bdi_congested[sync]);
> }
> EXPORT_SYMBOL(set_bdi_congested);
>
> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> }
> EXPORT_SYMBOL(congestion_wait);
>
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
wait_iff_congested
> + * @zone: A zone to consider the number of being being written back from
> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion. If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
and
> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */
> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> + long ret;
> + unsigned long start = jiffies;
> + DEFINE_WAIT(wait);
> + wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> + /*
> + * If there is no congestion, check the amount of writeback. If there
> + * is no significant writeback and no congestion, just cond_resched
> + */
> + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> + unsigned long inactive, writeback;
> +
> + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_ANON);
> + writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> + /*
> + * If less than half the inactive list is being written back,
> + * reclaim might as well continue
> + */
> + if (writeback < inactive / 2) {
I am not sure this is best.
1. Is it right to fix the threshold at half of the inactive list without
considering the various speed classes of storage?
2. Isn't there already writeback throttling at a higher layer? Do we need
to care about it here?
Just out of curiosity.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
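Minchan's questions concern the heuristic wait_iff_congested() uses to decide whether to sleep at all. Reduced to just that decision, the logic looks roughly like the following userspace model (the real function additionally sets up the congestion waitqueue, calls cond_resched() on the fast path, and emits a tracepoint; the function name here is illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decision model: sleep on congestion only if some BDI is actually
 * congested, or if at least half the inactive pages in the zone are
 * already under writeback; otherwise the caller should just
 * cond_resched() and keep reclaiming.
 */
static bool should_sleep_on_congestion(int nr_bdi_congested,
                                       unsigned long nr_inactive,
                                       unsigned long nr_writeback)
{
    if (nr_bdi_congested > 0)
        return true;                      /* a BDI is congested: real wait */
    return nr_writeback >= nr_inactive / 2; /* heavy writeback: also wait */
}
```

Minchan's first question is about the hard-coded `/ 2`: whether one fixed fraction of the inactive list is a sensible threshold across storage devices of very different speeds.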
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
@ 2010-09-07 15:25 ` Minchan Kim
0 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:25 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 11:47:26AM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/backing-dev.h | 2 +-
> include/trace/events/writeback.h | 7 ++++
> mm/backing-dev.c | 66 ++++++++++++++++++++++++++++++++++++-
> mm/page_alloc.c | 4 +-
> mm/vmscan.c | 26 ++++++++++++--
> 5 files changed, 96 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 35b0074..f1b402a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -285,7 +285,7 @@ enum {
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> void set_bdi_congested(struct backing_dev_info *bdi, int sync);
> long congestion_wait(int sync, long timeout);
> -
> +long wait_iff_congested(struct zone *zone, int sync, long timeout);
>
> static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 275d477..eeaf1f5 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
> TP_ARGS(usec_timeout, usec_delayed)
> );
>
> +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> +
> + TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> +
> + TP_ARGS(usec_timeout, usec_delayed)
> +);
> +
> #endif /* _TRACE_WRITEBACK_H */
>
> /* This part must be outside protection */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 298975a..94b5433 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> +static atomic_t nr_bdi_congested[2];
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - clear_bit(bit, &bdi->state);
> + if (test_and_clear_bit(bit, &bdi->state))
> + atomic_dec(&nr_bdi_congested[sync]);
> smp_mb__after_clear_bit();
> if (waitqueue_active(wqh))
> wake_up(wqh);
> @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> enum bdi_state bit;
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - set_bit(bit, &bdi->state);
> + if (!test_and_set_bit(bit, &bdi->state))
> + atomic_inc(&nr_bdi_congested[sync]);
> }
> EXPORT_SYMBOL(set_bdi_congested);
>
> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> }
> EXPORT_SYMBOL(congestion_wait);
>
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
wait_iff_congested
> + * @zone: A zone to consider the number of being being written back from
> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion. If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
and
> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */
> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> + long ret;
> + unsigned long start = jiffies;
> + DEFINE_WAIT(wait);
> + wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> + /*
> + * If there is no congestion, check the amount of writeback. If there
> + * is no significant writeback and no congestion, just cond_resched
> + */
> + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> + unsigned long inactive, writeback;
> +
> + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_ANON);
> + writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> + /*
> + * If less than half the inactive list is being written back,
> + * reclaim might as well continue
> + */
> + if (writeback < inactive / 2) {
I am not sure this is best.
1. Can we fix the threshold at half of the inactive list without considering the various storage speed classes?
2. Isn't there writeback throttling in the layer above? Do we need to care about it here?
Just out of curiosity.
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-07 15:26 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:26 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 11:47:27AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> congestion_wait() means "wait until queue congestion is cleared". However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-07 15:28 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:28 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 11:47:28AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> With synchronous lumpy reclaim, there is no reason to give up reclaiming a
> page just because it is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 07/10] vmscan: Remove dead code in shrink_inactive_list()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-07 15:33 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:33 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 11:47:30AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> After synchronous lumpy reclaim, the page_list is guaranteed to not
> have active pages as page activation in shrink_page_list() disables lumpy
> reclaim. Remove the dead code.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-07 15:37 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-07 15:37 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
> neighbouring pages of the eviction page. The neighbour search does not stop
> even when neighbours cannot be isolated, which is excessive because lumpy
> reclaim can no longer result in a successful higher-order allocation. This
> patch stops scanning PFN neighbour pages when an isolation fails and moves
> on to the next block.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/vmscan.c | 24 ++++++++++++++++--------
> 1 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64f9ca5..ff52b46 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> continue;
>
> /* Avoid holes within the zone. */
> - if (unlikely(!pfn_valid_within(pfn)))
> + if (unlikely(!pfn_valid_within(pfn))) {
> + nr_lumpy_failed++;
> break;
> + }
>
> cursor_page = pfn_to_page(pfn);
>
> /* Check that we have not crossed a zone boundary. */
> - if (unlikely(page_zone_id(cursor_page) != zone_id))
> - continue;
> + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> + nr_lumpy_failed++;
> + break;
> + }
>
> /*
> * If we don't have enough swap space, reclaiming of
> @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> * pointless.
> */
> if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> - !PageSwapCache(cursor_page))
> - continue;
> + !PageSwapCache(cursor_page)) {
> + nr_lumpy_failed++;
> + break;
> + }
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> list_move(&cursor_page->lru, dst);
> @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> nr_lumpy_dirty++;
> scan++;
> } else {
> - if (mode == ISOLATE_BOTH &&
Why can we remove the ISOLATE_BOTH check?
Is it an intentional behavior change?
> - page_count(cursor_page))
> - nr_lumpy_failed++;
> + /* the page is freed already. */
> + if (!page_count(cursor_page))
> + continue;
> + nr_lumpy_failed++;
> + break;
> }
> }
> }
> --
> 1.7.1
>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 3:14 ` KOSAKI Motohiro
-1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-08 3:14 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, linux-mm, linux-fsdevel, Linux Kernel List,
Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
Andrea Arcangeli, KAMEZAWA Hiroyuki, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> There have been numerous reports of stalls that pointed at the problem being
> somewhere in the VM. There are multiple roots to the problems which means
> dealing with any of the root problems in isolation is tricky to justify on
> their own and they would still need integration testing. This patch series
> gathers together three different patch sets which in combination should
> tackle some of the root causes of latency problems being reported.
>
> The first patch improves vmscan latency by tracking when pages get reclaimed
> by shrink_inactive_list. For this series, the most important result is
> being able to calculate the scanning/reclaim ratio as a measure of the
> amount of work being done by page reclaim.
>
> Patches 2 and 3 account for the time spent in congestion_wait() and avoid
> going to sleep on congestion when it is unnecessary. This is expected
> to reduce stalls in situations where the system is under memory pressure
> but not due to congestion.
>
> Patches 4-8 were originally developed by Kosaki Motohiro but reworked for
> this series. It has been noted that lumpy reclaim is far too aggressive and
> thrashes the system somewhat. As SLUB uses high-order allocations, a large
> cost incurred by lumpy reclaim will be noticeable. It was also reported
> during transparent hugepage support testing that lumpy reclaim was thrashing
> the system and these patches should mitigate that problem without disabling
> lumpy reclaim.
Wow, I'm sorry my laziness bothered you. I'll start testing this patch series
ASAP and give feedback soon.
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 6:15 ` Johannes Weiner
-1 siblings, 0 replies; 133+ messages in thread
From: Johannes Weiner @ 2010-09-08 6:15 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Minchan Kim, Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, Sep 06, 2010 at 11:47:27AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> congestion_wait() means "wait until queue congestion is cleared". However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/vmscan.c | 2 --
> 1 files changed, 0 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eabe987..5979850 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1341,8 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> -
> /*
> * The attempt at page out may have made some
> * of the pages active, mark them inactive again.
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 6:16 ` Johannes Weiner
-1 siblings, 0 replies; 133+ messages in thread
From: Johannes Weiner @ 2010-09-08 6:16 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Minchan Kim, Wu Fengguang, Andrea Arcangeli, KAMEZAWA Hiroyuki,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, Sep 06, 2010 at 11:47:28AM +0100, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> With synchronous lumpy reclaim, there is no reason to give up reclaiming a
> page just because it is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/vmscan.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5979850..79bd812 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> - if (!trylock_page(page))
> + if (sync_writeback == PAGEOUT_IO_SYNC)
> + lock_page(page);
> + else if (!trylock_page(page))
> goto keep;
>
> VM_BUG_ON(PageActive(page));
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
2010-09-08 3:14 ` KOSAKI Motohiro
@ 2010-09-08 8:38 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 8:38 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Wed, Sep 08, 2010 at 12:14:29PM +0900, KOSAKI Motohiro wrote:
> > There have been numerous reports of stalls that pointed at the problem being
> > somewhere in the VM. There are multiple roots to the problems which means
> > dealing with any of the root problems in isolation is tricky to justify on
> > their own and they would still need integration testing. This patch series
> > gathers together three different patch sets which in combination should
> > tackle some of the root causes of latency problems being reported.
> >
> > The first patch improves vmscan latency by tracking when pages get reclaimed
> > by shrink_inactive_list. For this series, the most important result is
> > being able to calculate the scanning/reclaim ratio as a measure of the
> > amount of work being done by page reclaim.
> >
> > Patches 2 and 3 account for the time spent in congestion_wait() and avoid
> > going to sleep on congestion when it is unnecessary. This is expected
> > to reduce stalls in situations where the system is under memory pressure
> > but not due to congestion.
> >
> > Patches 4-8 were originally developed by Kosaki Motohiro but reworked for
> > this series. It has been noted that lumpy reclaim is far too aggressive and
> > thrashes the system somewhat. As SLUB uses high-order allocations, a large
> > cost incurred by lumpy reclaim will be noticeable. It was also reported
> > during transparent hugepage support testing that lumpy reclaim was thrashing
> > the system and these patches should mitigate that problem without disabling
> > lumpy reclaim.
>
> Wow, I'm sorry my laziness bothered you. I'll start testing this patch series
> ASAP and give feedback soon.
>
It did not bother me at all. I generally agreed with the direction and
it seemed sensible to take them into consideration before patches 9 and
10 in particular and make sure they all played nicely together.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-07 15:25 ` Minchan Kim
@ 2010-09-08 11:04 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 11:04 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> On Mon, Sep 06, 2010 at 11:47:26AM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > <SNIP>
> > +/**
> > + * congestion_wait - wait for a backing_dev to become uncongested
> wait_iff_congested
>
Fixed, thanks.
> > + * @zone: A zone to consider the number of pages being written back from
> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion. If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no significant writeback or congestion, there is no point
> and
>
Why "and"? "or" makes sense because we avoid sleeping on either condition.
> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > + long ret;
> > + unsigned long start = jiffies;
> > + DEFINE_WAIT(wait);
> > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > + /*
> > + * If there is no congestion, check the amount of writeback. If there
> > + * is no significant writeback and no congestion, just cond_resched
> > + */
> > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > + unsigned long inactive, writeback;
> > +
> > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_ANON);
> > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > + /*
> > + * If less than half the inactive list is being written back,
> > + * reclaim might as well continue
> > + */
> > + if (writeback < inactive / 2) {
>
> I am not sure this is best.
>
I'm not saying it is. The objective is to identify situations where sleeping
until the next write completes or congestion clears is pointless. We have
already established that we are not congested, so the question is "are we
writing a lot at the moment?". The assumption is that if a lot of writing is
going on, we might as well sleep until some of it completes rather than
reclaiming more.
This is a first effort at identifying pointless sleeps. Better heuristics
might be identified in the future, but that shouldn't stop us from making a
semi-sensible decision now.
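The heuristic being debated can be sketched in userspace C. This is only an
illustration of the decision logic in the quoted patch: the struct fields and
function names here are invented stand-ins for `zone_page_state()` lookups,
not kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel's per-zone counters; the field names are
 * hypothetical, mirroring the zone_page_state() reads in the patch. */
struct zone_stats {
	unsigned long nr_inactive_file;
	unsigned long nr_inactive_anon;
	unsigned long nr_writeback;
};

/*
 * Mirrors the decision in the quoted wait_iff_congested(): if any BDI is
 * congested, sleeping is justified; otherwise sleep only when at least
 * half the inactive list is already under writeback. In every other case
 * the caller should just cond_resched() and keep reclaiming.
 */
static bool should_sleep(unsigned long nr_bdi_congested,
			 const struct zone_stats *z)
{
	unsigned long inactive;

	if (nr_bdi_congested != 0)
		return true;	/* wait for congestion to clear */

	inactive = z->nr_inactive_file + z->nr_inactive_anon;

	/* Less than half the inactive list in flight: keep reclaiming. */
	return z->nr_writeback >= inactive / 2;
}
```

The "half the inactive list" threshold is exactly the judgment call Minchan
questions below; the sketch only shows where that knob sits.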
> 1. Without considering various speed class storage, could we fix it as half of inactive?
We don't really have a good means of identifying the speed class of
storage. Worse, we are operating on a zone basis here, not a BDI
basis. The pages being written back in the zone could be backed by
anything, so we cannot make decisions based on BDI speed.
> 2. Isn't there writeback throttling in the layers above? Do we need to care about it here?
>
There is, but congestion_wait() and now wait_iff_congested() are part of
that throttling. The figures in the series leader show that congestion_wait()
sleeps more than is necessary or sensible.
> Just out of curiosity.
>
--
Mel Gorman
Part-time PhD Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-07 15:37 ` Minchan Kim
@ 2010-09-08 11:12 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 11:12 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 12:37:08AM +0900, Minchan Kim wrote:
> On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >
> > isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
> > the PFN neighbours of the page being evicted. The neighbour search does not
> > stop even when a neighbour cannot be isolated, which is excessive because
> > lumpy reclaim can then no longer result in a successful higher-order
> > allocation. This patch stops scanning the PFN neighbours when an isolation
> > fails and moves on to the next block.
> >
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 24 ++++++++++++++++--------
> > 1 files changed, 16 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 64f9ca5..ff52b46 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > continue;
> >
> > /* Avoid holes within the zone. */
> > - if (unlikely(!pfn_valid_within(pfn)))
> > + if (unlikely(!pfn_valid_within(pfn))) {
> > + nr_lumpy_failed++;
> > break;
> > + }
> >
> > cursor_page = pfn_to_page(pfn);
> >
> > /* Check that we have not crossed a zone boundary. */
> > - if (unlikely(page_zone_id(cursor_page) != zone_id))
> > - continue;
> > + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > + nr_lumpy_failed++;
> > + break;
> > + }
> >
> > /*
> > * If we don't have enough swap space, reclaiming of
> > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > * pointless.
> > */
> > if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > - !PageSwapCache(cursor_page))
> > - continue;
> > + !PageSwapCache(cursor_page)) {
> > + nr_lumpy_failed++;
> > + break;
> > + }
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > list_move(&cursor_page->lru, dst);
> > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > nr_lumpy_dirty++;
> > scan++;
> > } else {
> > - if (mode == ISOLATE_BOTH &&
>
> Why can we remove ISOLATION_BOTH check?
Because this is lumpy reclaim, it doesn't matter whether we are isolating
inactive pages, active pages or both. The fact that we failed to isolate the
page while it still holds a reference count means that a contiguous
allocation in that area will fail.
> Is it an intentional behavior change?
>
Yes.
> > - page_count(cursor_page))
> > - nr_lumpy_failed++;
> > + /* the page is freed already. */
> > + if (!page_count(cursor_page))
> > + continue;
> > + nr_lumpy_failed++;
> > + break;
> > }
> > }
> > }
--
Mel Gorman
Part-time PhD Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 11:25 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 11:25 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 06:47:27PM +0800, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> congestion_wait() means "wait until queue congestion is cleared". However,
> synchronous lumpy reclaim does not need this congestion_wait() because
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback(),
> which provides the necessary waiting.
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 11:28 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 11:28 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 06:47:28PM +0800, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> With synchronous lumpy reclaim, there is no reason to give up on reclaiming
> a page just because it is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 11:37 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 11:37 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
> the PFN neighbours of the page being evicted. The neighbour search does not
> stop even when a neighbour cannot be isolated, which is excessive because
> lumpy reclaim can then no longer result in a successful higher-order
> allocation. This patch stops scanning the PFN neighbours when an isolation
> fails and moves on to the next block.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/vmscan.c | 24 ++++++++++++++++--------
> 1 files changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64f9ca5..ff52b46 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> continue;
>
> /* Avoid holes within the zone. */
> - if (unlikely(!pfn_valid_within(pfn)))
> + if (unlikely(!pfn_valid_within(pfn))) {
> + nr_lumpy_failed++;
> break;
> + }
>
> cursor_page = pfn_to_page(pfn);
>
> /* Check that we have not crossed a zone boundary. */
> - if (unlikely(page_zone_id(cursor_page) != zone_id))
> - continue;
> + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> + nr_lumpy_failed++;
> + break;
> + }
>
> /*
> * If we don't have enough swap space, reclaiming of
> @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> * pointless.
> */
> if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> - !PageSwapCache(cursor_page))
> - continue;
> + !PageSwapCache(cursor_page)) {
> + nr_lumpy_failed++;
> + break;
> + }
>
> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> list_move(&cursor_page->lru, dst);
> @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> nr_lumpy_dirty++;
> scan++;
> } else {
> - if (mode == ISOLATE_BOTH &&
> - page_count(cursor_page))
> - nr_lumpy_failed++;
> + /* the page is freed already. */
> + if (!page_count(cursor_page))
> + continue;
> + nr_lumpy_failed++;
> + break;
> }
> }
The many nr_lumpy_failed++ increments could be consolidated into a single check after the loop:
if (pfn < end_pfn)
nr_lumpy_failed++;
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-08 11:37 ` Wu Fengguang
@ 2010-09-08 12:50 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 12:50 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 07:37:34PM +0800, Wu Fengguang wrote:
> On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >
> > isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
> > the PFN neighbours of the page being evicted. The neighbour search does not
> > stop even when a neighbour cannot be isolated, which is excessive because
> > lumpy reclaim can then no longer result in a successful higher-order
> > allocation. This patch stops scanning the PFN neighbours when an isolation
> > fails and moves on to the next block.
> >
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 24 ++++++++++++++++--------
> > 1 files changed, 16 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 64f9ca5..ff52b46 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > continue;
> >
> > /* Avoid holes within the zone. */
> > - if (unlikely(!pfn_valid_within(pfn)))
> > + if (unlikely(!pfn_valid_within(pfn))) {
> > + nr_lumpy_failed++;
> > break;
> > + }
> >
> > cursor_page = pfn_to_page(pfn);
> >
> > /* Check that we have not crossed a zone boundary. */
> > - if (unlikely(page_zone_id(cursor_page) != zone_id))
> > - continue;
> > + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > + nr_lumpy_failed++;
> > + break;
> > + }
> >
> > /*
> > * If we don't have enough swap space, reclaiming of
> > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > * pointless.
> > */
> > if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > - !PageSwapCache(cursor_page))
> > - continue;
> > + !PageSwapCache(cursor_page)) {
> > + nr_lumpy_failed++;
> > + break;
> > + }
> >
> > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > list_move(&cursor_page->lru, dst);
> > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > nr_lumpy_dirty++;
> > scan++;
> > } else {
> > - if (mode == ISOLATE_BOTH &&
> > - page_count(cursor_page))
> > - nr_lumpy_failed++;
> > + /* the page is freed already. */
> > + if (!page_count(cursor_page))
> > + continue;
> > + nr_lumpy_failed++;
> > + break;
> > }
> > }
>
> The many nr_lumpy_failed++ can be moved here:
>
> if (pfn < end_pfn)
> nr_lumpy_failed++;
>
The break already stops the loop from iterating, so is there an advantage
to making it a pfn check instead? I might be misunderstanding your
suggestion.
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
>
Thanks
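For illustration, Wu's consolidation could look roughly like the following
sketch (plain C with invented names). The point under discussion is only that
every early break leaves pfn short of end_pfn, so a single post-loop test
would cover all of the failure sites:

```c
#include <assert.h>

/* After the neighbour loop, pfn < end_pfn exactly when some break path
 * fired, so one check can replace the per-site nr_lumpy_failed++ lines. */
static unsigned long account_lumpy_failure(unsigned long pfn,
					   unsigned long end_pfn,
					   unsigned long nr_lumpy_failed)
{
	if (pfn < end_pfn)	/* loop broke out before finishing the block */
		nr_lumpy_failed++;
	return nr_lumpy_failed;
}
```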
--
Mel Gorman
Part-time PhD Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-08 12:50 ` Mel Gorman
@ 2010-09-08 13:14 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-08 13:14 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 08:50:44PM +0800, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 07:37:34PM +0800, Wu Fengguang wrote:
> > On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >
> > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolates
> > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > if neighbours cannot be isolated, which is excessive because the lumpy reclaim
> > > can no longer result in a successful higher-order allocation. This patch stops
> > > the PFN neighbour search if an isolation fails and moves on to the next block.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > mm/vmscan.c | 24 ++++++++++++++++--------
> > > 1 files changed, 16 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 64f9ca5..ff52b46 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > continue;
> > >
> > > /* Avoid holes within the zone. */
> > > - if (unlikely(!pfn_valid_within(pfn)))
> > > + if (unlikely(!pfn_valid_within(pfn))) {
> > > + nr_lumpy_failed++;
> > > break;
> > > + }
> > >
> > > cursor_page = pfn_to_page(pfn);
> > >
> > > /* Check that we have not crossed a zone boundary. */
> > > - if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > - continue;
> > > + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > + nr_lumpy_failed++;
> > > + break;
> > > + }
> > >
> > > /*
> > > * If we don't have enough swap space, reclaiming of
> > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > * pointless.
> > > */
> > > if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > - !PageSwapCache(cursor_page))
> > > - continue;
> > > + !PageSwapCache(cursor_page)) {
> > > + nr_lumpy_failed++;
> > > + break;
> > > + }
> > >
> > > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > > list_move(&cursor_page->lru, dst);
> > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > nr_lumpy_dirty++;
> > > scan++;
> > > } else {
> > > - if (mode == ISOLATE_BOTH &&
> > > - page_count(cursor_page))
> > > - nr_lumpy_failed++;
> > > + /* the page is freed already. */
> > > + if (!page_count(cursor_page))
> > > + continue;
> > > + nr_lumpy_failed++;
> > > + break;
> > > }
> > > }
> >
> > The many nr_lumpy_failed++ can be moved here:
> >
> > 	if (pfn < end_pfn)
> > 		nr_lumpy_failed++;
> >
>
> Because the break stops the loop iterating, is there an advantage to
> making it a pfn check instead? I might be misunderstanding your
> suggestion.
The complete view in my mind is

	for (; pfn < end_pfn; pfn++) {
		if (failed 1)
			break;
		if (failed 2)
			break;
		if (failed 3)
			break;
	}
	if (pfn < end_pfn)
		nr_lumpy_failed++;

Sure, it just reduces several lines of code :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-08 13:14 ` Wu Fengguang
@ 2010-09-08 13:27 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-08 13:27 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 09:14:04PM +0800, Wu Fengguang wrote:
> On Wed, Sep 08, 2010 at 08:50:44PM +0800, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 07:37:34PM +0800, Wu Fengguang wrote:
> > > On Mon, Sep 06, 2010 at 06:47:31PM +0800, Mel Gorman wrote:
> > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > >
> > > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolates
> > > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > > if neighbours cannot be isolated, which is excessive because the lumpy reclaim
> > > > can no longer result in a successful higher-order allocation. This patch stops
> > > > the PFN neighbour search if an isolation fails and moves on to the next block.
> > > >
> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > > mm/vmscan.c | 24 ++++++++++++++++--------
> > > > 1 files changed, 16 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 64f9ca5..ff52b46 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > > continue;
> > > >
> > > > /* Avoid holes within the zone. */
> > > > - if (unlikely(!pfn_valid_within(pfn)))
> > > > + if (unlikely(!pfn_valid_within(pfn))) {
> > > > + nr_lumpy_failed++;
> > > > break;
> > > > + }
> > > >
> > > > cursor_page = pfn_to_page(pfn);
> > > >
> > > > /* Check that we have not crossed a zone boundary. */
> > > > - if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > > - continue;
> > > > + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > > + nr_lumpy_failed++;
> > > > + break;
> > > > + }
> > > >
> > > > /*
> > > > * If we don't have enough swap space, reclaiming of
> > > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > > * pointless.
> > > > */
> > > > if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > > - !PageSwapCache(cursor_page))
> > > > - continue;
> > > > + !PageSwapCache(cursor_page)) {
> > > > + nr_lumpy_failed++;
> > > > + break;
> > > > + }
> > > >
> > > > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > > > list_move(&cursor_page->lru, dst);
> > > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > > nr_lumpy_dirty++;
> > > > scan++;
> > > > } else {
> > > > - if (mode == ISOLATE_BOTH &&
> > > > - page_count(cursor_page))
> > > > - nr_lumpy_failed++;
> > > > + /* the page is freed already. */
> > > > + if (!page_count(cursor_page))
> > > > + continue;
> > > > + nr_lumpy_failed++;
> > > > + break;
> > > > }
> > > > }
> > >
> > > The many nr_lumpy_failed++ can be moved here:
> > >
> > > 	if (pfn < end_pfn)
> > > 		nr_lumpy_failed++;
> > >
> >
> > Because the break stops the loop iterating, is there an advantage to
> > making it a pfn check instead? I might be misunderstanding your
> > suggestion.
>
> The complete view in my mind is
>
> 	for (; pfn < end_pfn; pfn++) {
> 		if (failed 1)
> 			break;
> 		if (failed 2)
> 			break;
> 		if (failed 3)
> 			break;
> 	}
> 	if (pfn < end_pfn)
> 		nr_lumpy_failed++;
>
> Sure it just reduces several lines of code :)
>
Fair point. I applied the following patch on top.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33d27a4..54df972 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1091,18 +1091,14 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				continue;
 
 			/* Avoid holes within the zone. */
-			if (unlikely(!pfn_valid_within(pfn))) {
-				nr_lumpy_failed++;
+			if (unlikely(!pfn_valid_within(pfn)))
 				break;
-			}
 
 			cursor_page = pfn_to_page(pfn);
 
 			/* Check that we have not crossed a zone boundary. */
-			if (unlikely(page_zone_id(cursor_page) != zone_id)) {
-				nr_lumpy_failed++;
+			if (unlikely(page_zone_id(cursor_page) != zone_id))
 				break;
-			}
 
 			/*
 			 * If we don't have enough swap space, reclaiming of
@@ -1110,10 +1106,8 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 			 * pointless.
 			 */
 			if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
-			    !PageSwapCache(cursor_page)) {
-				nr_lumpy_failed++;
+			    !PageSwapCache(cursor_page))
 				break;
-			}
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
 				list_move(&cursor_page->lru, dst);
@@ -1127,10 +1121,13 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				/* the page is freed already. */
 				if (!page_count(cursor_page))
 					continue;
-				nr_lumpy_failed++;
 				break;
 			}
 		}
+
+		/* If we break out of the loop above, lumpy reclaim failed */
+		if (pfn < end_pfn)
+			nr_lumpy_failed++;
 	}
 
 	*scanned = scan;
^ permalink raw reply related [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-08 11:04 ` Mel Gorman
@ 2010-09-08 14:52 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-08 14:52 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > + * @zone: A zone to consider the number of pages being written back from
> > > + * @sync: SYNC or ASYNC IO
> > > + * @timeout: timeout in jiffies
> > > + *
> > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > + * write congestion. If no backing_devs are congested then the number of
> > > + * writeback pages in the zone are checked and compared to the inactive
> > > + * list. If there is no significant writeback or congestion, there is no point
> > and
> >
>
> Why and? "or" makes sense because we avoid sleeping on either condition.
	if (nr_bdi_congested[sync] == 0) {
		if (writeback < inactive / 2) {
			cond_resched();
			..
			goto out
		}
	}
To avoid sleeping, both of the above conditions must be met,
so I thought "and" makes sense.
Am I missing something?
>
> > > + * in sleeping but cond_resched() is called in case the current process has
> > > + * consumed its CPU quota.
> > > + */
> > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > +{
> > > +	long ret;
> > > +	unsigned long start = jiffies;
> > > +	DEFINE_WAIT(wait);
> > > +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > +
> > > +	/*
> > > +	 * If there is no congestion, check the amount of writeback. If there
> > > +	 * is no significant writeback and no congestion, just cond_resched
> > > +	 */
> > > +	if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > +		unsigned long inactive, writeback;
> > > +
> > > +		inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > +			zone_page_state(zone, NR_INACTIVE_ANON);
> > > +		writeback = zone_page_state(zone, NR_WRITEBACK);
> > > +
> > > +		/*
> > > +		 * If less than half the inactive list is being written back,
> > > +		 * reclaim might as well continue
> > > +		 */
> > > +		if (writeback < inactive / 2) {
> >
> > I am not sure this is best.
> >
>
> I'm not saying it is. The objective is to identify a situation where
> sleeping until the next write or congestion clears is pointless. We have
> already identified that we are not congested so the question is "are we
> writing a lot at the moment?". The assumption is that if there is a lot
> of writing going on, we might as well sleep until one completes rather
> than reclaiming more.
>
> This is the first effort at identifying pointless sleeps. Better ones
> might be identified in the future but that shouldn't stop us making a
> semi-sensible decision now.
nr_bdi_congested is no problem since we have used it for a long time,
but you added a new rule about writeback.
The reason I pointed it out is that it is a new heuristic, and I want
others to be aware of the change in case they have better ideas or
other opinions. I think that is one of the roles of a reviewer.
>
> > 1. Without considering various speed class storage, could we fix it as half of inactive?
>
> We don't really have a good means of identifying speed classes of
> storage. Worse, we are considering on a zone-basis here, not a BDI
> basis. The pages being written back in the zone could be backed by
> anything so we cannot make decisions based on BDI speed.
True, and that is why I asked the question below.
As you said, we don't have enough information in vmscan,
so I am not sure how effective such a semi-sensible decision is.
I think the best approach is to throttle well in page-writeback,
but I am not an expert on that and don't have a concrete idea. Sorry.
So I can't insist on my nitpick. If others don't have any objection,
I don't mind this either.
Wu, Do you have any opinion?
>
> > 2. Isn't there any writeback throttling on above layer? Do we care of it in here?
> >
>
> There are but congestion_wait() and now wait_iff_congested() are part of
> that. We can see from the figures in the leader that congestion_wait()
> is sleeping more than is necessary or smart.
>
> > Just out of curiosity.
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-08 11:12 ` Mel Gorman
@ 2010-09-08 14:58 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-08 14:58 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 12:12:30PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 12:37:08AM +0900, Minchan Kim wrote:
> > On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >
> > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolates
> > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > if neighbours cannot be isolated, which is excessive because the lumpy reclaim
> > > can no longer result in a successful higher-order allocation. This patch stops
> > > the PFN neighbour search if an isolation fails and moves on to the next block.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > mm/vmscan.c | 24 ++++++++++++++++--------
> > > 1 files changed, 16 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 64f9ca5..ff52b46 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > continue;
> > >
> > > /* Avoid holes within the zone. */
> > > - if (unlikely(!pfn_valid_within(pfn)))
> > > + if (unlikely(!pfn_valid_within(pfn))) {
> > > + nr_lumpy_failed++;
> > > break;
> > > + }
> > >
> > > cursor_page = pfn_to_page(pfn);
> > >
> > > /* Check that we have not crossed a zone boundary. */
> > > - if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > - continue;
> > > + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > + nr_lumpy_failed++;
> > > + break;
> > > + }
> > >
> > > /*
> > > * If we don't have enough swap space, reclaiming of
> > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > * pointless.
> > > */
> > > if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > - !PageSwapCache(cursor_page))
> > > - continue;
> > > + !PageSwapCache(cursor_page)) {
> > > + nr_lumpy_failed++;
> > > + break;
> > > + }
> > >
> > > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > > list_move(&cursor_page->lru, dst);
> > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > nr_lumpy_dirty++;
> > > scan++;
> > > } else {
> > > - if (mode == ISOLATE_BOTH &&
> >
> > Why can we remove the ISOLATE_BOTH check?
>
> Because this is lumpy reclaim and whether we are isolating inactive, active
> or both doesn't matter. The fact we failed to isolate the page and it has
> a reference count means that a contiguous allocation in that area will fail.
>
> > Is it an intentional behaviour change?
> >
>
> Yes.
It looks good to me.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
@ 2010-09-08 14:58 ` Minchan Kim
0 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-08 14:58 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 12:12:30PM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 12:37:08AM +0900, Minchan Kim wrote:
> > On Mon, Sep 06, 2010 at 11:47:31AM +0100, Mel Gorman wrote:
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >
> > > isolate_lru_pages() does not just isolate LRU tail pages, but also isolate
> > > neighbour pages of the eviction page. The neighbour search does not stop even
> > > if neighbours cannot be isolated which is excessive as the lumpy reclaim will
> > > no longer result in a successful higher order allocation. This patch stops
> > > the PFN neighbour pages if an isolation fails and moves on to the next block.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > mm/vmscan.c | 24 ++++++++++++++++--------
> > > 1 files changed, 16 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 64f9ca5..ff52b46 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > continue;
> > >
> > > /* Avoid holes within the zone. */
> > > - if (unlikely(!pfn_valid_within(pfn)))
> > > + if (unlikely(!pfn_valid_within(pfn))) {
> > > + nr_lumpy_failed++;
> > > break;
> > > + }
> > >
> > > cursor_page = pfn_to_page(pfn);
> > >
> > > /* Check that we have not crossed a zone boundary. */
> > > - if (unlikely(page_zone_id(cursor_page) != zone_id))
> > > - continue;
> > > + if (unlikely(page_zone_id(cursor_page) != zone_id)) {
> > > + nr_lumpy_failed++;
> > > + break;
> > > + }
> > >
> > > /*
> > > * If we don't have enough swap space, reclaiming of
> > > @@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > * pointless.
> > > */
> > > if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
> > > - !PageSwapCache(cursor_page))
> > > - continue;
> > > + !PageSwapCache(cursor_page)) {
> > > + nr_lumpy_failed++;
> > > + break;
> > > + }
> > >
> > > if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> > > list_move(&cursor_page->lru, dst);
> > > @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> > > nr_lumpy_dirty++;
> > > scan++;
> > > } else {
> > > - if (mode == ISOLATE_BOTH &&
> >
> > Why can we remove the ISOLATE_BOTH check?
>
> Because this is lumpy reclaim and whether we are isolating inactive, active
> or both doesn't matter. The fact we failed to isolate the page and it has
> a reference count means that a contiguous allocation in that area will fail.
>
> > Is it an intentional behavior change?
> >
>
> Yes.
It looks good to me.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-08 21:23 ` Andrew Morton
-1 siblings, 0 replies; 133+ messages in thread
From: Andrew Morton @ 2010-09-08 21:23 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig
On Mon, 6 Sep 2010 11:47:26 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
The patch series looks generally good. Would like to see some testing
results ;) A few touchups are planned so I'll await v2.
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> +static atomic_t nr_bdi_congested[2];
Let's remember that a queue can get congested because of reads as well
as writes. It's very rare for this to happen - it needs either a
zillion read()ing threads or someone going berserk with O_DIRECT aio,
etc. Probably it doesn't matter much, but for memory reclaim purposes
read-congestion is somewhat irrelevant and a bit of thought is warranted.
vmscan currently only looks at *write* congestion, but in this patch
you secretly change that logic to newly look at write-or-read
congestion. Talk to me.
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - clear_bit(bit, &bdi->state);
> + if (test_and_clear_bit(bit, &bdi->state))
> + atomic_dec(&nr_bdi_congested[sync]);
> smp_mb__after_clear_bit();
> if (waitqueue_active(wqh))
> wake_up(wqh);
Worried. Having a single slow disk getting itself gummed up will
affect the entire machine!
There's potential for pathological corner-case problems here. "When I
do a big aio read from /dev/MySuckyUsbStick, all my CPUs get pegged in
page reclaim!".
What to do?
Of course, we'd very much prefer to know whether a queue which we're
interested in for writeback will block when we try to write to it.
Much better than looking at all queues.
Important question: which of the current congestion_wait() call sites
are causing appreciable stalls?
I think a more accurate way of implementing this is to be smarter with
the may_write_to_queue()->bdi_write_congested() result. If a previous
attempt to write off this LRU encountered congestion then fine, call
congestion_wait(). But if writeback is not hitting
may_write_to_queue()->bdi_write_congested() then that is the time to
avoid calling congestion_wait().
In other words, save the bdi_write_congested() result in the zone
struct in some fashion and inspect that before deciding to synchronize
behind the underlying device's write rate. Not hitting a congested
device for this LRU? Then don't wait for congested devices.
> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> }
> EXPORT_SYMBOL(congestion_wait);
>
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
> + * @zone: A zone to consider the number of being being written back from
That comment needs help.
> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion.'
write or read congestion!!
> If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */
Document the return value?
> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> + long ret;
> + unsigned long start = jiffies;
> + DEFINE_WAIT(wait);
> + wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> + /*
> + * If there is no congestion, check the amount of writeback. If there
> + * is no significant writeback and no congestion, just cond_resched
> + */
> + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> + unsigned long inactive, writeback;
> +
> + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_ANON);
> + writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> + /*
> + * If less than half the inactive list is being written back,
> + * reclaim might as well continue
> + */
> + if (writeback < inactive / 2) {
This is all getting seriously inaccurate :(
> + cond_resched();
> +
> + /* In case we scheduled, work out time remaining */
> + ret = timeout - (jiffies - start);
> + if (ret < 0)
> + ret = 0;
> +
> + goto out;
> + }
> + }
> +
> + /* Sleep until uncongested or a write happens */
> + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + ret = io_schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> +
> +out:
> + trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> + jiffies_to_usecs(jiffies - start));
Does this tracepoint tell us how often wait_iff_congested() is sleeping
versus how often it is returning immediately?
> + return ret;
> +}
> +EXPORT_SYMBOL(wait_iff_congested);
>
> ...
>
> @@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> sc->may_writepage = 1;
> }
>
> - /* Take a nap, wait for some writeback to complete */
> + /* Take a nap if congested, wait for some writeback */
> if (!sc->hibernation_mode && sc->nr_scanned &&
> - priority < DEF_PRIORITY - 2)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + priority < DEF_PRIORITY - 2) {
> + struct zone *active_zone = NULL;
> + unsigned long max_writeback = 0;
> + for_each_zone_zonelist(zone, z, zonelist,
> + gfp_zone(sc->gfp_mask)) {
> + unsigned long writeback;
> +
> + /* Initialise for first zone */
> + if (active_zone == NULL)
> + active_zone = zone;
> +
> + writeback = zone_page_state(zone, NR_WRITEBACK);
> + if (writeback > max_writeback) {
> + max_writeback = writeback;
> + active_zone = zone;
> + }
> + }
> +
> + wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> + }
Again, we would benefit from more accuracy here. In my above
suggestion I'm assuming that the (congestion) result of the most recent
attempt to perform writeback is a predictor of the next attempt.
Doing that on a kernel-wide basis would be rather inaccurate on large
machines in some scenarios. Storing the state info in the zone would
help.
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-09 3:02 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:02 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, 6 Sep 2010 11:47:26 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> If congestion_wait() is called with no BDIs congested, the caller will sleep
> for the full timeout and this may be an unnecessary sleep. This patch adds
> a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> congested or if there is a significant amount of writeback going on in an
> interesting zone. Else, it calls cond_resched() to ensure the caller is
> not hogging the CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/linux/backing-dev.h | 2 +-
> include/trace/events/writeback.h | 7 ++++
> mm/backing-dev.c | 66 ++++++++++++++++++++++++++++++++++++-
> mm/page_alloc.c | 4 +-
> mm/vmscan.c | 26 ++++++++++++--
> 5 files changed, 96 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 35b0074..f1b402a 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -285,7 +285,7 @@ enum {
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> void set_bdi_congested(struct backing_dev_info *bdi, int sync);
> long congestion_wait(int sync, long timeout);
> -
> +long wait_iff_congested(struct zone *zone, int sync, long timeout);
>
> static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> {
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index 275d477..eeaf1f5 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
> TP_ARGS(usec_timeout, usec_delayed)
> );
>
> +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> +
> + TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> +
> + TP_ARGS(usec_timeout, usec_delayed)
> +);
> +
> #endif /* _TRACE_WRITEBACK_H */
>
> /* This part must be outside protection */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 298975a..94b5433 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> +static atomic_t nr_bdi_congested[2];
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - clear_bit(bit, &bdi->state);
> + if (test_and_clear_bit(bit, &bdi->state))
> + atomic_dec(&nr_bdi_congested[sync]);
> smp_mb__after_clear_bit();
> if (waitqueue_active(wqh))
> wake_up(wqh);
> @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> enum bdi_state bit;
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - set_bit(bit, &bdi->state);
> + if (!test_and_set_bit(bit, &bdi->state))
> + atomic_inc(&nr_bdi_congested[sync]);
> }
> EXPORT_SYMBOL(set_bdi_congested);
>
> @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> }
> EXPORT_SYMBOL(congestion_wait);
>
> +/**
> + * congestion_wait - wait for a backing_dev to become uncongested
> + * @zone: A zone to consider the number of being being written back from
> + * @sync: SYNC or ASYNC IO
> + * @timeout: timeout in jiffies
> + *
> + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> + * write congestion. If no backing_devs are congested then the number of
> + * writeback pages in the zone are checked and compared to the inactive
> + * list. If there is no sigificant writeback or congestion, there is no point
> + * in sleeping but cond_resched() is called in case the current process has
> + * consumed its CPU quota.
> + */
> +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> +{
> + long ret;
> + unsigned long start = jiffies;
> + DEFINE_WAIT(wait);
> + wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> + /*
> + * If there is no congestion, check the amount of writeback. If there
> + * is no significant writeback and no congestion, just cond_resched
> + */
> + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> + unsigned long inactive, writeback;
> +
> + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_ANON);
> + writeback = zone_page_state(zone, NR_WRITEBACK);
> +
> + /*
> + * If less than half the inactive list is being written back,
> + * reclaim might as well continue
> + */
> + if (writeback < inactive / 2) {
Hmm.. can't we have a way to "find a page which can just be dropped without writeback"
rather than sleeping? I think we can throttle the number of victims to avoid I/O
congestion, as pages/tick... if that budget is exhausted, ok, we should sleep.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-09 3:03 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:03 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, 6 Sep 2010 11:47:27 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> congestion_wait() means "wait until queue congestion is cleared". However,
> synchronous lumpy reclaim does not need this congestion_wait() as
> shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback()
> and it provides the necessary waiting.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchronous lumpy reclaim should use lock_page() instead of trylock_page()
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-09 3:04 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:04 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, 6 Sep 2010 11:47:28 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> With synchronous lumpy reclaim, there is no reason to give up reclaiming
> a page just because it is locked. This patch uses lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
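The distinction being reviewed can be pictured with a tiny userspace model (illustrative only; the names are mine, not kernel code): trylock_page() corresponds to a non-blocking attempt that simply skips a contended page, while lock_page() would block until the holder releases it, which is exactly where the deadlock concern raised later in the thread comes from if the lock holder is itself waiting on a memory allocation.

```c
#include <assert.h>

/* Userspace model (illustrative): a "page lock" as a simple flag. */
static int page_locked;

/* trylock_page() analogue: fails immediately if the page is locked */
static int trylock_page_model(void)
{
    if (page_locked)
        return 0;       /* contended: skip the page, cannot deadlock */
    page_locked = 1;
    return 1;
}

static void unlock_page_model(void)
{
    page_locked = 0;
}
```

A lock_page() analogue would instead spin or sleep at the contended point, so reclaim could never skip a page held by a task that is waiting on reclaim itself.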
* Re: [PATCH 06/10] vmscan: Narrow the scenarios lumpy reclaim uses synchronous reclaim
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-09 3:14 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:14 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, 6 Sep 2010 11:47:29 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> shrink_page_list() can decide to give up reclaiming a page under a
> number of conditions such as
>
> 1. trylock_page() failure
> 2. page is unevictable
> 3. zone reclaim and page is mapped
> 4. PageWriteback() is true
> 5. page is swapbacked and swap is full
> 6. add_to_swap() failure
> 7. page is dirty and gfpmask doesn't have GFP_IO or GFP_FS
> 8. page is pinned
> 9. IO queue is congested
> 10. pageout() started IO, but it has not finished
>
> During lumpy reclaim, all of these failures result in entering synchronous
> lumpy reclaim, but this can be unnecessary. In cases (2), (3), (5), (6), (7)
> and (8), there is no point retrying. This patch causes lumpy reclaim to
> abort when it is known it will fail.
>
> Case (9) is more interesting. The current behavior is:
> 1. start shrink_page_list(async)
> 2. found queue_congested()
> 3. skip pageout write
> 4. still start shrink_page_list(sync)
> 5. wait on a lot of pages
> 6. again, found queue_congested()
> 7. give up pageout write again
>
> So it is a meaningless waste of time. However, simply skipping the write is
> not good either because, on x86 for example, allocating a huge page needs
> 512 pages, which can exceed the queue congestion threshold (~128) in
> dirty pages alone.
>
> After this patch, pageout() behaves as follows:
>
> - If order > PAGE_ALLOC_COSTLY_ORDER
>   Always ignore queue congestion.
> - If order <= PAGE_ALLOC_COSTLY_ORDER
>   Skip writing the page and disable lumpy reclaim.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
seems nice.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
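The order-dependent policy in the patch description reduces to a small decision function. A sketch of that decision (the enum and function names are mine, chosen for illustration; only PAGE_ALLOC_COSTLY_ORDER and the congestion check come from the patch):

```c
#include <assert.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* as defined in the kernel's gfp.h */

enum pageout_action {
    WRITE_PAGE,                 /* no congestion: write back as usual */
    WRITE_DESPITE_CONGESTION,   /* costly order: ignore queue congestion */
    SKIP_AND_DISABLE_LUMPY      /* cheap order: skip write, drop lumpy mode */
};

/* Model of the post-patch pageout() decision for a dirty page. */
static enum pageout_action pageout_decision(int order, int queue_congested)
{
    if (!queue_congested)
        return WRITE_PAGE;
    if (order > PAGE_ALLOC_COSTLY_ORDER)
        return WRITE_DESPITE_CONGESTION;  /* e.g. huge page: 512 dirty pages
                                             can exceed the threshold anyway */
    return SKIP_AND_DISABLE_LUMPY;
}
```

The design point is that only expensive allocations (order above the costly threshold) justify pushing writes into a congested queue.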
* Re: [PATCH 05/10] vmscan: Synchronous lumpy reclaim use lock_page() instead of trylock_page()
2010-09-09 3:04 ` KAMEZAWA Hiroyuki
@ 2010-09-09 3:15 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, linux-mm, linux-fsdevel, Linux Kernel List,
Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
Andrea Arcangeli, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Thu, 9 Sep 2010 12:04:48 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Mon, 6 Sep 2010 11:47:28 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> >
> > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > pages even if page is locked. This patch uses lock_page() instead of
> > trylock_page() in this case.
> >
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
Ah... but can't this change cause a deadlock?
Thanks,
-Kame
* Re: [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-09 3:17 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:17 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, 6 Sep 2010 11:47:31 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>
> isolate_lru_pages() does not just isolate LRU tail pages; it also isolates
> the neighbouring pages of the eviction page. The neighbour search does not
> stop even when neighbours cannot be isolated, which is excessive because
> lumpy reclaim can no longer result in a successful higher-order allocation.
> This patch stops the PFN neighbour search when an isolation fails and moves
> on to the next block.
>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
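The change amounts to an early exit from the neighbour scan: once one PFN in the candidate block cannot be isolated, the higher-order block can no longer be assembled, so scanning further neighbours is wasted work. A hypothetical sketch, with an array standing in for the PFN range:

```c
#include <assert.h>

/*
 * Model of the neighbour scan: can_isolate[i] says whether the page at
 * PFN base+i can be isolated. Returns the number of pages taken; it
 * stops at the first failure because the contiguous block is then
 * already broken.
 */
static int scan_neighbours(const int *can_isolate, int nr_pages)
{
    int i, taken = 0;

    for (i = 0; i < nr_pages; i++) {
        if (!can_isolate[i])
            break;              /* higher-order block cannot form; stop */
        taken++;
    }
    return taken;
}
```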
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-09 3:22 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-09 3:22 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Mon, 6 Sep 2010 11:47:33 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> There are a number of cases where pages get cleaned, but two of concern
> to this patch are:
> o When dirtying pages, processes may be throttled to clean pages if
> dirty_ratio is not met.
> o Pages belonging to inodes dirtied longer than
> dirty_writeback_centisecs get cleaned.
>
> The problem for reclaim is that dirty pages can reach the end of the LRU if
> pages are being dirtied slowly, so that neither the throttling nor a flusher
> thread waking periodically cleans them.
>
> Background flush is already cleaning old or expired inodes first but the
> expire time is too far in the future at the time of page reclaim. To mitigate
> future problems, this patch wakes flusher threads to clean 4M of data -
> an amount that should be manageable without causing congestion in many cases.
>
> Ideally, the background flushers would only be cleaning pages belonging
> to the zone being scanned but it's not clear if this would be of benefit
> (less IO) or not (potentially less efficient IO if an inode is scattered
> across multiple zones).
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/vmscan.c | 32 ++++++++++++++++++++++++++++++--
> 1 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 408c101..33d27a4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> #define MAX_SWAP_CLEAN_WAIT 50
>
> +/*
> + * When reclaim encounters dirty data, wakeup flusher threads to clean
> + * a maximum of 4M of data.
> + */
> +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> +static inline long nr_writeback_pages(unsigned long nr_dirty)
> +{
> + return laptop_mode ? 0 :
> + min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> +}
> +
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> struct scan_control *sc)
> {
> @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> struct scan_control *sc,
> + int file,
> unsigned long *nr_still_dirty)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> int pgactivate = 0;
> unsigned long nr_dirty = 0;
> + unsigned long nr_dirty_seen = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
> @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
>
> if (PageDirty(page)) {
> + nr_dirty_seen++;
> +
> /*
> * Only kswapd can writeback filesystem pages to
> * avoid risk of stack overflow
> @@ -923,6 +939,18 @@ keep_lumpy:
>
> list_splice(&ret_pages, page_list);
>
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though the
> + * dirty_ratio may be satisified. In this case, wake flusher
> + * threads to pro-actively clean up to a maximum of
> + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> + * !may_writepage indicates that this is a direct reclaimer in
> + * laptop mode avoiding disk spin-ups
> + */
> + if (file && nr_dirty_seen && sc->may_writepage)
> + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> +
Thank you. Ok, I'll check what happens in memcg.
Can I add something like

	if (sc->memcg) {
		memcg_check_flusher_wakeup();
	}

here?
Hm, maybe memcg should wake up the flusher at the start of try_to_free_memory_cgroup_pages().
Thanks,
-Kame
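The clamping arithmetic in the patch above can be checked in isolation. Assuming PAGE_SHIFT is 12 (4K pages, as on x86), MAX_WRITEBACK works out to 1024 pages and WRITEBACK_FACTOR to 32, so the wakeup request grows with nr_dirty but is capped at 4M worth of pages. This sketch mirrors nr_writeback_pages() from the patch, with laptop_mode turned into a parameter for testability:

```c
#include <assert.h>

#define PAGE_SHIFT 12                              /* assumed: 4K pages */
#define SWAP_CLUSTER_MAX 32UL
#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)    /* 4M of data => 1024 pages */
#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)

/* Mirrors nr_writeback_pages() from the patch. */
static unsigned long nr_writeback_pages(unsigned long nr_dirty, int laptop_mode)
{
    unsigned long want = nr_dirty * WRITEBACK_FACTOR;

    if (laptop_mode)
        return 0;               /* avoid spinning up the disk */
    return want < MAX_WRITEBACK ? want : MAX_WRITEBACK;
}
```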
* Re: [PATCH 05/10] vmscan: Synchronous lumpy reclaim use lock_page() instead of trylock_page()
2010-09-09 3:15 ` KAMEZAWA Hiroyuki
@ 2010-09-09 3:25 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-09 3:25 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, linux-mm, linux-fsdevel, Linux Kernel List,
Rik van Riel, Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Thu, Sep 09, 2010 at 11:15:47AM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 9 Sep 2010 12:04:48 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 6 Sep 2010 11:47:28 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >
> > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > pages even if page is locked. This patch uses lock_page() instead of
> > > trylock_page() in this case.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> Ah......but can't this change cause dead lock ??
You mean the task goes for page allocation while holding some page
lock? Seems possible.
Thanks,
Fengguang
* Re: [PATCH 05/10] vmscan: Synchronous lumpy reclaim use lock_page() instead of trylock_page()
2010-09-09 3:15 ` KAMEZAWA Hiroyuki
@ 2010-09-09 4:13 ` KOSAKI Motohiro
-1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-09 4:13 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: kosaki.motohiro, Mel Gorman, linux-mm, linux-fsdevel,
Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> On Thu, 9 Sep 2010 12:04:48 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Mon, 6 Sep 2010 11:47:28 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > >
> > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > pages even if page is locked. This patch uses lock_page() instead of
> > > trylock_page() in this case.
> > >
> > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> Ah......but can't this change cause dead lock ??
Yes, this patch is purely crappy; please drop it. I guess I was poisoned
by a poisonous mushroom from Mario Bros.
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-08 14:52 ` Minchan Kim
@ 2010-09-09 8:54 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 8:54 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 11:52:45PM +0900, Minchan Kim wrote:
> On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > > + * @zone: A zone to consider the number of being being written back from
> > > > + * @sync: SYNC or ASYNC IO
> > > > + * @timeout: timeout in jiffies
> > > > + *
> > > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > > + * write congestion. If no backing_devs are congested then the number of
> > > > + * writeback pages in the zone are checked and compared to the inactive
> > > > + * list. If there is no sigificant writeback or congestion, there is no point
> > > and
> > >
> >
> > Why and? "or" makes sense because we avoid sleeping on either condition.
>
> if (nr_bdi_congested[sync] == 0) {
>         if (writeback < inactive / 2) {
>                 cond_resched();
>                 ...
>                 goto out;
>         }
> }
>
> to avoid sleeping, both of the above conditions must be met.
This is a terrible comment that is badly written. Is this any clearer?
/**
 * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
 * @zone: A zone to consider the number of pages being written back from
 * @sync: SYNC or ASYNC IO
 * @timeout: timeout in jiffies
 *
 * In the event of a congested backing_dev (any backing_dev) or a given @zone
 * having a large number of pages in writeback, this waits for up to @timeout
 * jiffies for either a BDI to exit congestion or a write to complete.
 *
 * If there is no congestion and few pending writes, cond_resched() is
 * called to yield the processor if necessary, but the function otherwise
 * returns without sleeping.
 */
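The decision being documented reduces to a small predicate: sleep only if some BDI is congested or the zone has significant writeback pending. A sketch of that fast path; the half-of-inactive threshold is the heuristic from the patch, while the function and parameter names here are illustrative:

```c
#include <assert.h>

/*
 * Model of the wait_iff_congested() fast path: returns 1 if the caller
 * should actually sleep on the congestion wait queue, 0 if cond_resched()
 * is sufficient.
 */
static int should_sleep(unsigned long nr_congested_bdis,
                        unsigned long nr_writeback, unsigned long nr_inactive)
{
    if (nr_congested_bdis == 0 && nr_writeback < nr_inactive / 2)
        return 0;   /* no congestion, little writeback: sleeping is pointless */
    return 1;       /* wait for congestion to clear or a write to complete */
}
```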
> >
> > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > + * consumed its CPU quota.
> > > > + */
> > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > +{
> > > > + long ret;
> > > > + unsigned long start = jiffies;
> > > > + DEFINE_WAIT(wait);
> > > > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > +
> > > > + /*
> > > > + * If there is no congestion, check the amount of writeback. If there
> > > > + * is no significant writeback and no congestion, just cond_resched
> > > > + */
> > > > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > + unsigned long inactive, writeback;
> > > > +
> > > > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > + zone_page_state(zone, NR_INACTIVE_ANON);
> > > > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > +
> > > > + /*
> > > > + * If less than half the inactive list is being written back,
> > > > + * reclaim might as well continue
> > > > + */
> > > > + if (writeback < inactive / 2) {
> > >
> > > I am not sure this is best.
> > >
> >
> > I'm not saying it is. The objective is to identify a situation where
> > sleeping until the next write or congestion clears is pointless. We have
> > already identified that we are not congested so the question is "are we
> > writing a lot at the moment?". The assumption is that if there is a lot
> > of writing going on, we might as well sleep until one completes rather
> > than reclaiming more.
> >
> > This is the first effort at identifying pointless sleeps. Better ones
> > might be identified in the future but that shouldn't stop us making a
> > semi-sensible decision now.
>
> nr_bdi_congested is no problem since we have used it for a long time.
> But you added a new rule about writeback.
>
Yes, I'm trying to add a new rule about throttling in the page allocator
and from vmscan. As you can see from the results in the leader, we are
currently sleeping more than we need to.
> The reason I pointed this out is that you added a new rule, and I wanted
> to let others know about this change in case they have good ideas or
> opinions. I think that is one of the roles of a reviewer.
>
Of course.
> >
> > > 1. Without considering various speed class storage, could we fix it as half of inactive?
> >
> > We don't really have a good means of identifying speed classes of
> > storage. Worse, we are considering on a zone-basis here, not a BDI
> > basis. The pages being written back in the zone could be backed by
> > anything so we cannot make decisions based on BDI speed.
>
> True. That is why I have the question below.
> As you said, we don't have enough information in vmscan.
> So I am not sure how effective such a semi-sensible decision is.
>
What additional metrics would you apply than the ones I used in the
leader mail?
> I think the best approach is to throttle well in page-writeback.
I do not think there is a problem as such in page writeback throttling.
The problem is that we go to sleep when there is no congestion and no
writes in progress. In that case we sleep for the full timeout for no
reason, and that is what I'm trying to avoid.
> But I am not an expert on that and don't have any idea. Sorry.
Don't be, this is something that needs thinking about!
> So I can't insist on my nitpick. If others don't have any objection,
> I don't mind this, either.
>
> Wu, Do you have any opinion?
>
> >
> > > 2. Isn't there any writeback throttling on above layer? Do we care of it in here?
> > >
> >
> > There are but congestion_wait() and now wait_iff_congested() are part of
> > that. We can see from the figures in the leader that congestion_wait()
> > is sleeping more than is necessary or smart.
> >
> > > Just out of curiosity.
> > >
> >
> > --
> > Mel Gorman
> > Part-time Phd Student Linux Technology Center
> > University of Limerick IBM Dublin Software Lab
>
> --
> Kind regards,
> Minchan Kim
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
@ 2010-09-09 8:54 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 8:54 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Wed, Sep 08, 2010 at 11:52:45PM +0900, Minchan Kim wrote:
> On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> > On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > > + * @zone: A zone to consider the number of pages being written back from
> > > > + * @sync: SYNC or ASYNC IO
> > > > + * @timeout: timeout in jiffies
> > > > + *
> > > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > > + * write congestion. If no backing_devs are congested then the number of
> > > > + * writeback pages in the zone are checked and compared to the inactive
> > > > + * list. If there is no significant writeback or congestion, there is no point
> > > and
> > >
> >
> > Why and? "or" makes sense because we avoid sleeping on either condition.
>
> if (nr_bdi_congested[sync] == 0) {
> if (writeback < inactive / 2) {
> cond_resched();
> ..
> goto out
> }
> }
>
> To avoid sleeping, both of the above conditions must be met.
This is a terrible comment that is badly written. Is this any clearer?
/**
* wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
* @zone: A zone to consider the number of pages being written back from
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* In the event of a congested backing_dev (any backing_dev) or a given @zone
* having a large number of pages in writeback, this waits for up to @timeout
* jiffies for either a BDI to exit congestion or a write to complete.
*
* If there is no congestion and few pending writes, then cond_resched()
* is called to yield the processor if necessary but otherwise does not
* sleep.
*/
> >
> > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > + * consumed its CPU quota.
> > > > + */
> > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > +{
> > > > + long ret;
> > > > + unsigned long start = jiffies;
> > > > + DEFINE_WAIT(wait);
> > > > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > +
> > > > + /*
> > > > + * If there is no congestion, check the amount of writeback. If there
> > > > + * is no significant writeback and no congestion, just cond_resched
> > > > + */
> > > > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > + unsigned long inactive, writeback;
> > > > +
> > > > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > + zone_page_state(zone, NR_INACTIVE_ANON);
> > > > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > +
> > > > + /*
> > > > + * If less than half the inactive list is being written back,
> > > > + * reclaim might as well continue
> > > > + */
> > > > + if (writeback < inactive / 2) {
> > >
> > > I am not sure this is best.
> > >
> >
> > I'm not saying it is. The objective is to identify a situation where
> > sleeping until the next write or congestion clears is pointless. We have
> > already identified that we are not congested so the question is "are we
> > writing a lot at the moment?". The assumption is that if there is a lot
> > of writing going on, we might as well sleep until one completes rather
> > than reclaiming more.
> >
> > This is the first effort at identifying pointless sleeps. Better ones
> > might be identified in the future but that shouldn't stop us making a
> > semi-sensible decision now.
>
> nr_bdi_congested is no problem since we have used it for a long time,
> but you have added a new rule about writeback.
>
Yes, I'm trying to add a new rule about throttling in the page allocator
and from vmscan. As you can see from the results in the leader, we are
currently sleeping more than we need to.
> The reason I pointed this out is that you added a new rule, and I want
> to let others know about the change in case they have ideas or opinions.
> I think that is one of the roles of a reviewer.
>
Of course.
> >
> > > 1. Without considering the various speed classes of storage, can we fix the threshold at half of the inactive list?
> >
> > We don't really have a good means of identifying speed classes of
> > storage. Worse, we are considering on a zone-basis here, not a BDI
> > basis. The pages being written back in the zone could be backed by
> > anything so we cannot make decisions based on BDI speed.
>
> True. That is why I asked the question below.
> As you said, we don't have enough information in vmscan,
> so I am not sure how effective such a semi-sensible decision is.
>
What additional metrics would you apply other than the ones I used in
the leader mail?
> I think the best approach is to throttle well in page-writeback.
I do not think there is a problem as such in page writeback throttling.
The problem is that we are going to sleep without any congestion or without
writes in progress. We sleep for a full timeout in this case for no reason
and this is what I'm trying to avoid.
> But I am not an expert on that and don't have any ideas. Sorry.
Don't be, this is something that needs thinking about!
> So I can't insist on my nitpick. If others don't have any objection,
> I don't mind this, either.
>
> Wu, Do you have any opinion?
>
> >
> > > 2. Isn't there any writeback throttling in the layers above? Do we need to care about it here?
> > >
> >
> > There are, but congestion_wait() and now wait_iff_congested() are part of
> > that. We can see from the figures in the leader that congestion_wait()
> > is sleeping more than is necessary or smart.
> >
> > > Just out of curiosity.
> > >
> >
> > --
> > Mel Gorman
> > Part-time Phd Student Linux Technology Center
> > University of Limerick IBM Dublin Software Lab
>
> --
> Kind regards,
> Minchan Kim
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-09 3:02 ` KAMEZAWA Hiroyuki
@ 2010-09-09 8:58 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 8:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Thu, Sep 09, 2010 at 12:02:31PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 6 Sep 2010 11:47:26 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > include/linux/backing-dev.h | 2 +-
> > include/trace/events/writeback.h | 7 ++++
> > mm/backing-dev.c | 66 ++++++++++++++++++++++++++++++++++++-
> > mm/page_alloc.c | 4 +-
> > mm/vmscan.c | 26 ++++++++++++--
> > 5 files changed, 96 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 35b0074..f1b402a 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -285,7 +285,7 @@ enum {
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> > void set_bdi_congested(struct backing_dev_info *bdi, int sync);
> > long congestion_wait(int sync, long timeout);
> > -
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout);
> >
> > static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> > {
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index 275d477..eeaf1f5 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -181,6 +181,13 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
> > TP_ARGS(usec_timeout, usec_delayed)
> > );
> >
> > +DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
> > +
> > + TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
> > +
> > + TP_ARGS(usec_timeout, usec_delayed)
> > +);
> > +
> > #endif /* _TRACE_WRITEBACK_H */
> >
> > /* This part must be outside protection */
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 298975a..94b5433 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> > };
> > +static atomic_t nr_bdi_congested[2];
> >
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - clear_bit(bit, &bdi->state);
> > + if (test_and_clear_bit(bit, &bdi->state))
> > + atomic_dec(&nr_bdi_congested[sync]);
> > smp_mb__after_clear_bit();
> > if (waitqueue_active(wqh))
> > wake_up(wqh);
> > @@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> > enum bdi_state bit;
> >
> > bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - set_bit(bit, &bdi->state);
> > + if (!test_and_set_bit(bit, &bdi->state))
> > + atomic_inc(&nr_bdi_congested[sync]);
> > }
> > EXPORT_SYMBOL(set_bdi_congested);
> >
> > @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> > }
> > EXPORT_SYMBOL(congestion_wait);
> >
> > +/**
> > + * wait_iff_congested - conditionally wait for a backing_dev to become uncongested
> > + * @zone: A zone to consider the number of pages being written back from
> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion. If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no significant writeback or congestion, there is no point
> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > + long ret;
> > + unsigned long start = jiffies;
> > + DEFINE_WAIT(wait);
> > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > + /*
> > + * If there is no congestion, check the amount of writeback. If there
> > + * is no significant writeback and no congestion, just cond_resched
> > + */
> > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > + unsigned long inactive, writeback;
> > +
> > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_ANON);
> > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > + /*
> > + * If less than half the inactive list is being written back,
> > + * reclaim might as well continue
> > + */
> > + if (writeback < inactive / 2) {
>
> Hmm.. can't we have a way to "find a page which can just be dropped without
> writeback" rather than sleeping?
Sure, just scan for clean pages but then younger clean pages would be reclaimed
before old dirty pages because we were not waiting on writeback. It's a
significant change.
> I think we can throttle the number of victims to avoid I/O
> congestion, as pages/tick... if exhausted, OK, we should sleep.
>
I think it would be tricky to throttle based on time effectively. I find
it easier to think about throttling in terms of congested device, number
of dirty pages in a zone or number of pages currently being written back
because these are events that can prevent reclaim taking place.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-09 4:13 ` KOSAKI Motohiro
@ 2010-09-09 9:22 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 9:22 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel, Linux Kernel List,
Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
Andrea Arcangeli, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Thu, Sep 09, 2010 at 01:13:22PM +0900, KOSAKI Motohiro wrote:
> > On Thu, 9 Sep 2010 12:04:48 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > > On Mon, 6 Sep 2010 11:47:28 +0100
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > >
> > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > >
> > > > With synchronous lumpy reclaim, there is no reason to give up reclaiming
> > > > pages even if the page is locked. This patch uses lock_page() instead of
> > > > trylock_page() in this case.
> > > >
> > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >
> > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >
> > Ah......but can't this change cause a deadlock??
>
> Yes, this patch is purely crappy. please drop. I guess I was poisoned
> by poisonous mushroom of Mario Bros.
>
Let's be clear on what the exact deadlock conditions are. The ones I had
thought about when I felt this patch was OK were:
o We are not holding the LRU lock (or any lock, we just called cond_resched())
o We do not have another page locked because we cannot lock multiple pages
o Kswapd will never be in LUMPY_MODE_SYNC so it is not getting blocked
o lock_page() itself is not allocating anything that we could recurse on
One potential deadlock would be if the direct reclaimer held a page
lock and ended up here but is that situation even allowed? I did not
think of an obvious example of when this would happen. Similarly,
deadlock situations with mmap_sem shouldn't happen unless multiple page
locks are being taken.
(prepares to feel foolish)
What did I miss?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-09 3:22 ` KAMEZAWA Hiroyuki
@ 2010-09-09 9:32 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 9:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Thu, Sep 09, 2010 at 12:22:28PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 6 Sep 2010 11:47:33 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > There are a number of cases where pages get cleaned but two of concern
> > to this patch are:
> > o When dirtying pages, processes may be throttled to clean pages if
> > dirty_ratio is not met.
> > o Pages belonging to inodes dirtied longer than
> > dirty_writeback_centisecs get cleaned.
> >
> > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > pages are being dirtied slowly so that neither the throttling nor a flusher
> > thread waking periodically cleans them.
> >
> > Background flush is already cleaning old or expired inodes first but the
> > expire time is too far in the future at the time of page reclaim. To mitigate
> > future problems, this patch wakes flusher threads to clean 4M of data -
> > an amount that should be manageable without causing congestion in many cases.
> >
> > Ideally, the background flushers would only be cleaning pages belonging
> > to the zone being scanned but it's not clear if this would be of benefit
> > (less IO) or not (potentially less efficient IO if an inode is scattered
> > across multiple zones).
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 32 ++++++++++++++++++++++++++++++--
> > 1 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 408c101..33d27a4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > #define MAX_SWAP_CLEAN_WAIT 50
> >
> > +/*
> > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > + * a maximum of 4M of data.
> > + */
> > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > +{
> > + return laptop_mode ? 0 :
> > + min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > +}
> > +
> > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > struct scan_control *sc)
> > {
> > @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > */
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > struct scan_control *sc,
> > + int file,
> > unsigned long *nr_still_dirty)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > int pgactivate = 0;
> > unsigned long nr_dirty = 0;
> > + unsigned long nr_dirty_seen = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> > @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> >
> > if (PageDirty(page)) {
> > + nr_dirty_seen++;
> > +
> > /*
> > * Only kswapd can writeback filesystem pages to
> > * avoid risk of stack overflow
> > @@ -923,6 +939,18 @@ keep_lumpy:
> >
> > list_splice(&ret_pages, page_list);
> >
> > + /*
> > + * If reclaim is encountering dirty pages, it may be because
> > + * dirty pages are reaching the end of the LRU even though the
> > + * dirty_ratio may be satisfied. In this case, wake flusher
> > + * threads to pro-actively clean up to a maximum of
> > + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > + * !may_writepage indicates that this is a direct reclaimer in
> > + * laptop mode avoiding disk spin-ups
> > + */
> > + if (file && nr_dirty_seen && sc->may_writepage)
> > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > +
>
> Thank you. Ok, I'll check what happens in memcg.
>
Thanks
> Can I add
> if (sc->memcg) {
> memcg_check_flusher_wakeup()
> }
> or something here?
>
It seems reasonable.
> Hm, maybe memcg should wake up the flusher when starting try_to_free_memory_cgroup_pages().
>
I'm afraid I cannot make a judgement call on which is the best as I am
not very familiar with how cgroups behave in comparison to normal
reclaim. There could easily be a follow-on patch though that was cgroup
specific?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
> > * avoid risk of stack overflow
> > @@ -923,6 +939,18 @@ keep_lumpy:
> >
> > list_splice(&ret_pages, page_list);
> >
> > + /*
> > + * If reclaim is encountering dirty pages, it may be because
> > + * dirty pages are reaching the end of the LRU even though the
> > + * dirty_ratio may be satisfied. In this case, wake flusher
> > + * threads to pro-actively clean up to a maximum of 4M of
> > + * data (matching MAX_WRITEBACK above) unless
> > + * !may_writepage indicates that this is a direct reclaimer in
> > + * laptop mode avoiding disk spin-ups
> > + */
> > + if (file && nr_dirty_seen && sc->may_writepage)
> > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > +
>
> Thank you. Ok, I'll check what happens in memcg.
>
Thanks
> Can I add
> if (sc->memcg) {
> memcg_check_flusher_wakeup()
> }
> or some here ?
>
It seems reasonable.
> Hm, maybe memcg should wake up flusher at starting try_to_free_memory_cgroup_pages().
>
I'm afraid I cannot make a judgement call on which is the best as I am
not very familiar with how cgroups behave in comparison to normal
reclaim. There could easily be a follow-on patch though that was cgroup
specific?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-08 21:23 ` Andrew Morton
@ 2010-09-09 10:43 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-09 10:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig
On Wed, Sep 08, 2010 at 02:23:30PM -0700, Andrew Morton wrote:
> On Mon, 6 Sep 2010 11:47:26 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > If congestion_wait() is called with no BDIs congested, the caller will sleep
> > for the full timeout and this may be an unnecessary sleep. This patch adds
> > a wait_iff_congested() that checks congestion and only sleeps if a BDI is
> > congested or if there is a significant amount of writeback going on in an
> > interesting zone. Else, it calls cond_resched() to ensure the caller is
> > not hogging the CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
>
> The patch series looks generally good. Would like to see some testing
> results ;)
They are all in the leader (patch 0 of this series). They are based on a
test-suite that I'm bound to stick a README on and release one of these days :/
> A few touchups are planned so I'll await v2.
>
Good plan.
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> > };
> > +static atomic_t nr_bdi_congested[2];
>
> Let's remember that a queue can get congested because of reads as well
> as writes. It's very rare for this to happen - it needs either a
> zillion read()ing threads or someone going berzerk with O_DIRECT aio,
> etc. Probably it doesn't matter much, but for memory reclaim purposes
> read-congestion is somewhat irrelevant and a bit of thought is warranted.
>
This is an interesting point and would be well worth digging into if
we got a new bug report about stalls under heavy reads.
> vmscan currently only looks at *write* congestion, but in this patch
> you secretly change that logic to newly look at write-or-read
> congestion. Talk to me.
>
vmscan currently only looks at write congestion because it's checking the
BLK_RW_ASYNC and all reads will be BLK_RW_SYNC. Currently, this is why we
are only looking at write congestion even though it's approximate, right?
Remember, congestion_wait used to be about READ and WRITE but now it's about
SYNC and ASYNC.
In the patch, there are separate SYNC and ASYNC nr_bdi_congested counters.
wait_iff_congested() is only called for BLK_RW_ASYNC so we are still
checking write congestion only.
What stupid thing did I miss?
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - clear_bit(bit, &bdi->state);
> > + if (test_and_clear_bit(bit, &bdi->state))
> > + atomic_dec(&nr_bdi_congested[sync]);
> > smp_mb__after_clear_bit();
> > if (waitqueue_active(wqh))
> > wake_up(wqh);
>
> Worried. Having a single slow disk getting itself gummed up will
> affect the entire machine!
>
This can already happen today. In fact, I think it's one of the sources of
desktop stalls during IO from https://bugzilla.kernel.org/show_bug.cgi?id=12309
that you brought up a few weeks back. I was tempted to try resolve it in
this patch but thought I was reaching far enough with this series as it was.
> There's potential for pathological corner-case problems here. "When I
> do a big aio read from /dev/MySuckyUsbStick, all my CPUs get pegged in
> page reclaim!".
>
I thought it might be enough to just do a huge backup to an external USB
drive. I guess I could make it worse by starting one copy per CPU thread,
preferably writing to more than one slow USB device.
> What to do?
>
> Of course, we'd very much prefer to know whether a queue which we're
> interested in for writeback will block when we try to write to it.
> Much better than looking at all queues.
>
And somehow reconciling the queue being written to with the zone the pages
are coming from.
> Important question: which of teh current congestion_wait() call sites
> are causing appreciable stalls?
>
This potentially can be found out from the tracepoints if they record
the stack trace as well. In this patch, I avoided changing all callers to
congestion_wait() and changed a few callers to wait_iff_congested() instead
to limit the scope of what was being changed in this cycle.
> I think a more accurate way of implementing this is to be smarter with
> the may_write_to_queue()->bdi_write_congested() result. If a previous
> attempt to write off this LRU encountered congestion then fine, call
> congestion_wait(). But if writeback is not hitting
> may_write_to_queue()->bdi_write_congested() then that is the time to
> avoid calling congestion_wait().
>
I see the logic. If we assume that there are large amounts of anon page
reclaim while writeback is happening to a USB device for example, we would
avoid a stall in this case. It would still encounter a problem if all the
reclaim is from the file LRU and there are a few pages being written to a
USB stick. We'll still wait on congestion even though it might not have been
necessary and it's why I was counting the number of writeback pages versus
the size of the inactive queue and making a decision based on that.
> In other words, save the bdi_write_congested() result in the zone
> struct in some fashion and inspect that before deciding to synchronize
> behind the underlying device's write rate. Not hitting a congested
> device for this LRU? Then don't wait for congested devices.
>
I think the idea has potential. It will take a fair amount of time to work
out the details though. Testing tends to take a *long* time even with
automation.
> > @@ -774,3 +777,62 @@ long congestion_wait(int sync, long timeout)
> > }
> > EXPORT_SYMBOL(congestion_wait);
> >
> > +/**
> > + * congestion_wait - wait for a backing_dev to become uncongested
> > + * @zone: A zone to consider the number of being being written back from
>
> That comments needs help.
>
Indeed it does. It currently stands as
/**
* wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
* @zone: A zone to consider the number of pages being written back from
* @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* In the event of a congested backing_dev (any backing_dev) or a given zone
* having a large number of pages in writeback, this waits for up to @timeout
* jiffies for a BDI to exit congestion on the given @sync queue or for a
* write to complete.
*
* If there is no congestion and few pending writes, then cond_resched()
* is called to yield the processor if necessary but otherwise does not
* sleep.
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
* returned. return_value == timeout implies the function did not sleep.
*/
> > + * @sync: SYNC or ASYNC IO
> > + * @timeout: timeout in jiffies
> > + *
> > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > + * write congestion.'
>
> write or read congestion!!
>
I just know I'm going to spot where we wait on read congestion the
second I push send and make a fool of myself :(
> > If no backing_devs are congested then the number of
> > + * writeback pages in the zone are checked and compared to the inactive
> > + * list. If there is no significant writeback or congestion, there is no point
> > + * in sleeping but cond_resched() is called in case the current process has
> > + * consumed its CPU quota.
> > + */
>
> Document the return value?
>
What's the fun in that? :)
I included a blurb on the return value in the updated comment above.
> > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > +{
> > + long ret;
> > + unsigned long start = jiffies;
> > + DEFINE_WAIT(wait);
> > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > +
> > + /*
> > + * If there is no congestion, check the amount of writeback. If there
> > + * is no significant writeback and no congestion, just cond_resched
> > + */
> > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > + unsigned long inactive, writeback;
> > +
> > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_ANON);
> > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > +
> > + /*
> > + * If less than half the inactive list is being written back,
> > + * reclaim might as well continue
> > + */
> > + if (writeback < inactive / 2) {
>
> This is all getting seriously inaccurate :(
>
We are already woefully inaccurate.
The intention here is to catch where we are not congested but that there
is sufficient writeback in the zone to make it worthwhile waiting for
some of it to complete. Minimally, we have a reasonable expectation that
if writeback is happening that we'll be woken up if we go to sleep on
the congestion queue.
i.e. it's not great but it's better than what we have at the moment which
can be seen from the micro-mapped-file-stream results in the leader. Time to
completion is reduced, sleepy time is reduced while the ratio of scans/writes
does not get worse.
> > + cond_resched();
> > +
> > + /* In case we scheduled, work out time remaining */
> > + ret = timeout - (jiffies - start);
> > + if (ret < 0)
> > + ret = 0;
> > +
> > + goto out;
> > + }
> > + }
> > +
> > + /* Sleep until uncongested or a write happens */
> > + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > + ret = io_schedule_timeout(timeout);
> > + finish_wait(wqh, &wait);
> > +
> > +out:
> > + trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> > + jiffies_to_usecs(jiffies - start));
>
> Does this tracepoint tell us how often wait_iff_congested() is sleeping
> versus how often it is returning immediately?
>
Yes. Taking an example from the leader
FTrace Reclaim Statistics: congestion_wait
                                   traceonly-v1r5  nocongest-v1r5  lowlumpy-v1r5  nodirect-v1r5
Direct number congest waited                  499               0              0              0
Direct time congest waited                22700ms             0ms            0ms            0ms
Direct full congest waited                    421               0              0              0
Direct number conditional waited                0            1214           1242           1290
Direct time conditional waited                0ms             4ms            0ms            0ms
Direct full conditional waited                421               0              0              0
KSwapd number congest waited                  257             103             94            104
KSwapd time congest waited                22116ms          7344ms         7476ms         7528ms
KSwapd full congest waited                    203              57             59             56
KSwapd number conditional waited                0               0              0              0
KSwapd time conditional waited                0ms             0ms            0ms            0ms
KSwapd full conditional waited                203              57             59             56
A "full congest waited" is a count of the number of times we slept for
more than the timeout. The trace-only kernel reports that direct reclaimers
slept the full timeout 421 times and kswapd slept for the full timeout 203
times. The patch (nocongest-v1r5) reduces these counts significantly.
The report is from a script that reads ftrace information. It's similar in
operation to what's in Documentation/trace/postprocess/.
> > + return ret;
> > +}
> > +EXPORT_SYMBOL(wait_iff_congested);
> >
> > ...
> >
> > @@ -1913,10 +1913,28 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> > sc->may_writepage = 1;
> > }
> >
> > - /* Take a nap, wait for some writeback to complete */
> > + /* Take a nap if congested, wait for some writeback */
> > if (!sc->hibernation_mode && sc->nr_scanned &&
> > - priority < DEF_PRIORITY - 2)
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + priority < DEF_PRIORITY - 2) {
> > + struct zone *active_zone = NULL;
> > + unsigned long max_writeback = 0;
> > + for_each_zone_zonelist(zone, z, zonelist,
> > + gfp_zone(sc->gfp_mask)) {
> > + unsigned long writeback;
> > +
> > + /* Initialise for first zone */
> > + if (active_zone == NULL)
> > + active_zone = zone;
> > +
> > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > + if (writeback > max_writeback) {
> > + max_writeback = writeback;
> > + active_zone = zone;
> > + }
> > + }
> > +
> > + wait_iff_congested(active_zone, BLK_RW_ASYNC, HZ/10);
> > + }
>
> Again, we would benefit from more accuracy here. In my above
> suggestion I'm assuming that the (congestion) result of the most recent
> attempt to perform writeback is a predictor of the next attempt.
>
I suspect you are on to something but it will take me some time to work out
the details and to build a setup involving a few USB sticks to trigger that
test case. What are the possibilities of starting with this heuristic (in
release v2 or v3 of this series) because it improves on what we have today and
then trying out different ideas for how and when to call wait_iff_congested()
in the next cycle?
> Doing that on a kernel-wide basis would be rather inaccurate on large
> machines in some scenarios. Storing the state info in the zone would
> help.
>
We are already depending on kernel-wide inaccuracy. The series aims to chip
away at some of the obvious badness to start with.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchronous lumpy reclaim: use lock_page() instead of trylock_page()
2010-09-09 9:22 ` Mel Gorman
@ 2010-09-10 10:25 ` KOSAKI Motohiro
-1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 10:25 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel,
Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> On Thu, Sep 09, 2010 at 01:13:22PM +0900, KOSAKI Motohiro wrote:
> > > On Thu, 9 Sep 2010 12:04:48 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > >
> > > > On Mon, 6 Sep 2010 11:47:28 +0100
> > > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > >
> > > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > >
> > > > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > > > pages even if page is locked. This patch uses lock_page() instead of
> > > > > trylock_page() in this case.
> > > > >
> > > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > >
> > > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > >
> > > Ah......but can't this change cause dead lock ??
> >
> > Yes, this patch is purely crappy. please drop. I guess I was poisoned
> > by poisonous mushroom of Mario Bros.
> >
>
> Lets be clear on what the exact dead lock conditions are. The ones I had
> thought about when I felt this patch was ok were;
>
> o We are not holding the LRU lock (or any lock, we just called cond_resched())
> o We do not have another page locked because we cannot lock multiple pages
> o Kswapd will never be in LUMPY_MODE_SYNC so it is not getting blocked
> o lock_page() itself is not allocating anything that we could recurse on
True, all.
>
> One potential dead lock would be if the direct reclaimer held a page
> lock and ended up here but is that situation even allowed?
example,

__do_fault()
{
	(snip)
	if (unlikely(!(ret & VM_FAULT_LOCKED)))
		lock_page(vmf.page);
	else
		VM_BUG_ON(!PageLocked(vmf.page));

	/*
	 * Should we do an early C-O-W break?
	 */
	page = vmf.page;
	if (flags & FAULT_FLAG_WRITE) {
		if (!(vma->vm_flags & VM_SHARED)) {
			anon = 1;
			if (unlikely(anon_vma_prepare(vma))) {
				ret = VM_FAULT_OOM;
				goto out;
			}
			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
						vma, address);
AFAIK, the detailed rules are:

o kswapd can call lock_page() because it never takes a page lock outside vmscan
o if trylock_page() succeeds, the task can call lock_page_nosync() on that page
  after unlocking it, because it is then guaranteed to hold no page lock
o otherwise, a direct reclaimer can't call lock_page(); the task may already
  hold a page lock

I think.
> I did not
> think of an obvious example of when this would happen. Similarly,
> deadlock situations with mmap_sem shouldn't happen unless multiple page
> locks are being taken.
>
> (prepares to feel foolish)
>
> What did I miss?
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-10 10:25 ` KOSAKI Motohiro
(?)
@ 2010-09-10 10:33 ` KOSAKI Motohiro
-1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-10 10:33 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: kosaki.motohiro, Mel Gorman, KAMEZAWA Hiroyuki, linux-mm,
linux-fsdevel, Linux Kernel List, Rik van Riel, Johannes Weiner,
Minchan Kim, Wu Fengguang, Andrea Arcangeli, Dave Chinner,
Chris Mason, Christoph Hellwig, Andrew Morton
> Afaik, detailed rule is,
>
> o kswapd can call lock_page() because they never take page lock outside vmscan
s/lock_page()/lock_page_nosync()/
> o if try_lock() is successed, we can call lock_page_nosync() against its page after unlock.
> because the task have gurantee of no lock taken.
> o otherwise, direct reclaimer can't call lock_page(). the task may have a lock already.
>
> I think.
>
>
> > I did not
> > think of an obvious example of when this would happen. Similarly,
> > deadlock situations with mmap_sem shouldn't happen unless multiple page
> > locks are being taken.
> >
> > (prepares to feel foolish)
> >
> > What did I miss?
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-09 8:54 ` Mel Gorman
@ 2010-09-12 15:37 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-12 15:37 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Thu, Sep 09, 2010 at 09:54:36AM +0100, Mel Gorman wrote:
> On Wed, Sep 08, 2010 at 11:52:45PM +0900, Minchan Kim wrote:
> > On Wed, Sep 08, 2010 at 12:04:03PM +0100, Mel Gorman wrote:
> > > On Wed, Sep 08, 2010 at 12:25:33AM +0900, Minchan Kim wrote:
> > > > > + * @zone: A zone to consider the number of being being written back from
> > > > > + * @sync: SYNC or ASYNC IO
> > > > > + * @timeout: timeout in jiffies
> > > > > + *
> > > > > + * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > > > > + * write congestion. If no backing_devs are congested then the number of
> > > > > + * writeback pages in the zone are checked and compared to the inactive
> > > > > + * list. If there is no sigificant writeback or congestion, there is no point
> > > > and
> > > >
> > >
> > > Why and? "or" makes sense because we avoid sleeping on either condition.
> >
> > if (nr_bdi_congested[sync]) == 0) {
> > if (writeback < inactive / 2) {
> > cond_resched();
> > ..
> > goto out
> > }
> > }
> >
> > for avoiding sleeping, above two condition should meet.
>
> This is a terrible comment that is badly written. Is this any clearer?
>
> /**
> * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a zone to complete writes
> * @zone: A zone to consider the number of being being written back from
> * @sync: SYNC or ASYNC IO
> * @timeout: timeout in jiffies
> *
> * In the event of a congested backing_dev (any backing_dev) or a given @zone
> * having a large number of pages in writeback, this waits for up to @timeout
> * jiffies for either a BDI to exit congestion or a write to complete.
> *
> * If there is no congestion and few pending writes, then cond_resched()
> * is called to yield the processor if necessary but otherwise does not
> * sleep.
> */
Looks good.
>
> > >
> > > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > > + * consumed its CPU quota.
> > > > > + */
> > > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > > +{
> > > > > + long ret;
> > > > > + unsigned long start = jiffies;
> > > > > + DEFINE_WAIT(wait);
> > > > > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > > +
> > > > > + /*
> > > > > + * If there is no congestion, check the amount of writeback. If there
> > > > > + * is no significant writeback and no congestion, just cond_resched
> > > > > + */
> > > > > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > > + unsigned long inactive, writeback;
> > > > > +
> > > > > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > + zone_page_state(zone, NR_INACTIVE_ANON);
> > > > > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > > +
> > > > > + /*
> > > > > + * If less than half the inactive list is being written back,
> > > > > + * reclaim might as well continue
> > > > > + */
> > > > > + if (writeback < inactive / 2) {
> > > >
> > > > I am not sure this is best.
> > > >
> > >
> > > I'm not saying it is. The objective is to identify a situation where
> > > sleeping until the next write or congestion clears is pointless. We have
> > > already identified that we are not congested so the question is "are we
> > > writing a lot at the moment?". The assumption is that if there is a lot
> > > of writing going on, we might as well sleep until one completes rather
> > > than reclaiming more.
> > >
> > > This is the first effort at identifying pointless sleeps. Better ones
> > > might be identified in the future but that shouldn't stop us making a
> > > semi-sensible decision now.
> >
> > nr_bdi_congested is no problem since we have used it for a long time.
> > But you added new rule about writeback.
> >
>
> Yes, I'm trying to add a new rule about throttling in the page allocator
> and from vmscan. As you can see from the results in the leader, we are
> currently sleeping more than we need to.
I can see the data about avoiding congestion_wait() but can't find
results for the (writeback < inactive / 2) heuristic.
>
> > Why I pointed out is that you added new rule and I hope let others know
> > this change since they have a good idea or any opinions.
> > I think it's a one of roles as reviewer.
> >
>
> Of course.
>
> > >
> > > > 1. Without considering various speed class storage, could we fix it as half of inactive?
> > >
> > > We don't really have a good means of identifying speed classes of
> > > storage. Worse, we are considering on a zone-basis here, not a BDI
> > > basis. The pages being written back in the zone could be backed by
> > > anything so we cannot make decisions based on BDI speed.
> >
> > True. So it's why I have below question.
> > As you said, we don't have enough information in vmscan.
> > So I am not sure how effective such semi-sensible decision is.
> >
>
> What additional metrics would you apply than the ones I used in the
> leader mail?
The effectiveness of the (writeback < inactive / 2) heuristic.
>
> > I think best is to throttle in page-writeback well.
>
> I do not think there is a problem as such in page writeback throttling.
> The problem is that we are going to sleep without any congestion or without
> writes in progress. We sleep for a full timeout in this case for no reason
> and this is what I'm trying to avoid.
Yes, I agree.
My only concern is the heuristic accuracy I mentioned.
Your previous version did not include this heuristic; it was added
suddenly in this version, so I assume you have some evidence for it.
Please write down the rationale and data if you have them.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-09 9:32 ` Mel Gorman
@ 2010-09-13 0:53 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-13 0:53 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Thu, 9 Sep 2010 10:32:11 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, Sep 09, 2010 at 12:22:28PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 6 Sep 2010 11:47:33 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > There are a number of cases where pages get cleaned but two of concern
> > > to this patch are;
> > > o When dirtying pages, processes may be throttled to clean pages if
> > > dirty_ratio is not met.
> > > o Pages belonging to inodes dirtied longer than
> > > dirty_writeback_centisecs get cleaned.
> > >
> > > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > > pages are being dirtied slowly so that neither the throttling or a flusher
> > > thread waking periodically cleans them.
> > >
> > > Background flush is already cleaning old or expired inodes first but the
> > > expire time is too far in the future at the time of page reclaim. To mitigate
> > > future problems, this patch wakes flusher threads to clean 4M of data -
> > > an amount that should be manageable without causing congestion in many cases.
> > >
> > > Ideally, the background flushers would only be cleaning pages belonging
> > > to the zone being scanned but it's not clear if this would be of benefit
> > > (less IO) or not (potentially less efficient IO if an inode is scattered
> > > across multiple zones).
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > mm/vmscan.c | 32 ++++++++++++++++++++++++++++++--
> > > 1 files changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 408c101..33d27a4 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > > /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > > #define MAX_SWAP_CLEAN_WAIT 50
> > >
> > > +/*
> > > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > > + * a maximum of 4M of data.
> > > + */
> > > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > +{
> > > + return laptop_mode ? 0 :
> > > + min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > +}
> > > +
> > > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > > struct scan_control *sc)
> > > {
> > > @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > > */
> > > static unsigned long shrink_page_list(struct list_head *page_list,
> > > struct scan_control *sc,
> > > + int file,
> > > unsigned long *nr_still_dirty)
> > > {
> > > LIST_HEAD(ret_pages);
> > > LIST_HEAD(free_pages);
> > > int pgactivate = 0;
> > > unsigned long nr_dirty = 0;
> > > + unsigned long nr_dirty_seen = 0;
> > > unsigned long nr_reclaimed = 0;
> > >
> > > cond_resched();
> > > @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > }
> > >
> > > if (PageDirty(page)) {
> > > + nr_dirty_seen++;
> > > +
> > > /*
> > > * Only kswapd can writeback filesystem pages to
> > > * avoid risk of stack overflow
> > > @@ -923,6 +939,18 @@ keep_lumpy:
> > >
> > > list_splice(&ret_pages, page_list);
> > >
> > > + /*
> > > + * If reclaim is encountering dirty pages, it may be because
> > > + * dirty pages are reaching the end of the LRU even though the
> > > + * dirty_ratio may be satisified. In this case, wake flusher
> > > + * threads to pro-actively clean up to a maximum of
> > > + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > > + * !may_writepage indicates that this is a direct reclaimer in
> > > + * laptop mode avoiding disk spin-ups
> > > + */
> > > + if (file && nr_dirty_seen && sc->may_writepage)
> > > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > > +
> >
> > Thank you. Ok, I'll check what happens in memcg.
> >
>
> Thanks
>
> > Can I add
> > if (sc->memcg) {
> > memcg_check_flusher_wakeup()
> > }
> > or some here ?
> >
>
> It seems reasonable.
>
> > Hm, maybe memcg should wake up flusher at starting try_to_free_memory_cgroup_pages().
> >
>
> I'm afraid I cannot make a judgement call on which is the best as I am
> not very familiar with how cgroups behave in comparison to normal
> reclaim. There could easily be a follow-on patch though that was cgroup
> specific?
>
Yes, I'd like to make those patches when this series is merged. It's not
difficult, and doing it separately makes it clearer how memcg and the
flusher interact, which helps with getting good reviews.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
@ 2010-09-13 0:53 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 133+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-09-13 0:53 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KOSAKI Motohiro, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Thu, 9 Sep 2010 10:32:11 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> On Thu, Sep 09, 2010 at 12:22:28PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Mon, 6 Sep 2010 11:47:33 +0100
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > There are a number of cases where pages get cleaned but two of concern
> > > to this patch are;
> > > o When dirtying pages, processes may be throttled to clean pages if
> > > dirty_ratio is not met.
> > > o Pages belonging to inodes dirtied longer than
> > > dirty_writeback_centisecs get cleaned.
> > >
> > > The problem for reclaim is that dirty pages can reach the end of the LRU if
> > > pages are being dirtied slowly so that neither the throttling or a flusher
> > > thread waking periodically cleans them.
> > >
> > > Background flush is already cleaning old or expired inodes first but the
> > > expire time is too far in the future at the time of page reclaim. To mitigate
> > > future problems, this patch wakes flusher threads to clean 4M of data -
> > > an amount that should be manageable without causing congestion in many cases.
> > >
> > > Ideally, the background flushers would only be cleaning pages belonging
> > > to the zone being scanned but it's not clear if this would be of benefit
> > > (less IO) or not (potentially less efficient IO if an inode is scattered
> > > across multiple zones).
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > mm/vmscan.c | 32 ++++++++++++++++++++++++++++++--
> > > 1 files changed, 30 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 408c101..33d27a4 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -148,6 +148,18 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > > /* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > > #define MAX_SWAP_CLEAN_WAIT 50
> > >
> > > +/*
> > > + * When reclaim encounters dirty data, wakeup flusher threads to clean
> > > + * a maximum of 4M of data.
> > > + */
> > > +#define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT)
> > > +#define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX)
> > > +static inline long nr_writeback_pages(unsigned long nr_dirty)
> > > +{
> > > + return laptop_mode ? 0 :
> > > + min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR));
> > > +}
> > > +
> > > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > > struct scan_control *sc)
> > > {
> > > @@ -686,12 +698,14 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > > */
> > > static unsigned long shrink_page_list(struct list_head *page_list,
> > > struct scan_control *sc,
> > > + int file,
> > > unsigned long *nr_still_dirty)
> > > {
> > > LIST_HEAD(ret_pages);
> > > LIST_HEAD(free_pages);
> > > int pgactivate = 0;
> > > unsigned long nr_dirty = 0;
> > > + unsigned long nr_dirty_seen = 0;
> > > unsigned long nr_reclaimed = 0;
> > >
> > > cond_resched();
> > > @@ -790,6 +804,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > }
> > >
> > > if (PageDirty(page)) {
> > > + nr_dirty_seen++;
> > > +
> > > /*
> > > * Only kswapd can writeback filesystem pages to
> > > * avoid risk of stack overflow
> > > @@ -923,6 +939,18 @@ keep_lumpy:
> > >
> > > list_splice(&ret_pages, page_list);
> > >
> > > + /*
> > > + * If reclaim is encountering dirty pages, it may be because
> > > + * dirty pages are reaching the end of the LRU even though the
> > > + * dirty_ratio may be satisified. In this case, wake flusher
> > > + * threads to pro-actively clean up to a maximum of
> > > + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > > + * !may_writepage indicates that this is a direct reclaimer in
> > > + * laptop mode avoiding disk spin-ups
> > > + */
> > > + if (file && nr_dirty_seen && sc->may_writepage)
> > > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> > > +
> >
> > Thank you. Ok, I'll check what happens in memcg.
> >
>
> Thanks
>
> > Can I add
> > if (sc->memcg) {
> > memcg_check_flusher_wakeup()
> > }
> > or some here ?
> >
>
> It seems reasonable.
>
> > Hm, maybe memcg should wake up flusher at starting try_to_free_memory_cgroup_pages().
> >
>
> I'm afraid I cannot make a judgement call on which is the best as I am
> not very familiar with how cgroups behave in comparison to normal
> reclaim. There could easily be a follow-on patch though that was cgroup
> specific?
>
Yes, I'd like to make patches once this series is merged. It's not difficult, and
it would make clear how memcg and the flusher work together, which helps with getting good reviews.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-12 15:37 ` Minchan Kim
@ 2010-09-13 8:55 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 8:55 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 12:37:44AM +0900, Minchan Kim wrote:
> > > > > > <SNIP>
> > > > > >
> > > > > > + * in sleeping but cond_resched() is called in case the current process has
> > > > > > + * consumed its CPU quota.
> > > > > > + */
> > > > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
> > > > > > +{
> > > > > > + long ret;
> > > > > > + unsigned long start = jiffies;
> > > > > > + DEFINE_WAIT(wait);
> > > > > > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > > > > +
> > > > > > + /*
> > > > > > + * If there is no congestion, check the amount of writeback. If there
> > > > > > + * is no significant writeback and no congestion, just cond_resched
> > > > > > + */
> > > > > > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > > > > > + unsigned long inactive, writeback;
> > > > > > +
> > > > > > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > > > + zone_page_state(zone, NR_INACTIVE_ANON);
> > > > > > + writeback = zone_page_state(zone, NR_WRITEBACK);
> > > > > > +
> > > > > > + /*
> > > > > > + * If less than half the inactive list is being written back,
> > > > > > + * reclaim might as well continue
> > > > > > + */
> > > > > > + if (writeback < inactive / 2) {
> > > > >
> > > > > I am not sure this is best.
> > > > >
> > > >
> > > > I'm not saying it is. The objective is to identify a situation where
> > > > sleeping until the next write or congestion clears is pointless. We have
> > > > already identified that we are not congested so the question is "are we
> > > > writing a lot at the moment?". The assumption is that if there is a lot
> > > > of writing going on, we might as well sleep until one completes rather
> > > > than reclaiming more.
> > > >
> > > > This is the first effort at identifying pointless sleeps. Better ones
> > > > might be identified in the future but that shouldn't stop us making a
> > > > semi-sensible decision now.
> > >
> > > nr_bdi_congested is no problem since we have used it for a long time.
> > > But you added new rule about writeback.
> > >
> >
> > Yes, I'm trying to add a new rule about throttling in the page allocator
> > and from vmscan. As you can see from the results in the leader, we are
> > currently sleeping more than we need to.
>
> I can see the results about avoiding congestion_wait but can't find results
> for the (writeback < inactive / 2) heuristic.
>
See the leader and each of the report sections entitled
"FTrace Reclaim Statistics: congestion_wait". It provides a measure of
how sleep times are affected.
"congest waited" are waits due to calling congestion_wait. "conditional waited"
are those related to wait_iff_congested(). As you will see from the reports,
sleep times are reduced overall while callers of wait_iff_congested() still
go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
how reclaim is behaving and indicators so far are that reclaim is not hurt
by introducing wait_iff_congested().
> >
> > > Why I pointed out is that you added new rule and I hope let others know
> > > this change since they have a good idea or any opinions.
> > > I think it's a one of roles as reviewer.
> > >
> >
> > Of course.
> >
> > > >
> > > > > 1. Without considering various speed class storage, could we fix it as half of inactive?
> > > >
> > > > We don't really have a good means of identifying speed classes of
> > > > storage. Worse, we are considering on a zone-basis here, not a BDI
> > > > basis. The pages being written back in the zone could be backed by
> > > > anything so we cannot make decisions based on BDI speed.
> > >
> > > True. So it's why I have below question.
> > > As you said, we don't have enough information in vmscan.
> > > So I am not sure how effective such semi-sensible decision is.
> > >
> >
> > What additional metrics would you apply than the ones I used in the
> > leader mail?
>
> effectiveness of (writeback < inactive / 2) heuristic.
>
Define effectiveness.
In the reports I gave, I reported on the sleep times and whether the full
timeout was slept or not. Sleep times are reduced while not negatively
impacting reclaim.
> >
> > > I think best is to throttle in page-writeback well.
> >
> > I do not think there is a problem as such in page writeback throttling.
> > The problem is that we are going to sleep without any congestion or without
> > writes in progress. We sleep for a full timeout in this case for no reason
> > and this is what I'm trying to avoid.
>
> Yes. I agree.
> Just my concern is the heuristic accuracy I mentioned.
> In your previous version, you didn't add the heuristic.
In the previous version, I also changed all callers of congestion_wait(). V1
simply was not that great a patch and Johannes pointed out that I wasn't
measuring the scanning/reclaim ratios to see how reclaim was impacted. The
reports now include this data and things are looking better.
> But suddenly you added it in this version.
> So I think you have some reason for adding it in this version.
> Please write down the reasoning and data if you have them.
>
The leader has a large amount of data on how this and the other patches
affected results for a good variety of workloads.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-10 10:25 ` KOSAKI Motohiro
@ 2010-09-13 9:14 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 9:14 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel, Linux Kernel List,
Rik van Riel, Johannes Weiner, Minchan Kim, Wu Fengguang,
Andrea Arcangeli, Dave Chinner, Chris Mason, Christoph Hellwig,
Andrew Morton
On Fri, Sep 10, 2010 at 07:25:43PM +0900, KOSAKI Motohiro wrote:
> > On Thu, Sep 09, 2010 at 01:13:22PM +0900, KOSAKI Motohiro wrote:
> > > > On Thu, 9 Sep 2010 12:04:48 +0900
> > > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > >
> > > > > On Mon, 6 Sep 2010 11:47:28 +0100
> > > > > Mel Gorman <mel@csn.ul.ie> wrote:
> > > > >
> > > > > > From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > >
> > > > > > With synchrounous lumpy reclaim, there is no reason to give up to reclaim
> > > > > > pages even if page is locked. This patch uses lock_page() instead of
> > > > > > trylock_page() in this case.
> > > > > >
> > > > > > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > >
> > > > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > >
> > > > Ah......but can't this change cause dead lock ??
> > >
> > > Yes, this patch is purely crappy. please drop. I guess I was poisoned
> > > by poisonous mushroom of Mario Bros.
> > >
> >
> > Lets be clear on what the exact dead lock conditions are. The ones I had
> > thought about when I felt this patch was ok were;
> >
> > o We are not holding the LRU lock (or any lock, we just called cond_resched())
> > o We do not have another page locked because we cannot lock multiple pages
> > o Kswapd will never be in LUMPY_MODE_SYNC so it is not getting blocked
> > o lock_page() itself is not allocating anything that we could recurse on
>
> True, all.
>
> >
> > One potential dead lock would be if the direct reclaimer held a page
> > lock and ended up here but is that situation even allowed?
>
> example,
>
> __do_fault()
> {
> (snip)
> if (unlikely(!(ret & VM_FAULT_LOCKED)))
> lock_page(vmf.page);
> else
> VM_BUG_ON(!PageLocked(vmf.page));
>
> /*
> * Should we do an early C-O-W break?
> */
> page = vmf.page;
> if (flags & FAULT_FLAG_WRITE) {
> if (!(vma->vm_flags & VM_SHARED)) {
> anon = 1;
> if (unlikely(anon_vma_prepare(vma))) {
> ret = VM_FAULT_OOM;
> goto out;
> }
> page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> vma, address);
>
Correct, this is a problem. I had already dropped the patch, but thanks for
pointing out a deadlock; I was missing this case. Nothing stops the
page being faulted from being sent to shrink_page_list() when alloc_page_vma()
is called. The deadlock might be hard to hit, but it's there.
>
> Afaik, detailed rule is,
>
> o kswapd can call lock_page() because it never takes a page lock outside vmscan
lock_page_nosync(), as you point out in your next mail. While kswapd can
call it, it shouldn't, because kswapd normally avoids stalls; it would not
deadlock as a result of calling it, though.
> o if trylock_page() succeeds, we can call lock_page_nosync() on that page after unlocking,
>   because the task is then guaranteed to hold no page lock.
> o otherwise, a direct reclaimer can't call lock_page(); the task may already hold a lock.
>
I think the safer bet is simply to say "direct reclaimers should not
call lock_page() because the fault path could be holding a lock on that
page already".
Thanks.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-13 8:55 ` Mel Gorman
@ 2010-09-13 9:48 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-13 9:48 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 5:55 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Sep 13, 2010 at 12:37:44AM +0900, Minchan Kim wrote:
>> > > > > > <SNIP>
>> > > > > >
>> > > > > > + * in sleeping but cond_resched() is called in case the current process has
>> > > > > > + * consumed its CPU quota.
>> > > > > > + */
>> > > > > > +long wait_iff_congested(struct zone *zone, int sync, long timeout)
>> > > > > > +{
>> > > > > > + long ret;
>> > > > > > + unsigned long start = jiffies;
>> > > > > > + DEFINE_WAIT(wait);
>> > > > > > + wait_queue_head_t *wqh = &congestion_wqh[sync];
>> > > > > > +
>> > > > > > + /*
>> > > > > > + * If there is no congestion, check the amount of writeback. If there
>> > > > > > + * is no significant writeback and no congestion, just cond_resched
>> > > > > > + */
>> > > > > > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
>> > > > > > + unsigned long inactive, writeback;
>> > > > > > +
>> > > > > > + inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
>> > > > > > + zone_page_state(zone, NR_INACTIVE_ANON);
>> > > > > > + writeback = zone_page_state(zone, NR_WRITEBACK);
>> > > > > > +
>> > > > > > + /*
>> > > > > > + * If less than half the inactive list is being written back,
>> > > > > > + * reclaim might as well continue
>> > > > > > + */
>> > > > > > + if (writeback < inactive / 2) {
>> > > > >
>> > > > > I am not sure this is best.
>> > > > >
>> > > >
>> > > > I'm not saying it is. The objective is to identify a situation where
>> > > > sleeping until the next write or congestion clears is pointless. We have
>> > > > already identified that we are not congested so the question is "are we
>> > > > writing a lot at the moment?". The assumption is that if there is a lot
>> > > > of writing going on, we might as well sleep until one completes rather
>> > > > than reclaiming more.
>> > > >
>> > > > This is the first effort at identifying pointless sleeps. Better ones
>> > > > might be identified in the future but that shouldn't stop us making a
>> > > > semi-sensible decision now.
>> > >
>> > > nr_bdi_congested is no problem since we have used it for a long time.
>> > > But you added new rule about writeback.
>> > >
>> >
>> > Yes, I'm trying to add a new rule about throttling in the page allocator
>> > and from vmscan. As you can see from the results in the leader, we are
>> > currently sleeping more than we need to.
>>
>> I can see the results about avoiding congestion_wait but can't find results
>> for the (writeback < inactive / 2) heuristic.
>>
>
> See the leader and each of the report sections entitled
> "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
> how sleep times are affected.
>
> "congest waited" are waits due to calling congestion_wait. "conditional waited"
> are those related to wait_iff_congested(). As you will see from the reports,
> sleep times are reduced overall while callers of wait_iff_congested() still
> go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
> how reclaim is behaving and indicators so far are that reclaim is not hurt
> by introducing wait_iff_congested().
I saw the result.
It shows the effectiveness of _both_ nr_bdi_congested and
(writeback < inactive/2) combined.
What I mean is the effectiveness of (writeback < inactive/2) _alone_.
If we remove the (writeback < inactive / 2) check and unconditionally
return, how does the behaviour change?
Am I misunderstanding your report in the leader?
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-13 9:48 ` Minchan Kim
@ 2010-09-13 10:07 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 10:07 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 06:48:10PM +0900, Minchan Kim wrote:
> >> > > > <SNIP>
> >> > > > I'm not saying it is. The objective is to identify a situation where
> >> > > > sleeping until the next write or congestion clears is pointless. We have
> >> > > > already identified that we are not congested so the question is "are we
> >> > > > writing a lot at the moment?". The assumption is that if there is a lot
> >> > > > of writing going on, we might as well sleep until one completes rather
> >> > > > than reclaiming more.
> >> > > >
> >> > > > This is the first effort at identifying pointless sleeps. Better ones
> >> > > > might be identified in the future but that shouldn't stop us making a
> >> > > > semi-sensible decision now.
> >> > >
> >> > > nr_bdi_congested is no problem since we have used it for a long time.
> >> > > But you added new rule about writeback.
> >> > >
> >> >
> >> > Yes, I'm trying to add a new rule about throttling in the page allocator
> >> > and from vmscan. As you can see from the results in the leader, we are
> >> > currently sleeping more than we need to.
> >>
> >> I can see the results about avoiding congestion_wait but can't find results
> >> for the (writeback < inactive / 2) heuristic.
> >>
> >
> > See the leader and each of the report sections entitled
> > "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
> > how sleep times are affected.
> >
> > "congest waited" are waits due to calling congestion_wait. "conditional waited"
> > are those related to wait_iff_congested(). As you will see from the reports,
> > sleep times are reduced overall while callers of wait_iff_congested() still
> > go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
> > how reclaim is behaving and indicators so far are that reclaim is not hurt
> > by introducing wait_iff_congested().
>
> I saw the result.
> It was a result about effectiveness _both_ nr_bdi_congested and
> (writeback < inactive/2).
> What I mean is just effectiveness (writeback < inactive/2) _alone_.
I didn't measure it because such a change would mean that wait_iff_congested()
ignores BDI congestion. On a NUMA machine, for example, a BDI could get
flooded with requests if we only checked the ratio of one zone and little
writeback happened to be taking place in that zone at the time. It did not
seem like a good idea to ignore congestion.
> If we remove (writeback < inactive / 2) check and unconditionally
> return, how does the behavior changed?
>
Based on just the workload Johannes sent, scanning and completion times both
increased without any improvement in the scanning/reclaim ratio (a bad result),
which is why this logic was introduced: to back off when some writeback is
taking place even if the BDI is not congested.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-13 10:07 ` Mel Gorman
@ 2010-09-13 10:20 ` Minchan Kim
0 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-13 10:20 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 7:07 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Sep 13, 2010 at 06:48:10PM +0900, Minchan Kim wrote:
>> >> > > > <SNIP>
>> >> > > > I'm not saying it is. The objective is to identify a situation where
>> >> > > > sleeping until the next write or congestion clears is pointless. We have
>> >> > > > already identified that we are not congested so the question is "are we
>> >> > > > writing a lot at the moment?". The assumption is that if there is a lot
>> >> > > > of writing going on, we might as well sleep until one completes rather
>> >> > > > than reclaiming more.
>> >> > > >
>> >> > > > This is the first effort at identifying pointless sleeps. Better ones
>> >> > > > might be identified in the future but that shouldn't stop us making a
>> >> > > > semi-sensible decision now.
>> >> > >
>> >> > > nr_bdi_congested is no problem since we have used it for a long time.
>> >> > > But you added new rule about writeback.
>> >> > >
>> >> >
>> >> > Yes, I'm trying to add a new rule about throttling in the page allocator
>> >> > and from vmscan. As you can see from the results in the leader, we are
>> >> > currently sleeping more than we need to.
>> >>
>> >> I can see the result about avoiding congestion_wait, but can't find the
>> >> (writeback < inactive / 2) heuristic result.
>> >>
>> >
>> > See the leader and each of the report sections entitled
>> > "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
>> > how sleep times are affected.
>> >
>> > "congest waited" are waits due to calling congestion_wait. "conditional waited"
>> > are those related to wait_iff_congested(). As you will see from the reports,
>> > sleep times are reduced overall while callers of wait_iff_congested() still
>> > go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
>> > how reclaim is behaving and indicators so far are that reclaim is not hurt
>> > by introducing wait_iff_congested().
>>
>> I saw the result.
>> It was a result about effectiveness _both_ nr_bdi_congested and
>> (writeback < inactive/2).
>> What I mean is just effectiveness (writeback < inactive/2) _alone_.
>
> I didn't measure it because such a change means that wait_iff_congested()
> ignored BDI congestion. If we were reclaiming on a NUMA machine for example,
> it could mean that a BDI gets flooded with requests if we only checked the
> ratios of one zone if little writeback was happening in that zone at the
> time. It did not seem like a good idea to ignore congestion.
You seem to have misunderstood me; sorry for the unclear sentence.
I don't mean we should ignore congestion.
First of all, we should consider BDI congestion.
What I meant is whether we need the (nr_writeback < nr_inactive / 2)
heuristic in addition to the BDI congestion check.
It wasn't in the previous version of your patch but it showed up in this
version, so I assumed you had some evidence for why such a heuristic
should be added.
>
>> If we remove (writeback < inactive / 2) check and unconditionally
>> return, how does the behavior changed?
>>
>
> Based on just the workload Johannes sent, scanning and completion times both
> increased without any improvement in the scanning/reclaim ratio (a bad result)
> hence why this logic was introduced to back off where there is some
> writeback taking place even if the BDI is not congested.
Yes, that's what I want. At the least, the function's comment should
explain this so the logic can be understood. It would also be better to
include numbers showing how well it backs off.
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback
2010-09-13 10:20 ` Minchan Kim
@ 2010-09-13 10:30 ` Mel Gorman
0 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 10:30 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 07:20:37PM +0900, Minchan Kim wrote:
> On Mon, Sep 13, 2010 at 7:07 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > On Mon, Sep 13, 2010 at 06:48:10PM +0900, Minchan Kim wrote:
> >> >> > > > <SNIP>
> >> >> > > > I'm not saying it is. The objective is to identify a situation where
> >> >> > > > sleeping until the next write or congestion clears is pointless. We have
> >> >> > > > already identified that we are not congested so the question is "are we
> >> >> > > > writing a lot at the moment?". The assumption is that if there is a lot
> >> >> > > > of writing going on, we might as well sleep until one completes rather
> >> >> > > > than reclaiming more.
> >> >> > > >
> >> >> > > > This is the first effort at identifying pointless sleeps. Better ones
> >> >> > > > might be identified in the future but that shouldn't stop us making a
> >> >> > > > semi-sensible decision now.
> >> >> > >
> >> >> > > nr_bdi_congested is no problem since we have used it for a long time.
> >> >> > > But you added new rule about writeback.
> >> >> > >
> >> >> >
> >> >> > Yes, I'm trying to add a new rule about throttling in the page allocator
> >> >> > and from vmscan. As you can see from the results in the leader, we are
> >> >> > currently sleeping more than we need to.
> >> >>
> >> >> I can see the result about avoiding congestion_wait, but can't find the
> >> >> (writeback < inactive / 2) heuristic result.
> >> >>
> >> >
> >> > See the leader and each of the report sections entitled
> >> > "FTrace Reclaim Statistics: congestion_wait". It provides a measure of
> >> > how sleep times are affected.
> >> >
> >> > "congest waited" are waits due to calling congestion_wait. "conditional waited"
> >> > are those related to wait_iff_congested(). As you will see from the reports,
> >> > sleep times are reduced overall while callers of wait_iff_congested() still
> >> > go to sleep. The reports entitled "FTrace Reclaim Statistics: vmscan" show
> >> > how reclaim is behaving and indicators so far are that reclaim is not hurt
> >> > by introducing wait_iff_congested().
> >>
> >> I saw the result.
> >> It was a result about effectiveness _both_ nr_bdi_congested and
> >> (writeback < inactive/2).
> >> What I mean is just effectiveness (writeback < inactive/2) _alone_.
> >
> > I didn't measure it because such a change means that wait_iff_congested()
> > ignored BDI congestion. If we were reclaiming on a NUMA machine for example,
> > it could mean that a BDI gets flooded with requests if we only checked the
> > ratios of one zone if little writeback was happening in that zone at the
> > time. It did not seem like a good idea to ignore congestion.
>
> You seem to have misunderstood me; sorry for the unclear sentence.
>
> I don't mean we should ignore congestion.
> First of all, we should consider BDI congestion.
> What I meant is whether we need the (nr_writeback < nr_inactive / 2)
> heuristic in addition to the BDI congestion check.
Early tests indicated "yes".
> It wasn't in the previous version of your patch but it showed up in this
> version, so I assumed you had some evidence for why such a heuristic should be added.
>
Only the feedback from the first patch where Johannes posted a workload that
did exhibit a problem. Isolated tests on just that workload led to the
(nr_writeback < inactive / 2) change.
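[Editorial note] The "scanning/reclaim ratio" these isolated tests were judged by is simply pages scanned per page reclaimed, taken from the "FTrace Reclaim Statistics: vmscan" reports. A minimal sketch (the function name and the zero-reclaim convention are illustrative, not the tracepoints' actual format):

```c
/*
 * Scanning/reclaim efficiency metric used throughout this thread:
 * pages scanned per page reclaimed.  Lower is better; 1.0 means every
 * scanned page was reclaimed.  Returns 0.0 when nothing was reclaimed,
 * as a sentinel for "ratio undefined".
 */
static double scan_reclaim_ratio(unsigned long nr_scanned,
				 unsigned long nr_reclaimed)
{
	if (nr_reclaimed == 0)
		return 0.0;
	return (double)nr_scanned / (double)nr_reclaimed;
}
```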
> >
> >> If we remove (writeback < inactive / 2) check and unconditionally
> >> return, how does the behavior changed?
> >>
> >
> > Based on just the workload Johannes sent, scanning and completion times both
> > increased without any improvement in the scanning/reclaim ratio (a bad result)
> > hence why this logic was introduced to back off where there is some
> > writeback taking place even if the BDI is not congested.
>
> Yes, that's what I want. At the least, the function's comment should
> explain this so the logic can be understood. It would also be better to
> include numbers showing how well it backs off.
>
Very well. I'll hold off posting v2 of the series for now, then, because
producing such results takes many hours and my machines are currently busy.
Hopefully I'll have something by Wednesday.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-13 13:31 ` Wu Fengguang
0 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 13:31 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
Mel,
Sorry for being late, I'm doing pretty much prework these days ;)
On Mon, Sep 06, 2010 at 06:47:32PM +0800, Mel Gorman wrote:
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back. If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> 1 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ff52b46..408c101 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> #define scanning_global_lru(sc) (1)
> #endif
>
> +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> struct scan_control *sc)
> {
> @@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> * shrink_page_list() returns the number of reclaimed pages
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> - struct scan_control *sc)
> + struct scan_control *sc,
> + unsigned long *nr_still_dirty)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> int pgactivate = 0;
> + unsigned long nr_dirty = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
> @@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
>
> if (PageDirty(page)) {
> + /*
> + * Only kswapd can writeback filesystem pages to
> + * avoid risk of stack overflow
> + */
> + if (page_is_file_cache(page) && !current_is_kswapd()) {
> + nr_dirty++;
> + goto keep_locked;
> + }
> +
> if (references == PAGEREF_RECLAIM_CLEAN)
> goto keep_locked;
> if (!may_enter_fs)
> @@ -908,6 +922,8 @@ keep_lumpy:
> free_page_list(&free_pages);
>
> list_splice(&ret_pages, page_list);
> +
> + *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
> @@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
> if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> return false;
>
> + /* If we cannot writeback, there is no point stalling */
> + if (!sc->may_writepage)
> + return false;
> +
> /* If we have relaimed everything on the isolated list, no stall */
> if (nr_freed == nr_taken)
> return false;
> @@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> struct scan_control *sc, int priority, int file)
> {
> LIST_HEAD(page_list);
> + LIST_HEAD(putback_list);
> unsigned long nr_scanned;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_taken;
> unsigned long nr_anon;
> unsigned long nr_file;
> + unsigned long nr_dirty;
>
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> spin_unlock_irq(&zone->lru_lock);
>
> - nr_reclaimed = shrink_page_list(&page_list, sc);
> + nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
>
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
It is possible to OOM if the LRU list is small and/or the storage is slow,
such that the flusher cannot clean enough pages before the LRU is fully
scanned. So we may also need to wait on dirty/writeback pages in *order-0*
direct reclaims, when the priority drops rather low (such as < 3).
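[Editorial note] One way to picture this suggestion is as an extension of should_reclaim_stall(). The sketch below is a hypothetical user-space model: the priority-below-3 threshold comes from the comment above, the other checks mirror those visible in the quoted patch, and the check ordering is simplified:

```c
#include <stdbool.h>

/* Simplified stand-in for the lumpy reclaim modes in scan_control. */
enum lumpy_mode { LUMPY_MODE_NONE, LUMPY_MODE_ASYNC, LUMPY_MODE_SYNC };

/*
 * Model of should_reclaim_stall() extended per the suggestion above:
 * besides the existing lumpy-reclaim stall, order-0 direct reclaim also
 * stalls once priority has dropped low, so a slow flusher gets a chance
 * to clean pages before the LRU is fully scanned and OOM is declared.
 */
static bool should_reclaim_stall(unsigned long nr_taken,
				 unsigned long nr_freed,
				 int priority,
				 enum lumpy_mode mode,
				 bool may_writepage)
{
	/* If we cannot write back, there is no point stalling */
	if (!may_writepage)
		return false;

	/* If we have reclaimed everything on the isolated list, no stall */
	if (nr_freed == nr_taken)
		return false;

	/* existing behaviour: lumpy reclaim always stalls here */
	if (mode != LUMPY_MODE_NONE)
		return true;

	/* suggested addition: order-0 reclaim stalls at low priority */
	return priority < 3;
}
```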
> + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> set_lumpy_reclaim_mode(priority, sc, true);
> - nr_reclaimed += shrink_page_list(&page_list, sc);
> +
> + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> + struct page *page, *tmp;
> +
> + /* Take off the clean pages marked for activation */
> + list_for_each_entry_safe(page, tmp, &page_list, lru) {
> + if (PageDirty(page) || PageWriteback(page))
> + continue;
> +
> + list_del(&page->lru);
> + list_add(&page->lru, &putback_list);
> + }
nitpick: I guess the above loop is optional code to avoid the overhead of
shrink_page_list() repeatedly going through some unfreeable pages?
Considering this is the slow code path, I'd prefer to keep the code
simple rather than do such optimizations.
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
how about
if (!laptop_mode)
wakeup_flusher_threads(nr_dirty);
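[Editorial note] A side note on this suggestion: the two forms are not equivalent. In mainline kernels of this era, passing 0 to wakeup_flusher_threads() was, if memory serves, treated as a request to write back all dirty pages (which suits laptop mode: batch all I/O once the disk has spun up), whereas the `if (!laptop_mode)` form skips the wakeup entirely. A user-space model of that difference — the names and the "0 means write everything" convention are assumptions drawn from contemporary fs-writeback behaviour:

```c
/*
 * Model of the two wakeup variants: each returns the number of pages the
 * flusher is asked to write (0 = flusher not woken at all).
 */
static unsigned long pages_requested_original(int laptop_mode,
					      unsigned long nr_dirty,
					      unsigned long total_dirty)
{
	/* wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
	 * 0 is assumed to mean "write back everything dirty". */
	return laptop_mode ? total_dirty : nr_dirty;
}

static unsigned long pages_requested_suggested(int laptop_mode,
					       unsigned long nr_dirty)
{
	/* if (!laptop_mode) wakeup_flusher_threads(nr_dirty); */
	return laptop_mode ? 0 : nr_dirty;
}
```

The two agree when laptop_mode is clear, but diverge completely when it is set.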
> + wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +
> + nr_reclaimed = shrink_page_list(&page_list, sc,
> + &nr_dirty);
> + }
> }
>
> + list_splice(&putback_list, &page_list);
> +
> local_irq_disable();
> if (current_is_kswapd())
> __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> --
> 1.7.1
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
@ 2010-09-13 13:31 ` Wu Fengguang
0 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 13:31 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
Mel,
Sorry for being late, I'm doing pretty much prework these days ;)
On Mon, Sep 06, 2010 at 06:47:32PM +0800, Mel Gorman wrote:
> When memory is under enough pressure, a process may enter direct
> reclaim to free pages in the same manner kswapd does. If a dirty page is
> encountered during the scan, this page is written to backing storage using
> mapping->writepage. This can result in very deep call stacks, particularly
> if the target storage or filesystem are complex. It has already been observed
> on XFS that the stack overflows but the problem is not XFS-specific.
>
> This patch prevents direct reclaim writing back filesystem pages by checking
> if current is kswapd or the page is anonymous before writing back. If the
> dirty pages cannot be written back, they are placed back on the LRU lists
> for either background writing by the BDI threads or kswapd. If in direct
> lumpy reclaim and dirty pages are encountered, the process will stall for
> the background flusher before trying to reclaim the pages again.
>
> As the call-chain for writing anonymous pages is not expected to be deep
> and they are not cleaned by flusher threads, anonymous pages are still
> written back in direct reclaim.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> 1 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ff52b46..408c101 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> #define scanning_global_lru(sc) (1)
> #endif
>
> +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> +#define MAX_SWAP_CLEAN_WAIT 50
> +
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> struct scan_control *sc)
> {
> @@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> * shrink_page_list() returns the number of reclaimed pages
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> - struct scan_control *sc)
> + struct scan_control *sc,
> + unsigned long *nr_still_dirty)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> int pgactivate = 0;
> + unsigned long nr_dirty = 0;
> unsigned long nr_reclaimed = 0;
>
> cond_resched();
> @@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> }
>
> if (PageDirty(page)) {
> + /*
> + * Only kswapd can writeback filesystem pages to
> + * avoid risk of stack overflow
> + */
> + if (page_is_file_cache(page) && !current_is_kswapd()) {
> + nr_dirty++;
> + goto keep_locked;
> + }
> +
> if (references == PAGEREF_RECLAIM_CLEAN)
> goto keep_locked;
> if (!may_enter_fs)
> @@ -908,6 +922,8 @@ keep_lumpy:
> free_page_list(&free_pages);
>
> list_splice(&ret_pages, page_list);
> +
> + *nr_still_dirty = nr_dirty;
> count_vm_events(PGACTIVATE, pgactivate);
> return nr_reclaimed;
> }
> @@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
> if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> return false;
>
> + /* If we cannot writeback, there is no point stalling */
> + if (!sc->may_writepage)
> + return false;
> +
> /* If we have relaimed everything on the isolated list, no stall */
> if (nr_freed == nr_taken)
> return false;
> @@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> struct scan_control *sc, int priority, int file)
> {
> LIST_HEAD(page_list);
> + LIST_HEAD(putback_list);
> unsigned long nr_scanned;
> unsigned long nr_reclaimed = 0;
> unsigned long nr_taken;
> unsigned long nr_anon;
> unsigned long nr_file;
> + unsigned long nr_dirty;
>
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> @@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> spin_unlock_irq(&zone->lru_lock);
>
> - nr_reclaimed = shrink_page_list(&page_list, sc);
> + nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
>
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
It is possible to OOM if the LRU list is small and/or the storage is slow, so
that the flusher cannot clean enough pages before the LRU is fully scanned.
So we may need do waits on dirty/writeback pages on *order-0*
direct reclaims, when priority goes rather low (such as < 3).
> + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> set_lumpy_reclaim_mode(priority, sc, true);
> - nr_reclaimed += shrink_page_list(&page_list, sc);
> +
> + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> + struct page *page, *tmp;
> +
> + /* Take off the clean pages marked for activation */
> + list_for_each_entry_safe(page, tmp, &page_list, lru) {
> + if (PageDirty(page) || PageWriteback(page))
> + continue;
> +
> + list_del(&page->lru);
> + list_add(&page->lru, &putback_list);
> + }
nitpick: I guess the above loop is optional code to avoid the overhead
of shrink_page_list() repeatedly going through some unfreeable pages?
Considering this is the slow code path, I'd prefer keeping the code
simple over doing such optimizations.
> + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
how about
if (!laptop_mode)
wakeup_flusher_threads(nr_dirty);
> + wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> +
> + nr_reclaimed = shrink_page_list(&page_list, sc,
> + &nr_dirty);
> + }
> }
>
> + list_splice(&putback_list, &page_list);
> +
> local_irq_disable();
> if (current_is_kswapd())
> __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> --
> 1.7.1
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-13 13:48 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 13:48 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> + /*
> + * If reclaim is encountering dirty pages, it may be because
> + * dirty pages are reaching the end of the LRU even though the
> + * dirty_ratio may be satisfied. In this case, wake flusher
> + * threads to pro-actively clean up to a maximum of
> + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> + * !may_writepage indicates that this is a direct reclaimer in
> + * laptop mode avoiding disk spin-ups
> + */
> + if (file && nr_dirty_seen && sc->may_writepage)
> + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
wakeup_flusher_threads() works, but does not seem to be the pertinent tool:
- locally, it needs some luck to clean the pages that direct reclaim is waiting on
- globally, it cleans up some dirty pages, but a heavy dirtier
may quickly create new ones.
So how about taking the approaches in these patches?
- "[PATCH 4/4] vmscan: transfer async file writeback to the flusher"
- "[PATCH 15/17] mm: lower soft dirty limits on memory pressure"
In particular the first patch should work very nicely with memcg, as
all pages of an inode typically belong to the same memcg. So doing
write-around helps clean lots of dirty pages in the target LRU list in
one shot.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
2010-09-13 13:31 ` Wu Fengguang
@ 2010-09-13 13:55 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 13:55 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 09:31:56PM +0800, Wu Fengguang wrote:
> Mel,
>
> Sorry for being late, I've been busy with a lot of prework these days ;)
>
No worries, I'm all over the place at the moment so cannot lecture on
response times :)
> On Mon, Sep 06, 2010 at 06:47:32PM +0800, Mel Gorman wrote:
> > When memory is under enough pressure, a process may enter direct
> > reclaim to free pages in the same manner kswapd does. If a dirty page is
> > encountered during the scan, this page is written to backing storage using
> > mapping->writepage. This can result in very deep call stacks, particularly
> > if the target storage or filesystem are complex. It has already been observed
> > on XFS that the stack overflows but the problem is not XFS-specific.
> >
> > This patch prevents direct reclaim writing back filesystem pages by checking
> > if current is kswapd or the page is anonymous before writing back. If the
> > dirty pages cannot be written back, they are placed back on the LRU lists
> > for either background writing by the BDI threads or kswapd. If in direct
> > lumpy reclaim and dirty pages are encountered, the process will stall for
> > the background flusher before trying to reclaim the pages again.
> >
> > As the call-chain for writing anonymous pages is not expected to be deep
> > and they are not cleaned by flusher threads, anonymous pages are still
> > written back in direct reclaim.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> > mm/vmscan.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++---
> > 1 files changed, 46 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ff52b46..408c101 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -145,6 +145,9 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > #define scanning_global_lru(sc) (1)
> > #endif
> >
> > +/* Direct lumpy reclaim waits up to five seconds for background cleaning */
> > +#define MAX_SWAP_CLEAN_WAIT 50
> > +
> > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > struct scan_control *sc)
> > {
> > @@ -682,11 +685,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> > * shrink_page_list() returns the number of reclaimed pages
> > */
> > static unsigned long shrink_page_list(struct list_head *page_list,
> > - struct scan_control *sc)
> > + struct scan_control *sc,
> > + unsigned long *nr_still_dirty)
> > {
> > LIST_HEAD(ret_pages);
> > LIST_HEAD(free_pages);
> > int pgactivate = 0;
> > + unsigned long nr_dirty = 0;
> > unsigned long nr_reclaimed = 0;
> >
> > cond_resched();
> > @@ -785,6 +790,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > }
> >
> > if (PageDirty(page)) {
> > + /*
> > + * Only kswapd can writeback filesystem pages to
> > + * avoid risk of stack overflow
> > + */
> > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > + nr_dirty++;
> > + goto keep_locked;
> > + }
> > +
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > goto keep_locked;
> > if (!may_enter_fs)
> > @@ -908,6 +922,8 @@ keep_lumpy:
> > free_page_list(&free_pages);
> >
> > list_splice(&ret_pages, page_list);
> > +
> > + *nr_still_dirty = nr_dirty;
> > count_vm_events(PGACTIVATE, pgactivate);
> > return nr_reclaimed;
> > }
> > @@ -1312,6 +1328,10 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
> > if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> > return false;
> >
> > + /* If we cannot writeback, there is no point stalling */
> > + if (!sc->may_writepage)
> > + return false;
> > +
> > /* If we have reclaimed everything on the isolated list, no stall */
> > if (nr_freed == nr_taken)
> > return false;
> > @@ -1339,11 +1359,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > struct scan_control *sc, int priority, int file)
> > {
> > LIST_HEAD(page_list);
> > + LIST_HEAD(putback_list);
> > unsigned long nr_scanned;
> > unsigned long nr_reclaimed = 0;
> > unsigned long nr_taken;
> > unsigned long nr_anon;
> > unsigned long nr_file;
> > + unsigned long nr_dirty;
> >
> > while (unlikely(too_many_isolated(zone, file, sc))) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > @@ -1392,14 +1414,35 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >
> > spin_unlock_irq(&zone->lru_lock);
> >
> > - nr_reclaimed = shrink_page_list(&page_list, sc);
> > + nr_reclaimed = shrink_page_list(&page_list, sc, &nr_dirty);
> >
> > /* Check if we should synchronously wait for writeback */
> > if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>
> It is possible to OOM if the LRU list is small and/or the storage is slow, so
> that the flusher cannot clean enough pages before the LRU is fully scanned.
>
To go OOM, nr_reclaimed would have to be 0 and for that, the entire list
would have to be dirty or unreclaimable. If that situation happens, is
the dirty throttling not also broken?
> So we may need to wait on dirty/writeback pages even in *order-0*
> direct reclaims, when the priority goes rather low (such as < 3).
>
If this is really necessary, the stalling could be done by removing the
check for lumpy reclaim in should_reclaim_stall(). What do you think of
the following replacement?
/*
* Returns true if the caller should wait to clean dirty/writeback pages.
*
* If we are direct reclaiming for contiguous pages and we do not reclaim
* everything in the list, try again and wait for writeback IO to complete.
* This will stall high-order allocations noticeably. Only do that when really
* need to free the pages under high memory pressure.
*
* Alternatively, if priority is getting high, it may be because there are
* too many dirty pages on the LRU. Rather than returning nr_reclaimed == 0
* and potentially causing an OOM, we stall on writeback.
*/
static inline bool should_reclaim_stall(unsigned long nr_taken,
unsigned long nr_freed,
int priority,
struct scan_control *sc)
{
int stall_priority;
/* kswapd should not stall on sync IO */
if (current_is_kswapd())
return false;
/* If we cannot writeback, there is no point stalling */
if (!sc->may_writepage)
return false;
/* If we have reclaimed everything on the isolated list, no stall */
if (nr_freed == nr_taken)
return false;
/*
* For high-order allocations, there are two stall thresholds.
* High-cost allocations stall immediately whereas lower
* order allocations such as stacks require the scanning
* priority to be much higher before stalling.
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
stall_priority = DEF_PRIORITY;
else
stall_priority = DEF_PRIORITY / 3;
return priority <= stall_priority;
}
> > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > set_lumpy_reclaim_mode(priority, sc, true);
> > - nr_reclaimed += shrink_page_list(&page_list, sc);
> > +
> > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > + struct page *page, *tmp;
> > +
>
> > + /* Take off the clean pages marked for activation */
> > + list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > + if (PageDirty(page) || PageWriteback(page))
> > + continue;
> > +
> > + list_del(&page->lru);
> > + list_add(&page->lru, &putback_list);
> > + }
>
> nitpick: I guess the above loop is optional code to avoid overheads
> of shrink_page_list() repeatedly going through some unfreeable pages?
Pretty much, if they are to be activated, there is no point trying to reclaim
them again. It's unnecessary overhead. A strong motivation for this
series is to reduce overheads in the reclaim paths and unnecessary
retrying of unfreeable pages.
> Considering this is the slow code path, I'd prefer to keep the code
> simple than to do such optimizations.
>
> > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
>
> how about
> if (!laptop_mode)
> wakeup_flusher_threads(nr_dirty);
>
It's not the same thing. wakeup_flusher_threads(0) in laptop_mode means
cleaning all pages if any need cleaning; laptop_mode flushes everything
in one go to minimise disk spin-ups.
> > + wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
> > +
> > + nr_reclaimed = shrink_page_list(&page_list, sc,
> > + &nr_dirty);
> > + }
> > }
> >
> > + list_splice(&putback_list, &page_list);
> > +
> > local_irq_disable();
> > if (current_is_kswapd())
> > __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
> > --
> > 1.7.1
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-13 13:48 ` Wu Fengguang
@ 2010-09-13 14:10 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-09-13 14:10 UTC (permalink / raw)
To: Wu Fengguang
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 09:48:45PM +0800, Wu Fengguang wrote:
> > + /*
> > + * If reclaim is encountering dirty pages, it may be because
> > + * dirty pages are reaching the end of the LRU even though the
> > + * dirty_ratio may be satisfied. In this case, wake flusher
> > + * threads to pro-actively clean up to a maximum of
> > + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > + * !may_writepage indicates that this is a direct reclaimer in
> > + * laptop mode avoiding disk spin-ups
> > + */
> > + if (file && nr_dirty_seen && sc->may_writepage)
> > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
>
> wakeup_flusher_threads() works, but does not seem to be the pertinent tool:
>
> - locally, it needs some luck to clean the pages that direct reclaim is waiting on
There is a certain amount of luck involved, but it depends on there being a
correlation between old inodes and old pages on the LRU list. As long as that
correlation holds, some relevant pages will get cleaned. Testing on
previously released versions of this patch did show that the percentage of
dirty pages encountered during reclaim was reduced as a result of this patch.
> - globally, it cleans up some dirty pages, but a heavy dirtier
> may quickly create new ones.
>
> So how about taking the approaches in these patches?
>
> - "[PATCH 4/4] vmscan: transfer async file writeback to the flusher"
> - "[PATCH 15/17] mm: lower soft dirty limits on memory pressure"
>
There is a lot going on in those patches. It's going to take me a while to
figure them out and formulate an opinion.
> In particular the first patch should work very nicely with memcg, as
> all pages of an inode typically belong to the same memcg. So doing
> write-around helps clean lots of dirty pages in the target LRU list in
> one shot.
>
It might, but as there is also a correlation between old dirty inodes and
the location of dirty pages, it is tricky to predict whether it is better
and, if so, by how much.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
2010-09-13 13:55 ` Mel Gorman
@ 2010-09-13 14:33 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 14:33 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> > > /* Check if we should synchronously wait for writeback */
> > > if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> >
> > It is possible to OOM if the LRU list is small and/or the storage is slow, so
> > that the flusher cannot clean enough pages before the LRU is fully scanned.
> >
>
> To go OOM, nr_reclaimed would have to be 0 and for that, the entire list
> would have to be dirty or unreclaimable. If that situation happens, is
> the dirty throttling not also broken?
My worry is that even if the dirty throttling limit were instantly set to 0,
it would still take time to knock down the number of dirty pages. Think
about 500MB of dirty pages waiting to be flushed to a slow USB stick.
> > So we may need to wait on dirty/writeback pages even in *order-0*
> > direct reclaims, when the priority goes rather low (such as < 3).
> >
>
> In case this is really necessary, the necessary stalling could be done by
> removing the check for lumpy reclaim in should_reclaim_stall(). What do
> you think of the following replacement?
I merely want to provide a guarantee, so it may be enough to add this:
if (nr_freed == nr_taken)
return false;
+ if (!priority)
+ return true;
This ensures the last full LRU scan will do the necessary waits to prevent
an OOM.
> /*
> * Returns true if the caller should wait to clean dirty/writeback pages.
> *
> * If we are direct reclaiming for contiguous pages and we do not reclaim
> * everything in the list, try again and wait for writeback IO to complete.
> * This will stall high-order allocations noticeably. Only do that when really
> * need to free the pages under high memory pressure.
> *
> * Alternatively, if priority is getting high, it may be because there are
> * too many dirty pages on the LRU. Rather than returning nr_reclaimed == 0
> * and potentially causing an OOM, we stall on writeback.
> */
> static inline bool should_reclaim_stall(unsigned long nr_taken,
> unsigned long nr_freed,
> int priority,
> struct scan_control *sc)
> {
> int stall_priority;
>
> /* kswapd should not stall on sync IO */
> if (current_is_kswapd())
> return false;
>
> /* If we cannot writeback, there is no point stalling */
> if (!sc->may_writepage)
> return false;
>
> /* If we have reclaimed everything on the isolated list, no stall */
> if (nr_freed == nr_taken)
> return false;
>
> /*
> * For high-order allocations, there are two stall thresholds.
> * High-cost allocations stall immediately whereas lower
> * order allocations such as stacks require the scanning
> * priority to be much higher before stalling.
> */
> if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> stall_priority = DEF_PRIORITY;
> else
> stall_priority = DEF_PRIORITY / 3;
>
> return priority <= stall_priority;
> }
>
>
> > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > > set_lumpy_reclaim_mode(priority, sc, true);
> > > - nr_reclaimed += shrink_page_list(&page_list, sc);
> > > +
> > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > + struct page *page, *tmp;
> > > +
> >
> > > + /* Take off the clean pages marked for activation */
> > > + list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > > + if (PageDirty(page) || PageWriteback(page))
> > > + continue;
> > > +
> > > + list_del(&page->lru);
> > > + list_add(&page->lru, &putback_list);
> > > + }
> >
> > nitpick: I guess the above loop is optional code to avoid overheads
> > of shrink_page_list() repeatedly going through some unfreeable pages?
>
> Pretty much, if they are to be activated, there is no point trying to reclaim
> them again. It's unnecessary overhead. A strong motivation for this
> series is to reduce overheads in the reclaim paths and unnecessary
> retrying of unfreeable pages.
We do so much waits in this loop, so that users will get upset by the
iowait stalls much much more than the CPU overheads.. best option is
always to avoid entering this loop in the first place, and if we
succeeded on that, these lines of optimizations will be nothing but
mind destroyers for newbie developers.
> > Considering this is the slow code path, I'd prefer to keep the code
> > simple than to do such optimizations.
> >
> > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> >
> > how about
> > if (!laptop_mode)
> > wakeup_flusher_threads(nr_dirty);
> >
>
> It's not the same thing. wakeup_flusher_threads(0) in laptop_mode is to
> clean all pages if some need dirtying. laptop_mode cleans all pages to
> minimise disk spinups.
Ah.. that's sure fine. I wonder if the flusher could be more smart to
automatically extend the number of pages to write in laptop mode. This
could simplify some callers.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
@ 2010-09-13 14:33 ` Wu Fengguang
0 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 14:33 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> > > /* Check if we should synchronously wait for writeback */
> > > if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> >
> > It is possible to OOM if the LRU list is small and/or the storage is slow, so
> > that the flusher cannot clean enough pages before the LRU is fully scanned.
> >
>
> To go OOM, nr_reclaimed would have to be 0 and for that, the entire list
> would have to be dirty or unreclaimable. If that situation happens, is
> the dirty throttling not also broken?
My worry is, even if the dirty throttling limit is instantly set to 0,
it may still take time to knock down the number of dirty pages. Think
about 500MB dirty pages waiting to be flushed to a slow USB stick.
> > So we may need to do waits on dirty/writeback pages on *order-0*
> > direct reclaims, when priority goes rather low (such as < 3).
> >
>
> In case this is really necessary, the stalling could be done by
> removing the check for lumpy reclaim in should_reclaim_stall(). What do
> you think of the following replacement?
I merely want to provide a guarantee, so it may be enough to add this:
if (nr_freed == nr_taken)
return false;
+ if (!priority)
+ return true;
This ensures the last full LRU scan will do necessary waits to prevent
the OOM.
> /*
> * Returns true if the caller should wait to clean dirty/writeback pages.
> *
> * If we are direct reclaiming for contiguous pages and we do not reclaim
> * everything in the list, try again and wait for writeback IO to complete.
> * This will stall high-order allocations noticeably. Only do that when we
> * really need to free the pages under high memory pressure.
> *
> * Alternatively, if priority is getting high, it may be because there are
> * too many dirty pages on the LRU. Rather than returning nr_reclaimed == 0
> * and potentially causing an OOM, we stall on writeback.
> */
> static inline bool should_reclaim_stall(unsigned long nr_taken,
> unsigned long nr_freed,
> int priority,
> struct scan_control *sc)
> {
> int stall_priority;
>
> /* kswapd should not stall on sync IO */
> if (current_is_kswapd())
> return false;
>
> /* If we cannot writeback, there is no point stalling */
> if (!sc->may_writepage)
> return false;
>
> /* If we have reclaimed everything on the isolated list, no stall */
> if (nr_freed == nr_taken)
> return false;
>
> /*
> * For high-order allocations, there are two stall thresholds.
> * High-cost allocations stall immediately whereas lower
> * order allocations such as stacks require the scanning
> * priority to be much higher before stalling.
> */
> if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> stall_priority = DEF_PRIORITY;
> else
> stall_priority = DEF_PRIORITY / 3;
>
> return priority <= stall_priority;
> }
>
>
> > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT;
> > > set_lumpy_reclaim_mode(priority, sc, true);
> > > - nr_reclaimed += shrink_page_list(&page_list, sc);
> > > +
> > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
> > > + struct page *page, *tmp;
> > > +
> >
> > > + /* Take off the clean pages marked for activation */
> > > + list_for_each_entry_safe(page, tmp, &page_list, lru) {
> > > + if (PageDirty(page) || PageWriteback(page))
> > > + continue;
> > > +
> > > + list_del(&page->lru);
> > > + list_add(&page->lru, &putback_list);
> > > + }
> >
> > nitpick: I guess the above loop is optional code to avoid overheads
> > of shrink_page_list() repeatedly going through some unfreeable pages?
>
> Pretty much; if they are to be activated, there is no point trying to reclaim
> them again. It's unnecessary overhead. A strong motivation for this
> series is to reduce overheads in the reclaim paths and unnecessary
> retrying of unfreeable pages.
We do so many waits in this loop that users will be upset by the
iowait stalls far more than by the CPU overheads. The best option is
always to avoid entering this loop in the first place, and if we
succeed at that, these lines of optimization will be nothing but
mind destroyers for newbie developers.
> > Considering this is the slow code path, I'd prefer to keep the code
> > simple than to do such optimizations.
> >
> > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
> >
> > how about
> > if (!laptop_mode)
> > wakeup_flusher_threads(nr_dirty);
> >
>
> It's not the same thing. wakeup_flusher_threads(0) in laptop_mode is to
> clean all pages if any need cleaning. laptop_mode cleans all pages to
> minimise disk spinups.
Ah, that's fine then. I wonder if the flusher could be smart enough to
automatically extend the number of pages to write in laptop mode. That
could simplify some callers.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages
2010-09-13 14:10 ` Mel Gorman
@ 2010-09-13 14:41 ` Wu Fengguang
-1 siblings, 0 replies; 133+ messages in thread
From: Wu Fengguang @ 2010-09-13 14:41 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 13, 2010 at 10:10:46PM +0800, Mel Gorman wrote:
> On Mon, Sep 13, 2010 at 09:48:45PM +0800, Wu Fengguang wrote:
> > > + /*
> > > + * If reclaim is encountering dirty pages, it may be because
> > > + * dirty pages are reaching the end of the LRU even though the
> > > + * dirty_ratio may be satisfied. In this case, wake flusher
> > > + * threads to pro-actively clean up to a maximum of
> > > + * 4 * SWAP_CLUSTER_MAX amount of data (usually 1/2MB) unless
> > > + * !may_writepage indicates that this is a direct reclaimer in
> > > + * laptop mode avoiding disk spin-ups
> > > + */
> > > + if (file && nr_dirty_seen && sc->may_writepage)
> > > + wakeup_flusher_threads(nr_writeback_pages(nr_dirty));
> >
> > wakeup_flusher_threads() works, but it does not seem to be the pertinent tool.
> >
> > - locally, it needs some luck to clean the pages that direct reclaim is waiting on
>
> There is a certain amount of luck involved, but it depends on there being a
> correlation between old inodes and old pages on the LRU list. As long as that
> correlation is accurate, some relevant pages will get cleaned. Testing on
> previously released versions of this patch did show that the percentage of
> dirty pages encountered during reclaim were reduced as a result of this patch.
Yup.
> > - globally, it cleans up some dirty pages, however some heavy dirtier
> > may quickly create new ones..
> >
> > So how about taking the approaches in these patches?
> >
> > - "[PATCH 4/4] vmscan: transfer async file writeback to the flusher"
> > - "[PATCH 15/17] mm: lower soft dirty limits on memory pressure"
> >
>
> There is a lot going on in those patches. It's going to take me a while to
> figure them out and formulate an opinion.
OK. I also need some time off for other work :)
> > In particular the first patch should work very nicely with memcg, as
> > all pages of an inode typically belong to the same memcg. So doing
> > write-around helps clean lots of dirty pages in the target LRU list in
> > one shot.
> >
>
> It might, but as there is also a correlation between old dirty inodes and
> the location of dirty pages, it is tricky to predict if it is better and
> if so, by how much.
It at least guarantees to clean the one page pageout() is running into :)
Others will depend on the locality/sequentiality of the workload. But
as the write-around pages are in the same LRU lists, the vmscan code
will hit them sooner or later.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1
2010-09-06 10:47 ` Mel Gorman
@ 2010-09-13 23:10 ` Minchan Kim
-1 siblings, 0 replies; 133+ messages in thread
From: Minchan Kim @ 2010-09-13 23:10 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Mon, Sep 6, 2010 at 7:47 PM, Mel Gorman <mel@csn.ul.ie> wrote:
<snip>
>
> These are just the raw figures taken from /proc/vmstat. It's a rough measure
> of reclaim activity. Note that allocstall counts are higher because we
> are entering direct reclaim more often as a result of not sleeping in
> congestion. In itself, it's not necessarily a bad thing. It's easier to
> get a view of what happened from the vmscan tracepoint report.
>
> FTrace Reclaim Statistics: vmscan
> traceonly-v1r5 nocongest-v1r5 lowlumpy-v1r5 nodirect-v1r5
> Direct reclaims 152 941 967 729
> Direct reclaim pages scanned 507377 1404350 1332420 1450213
> Direct reclaim pages reclaimed 10968 72042 77186 41097
> Direct reclaim write file async I/O 0 0 0 0
> Direct reclaim write anon async I/O 0 0 0 0
> Direct reclaim write file sync I/O 0 0 0 0
> Direct reclaim write anon sync I/O 0 0 0 0
> Wake kswapd requests 127195 241025 254825 188846
> Kswapd wakeups 6 1 1 1
> Kswapd pages scanned 4210101 3345122 3427915 3306356
> Kswapd pages reclaimed 2228073 2165721 2143876 2194611
> Kswapd reclaim write file async I/O 0 0 0 0
> Kswapd reclaim write anon async I/O 0 0 0 0
> Kswapd reclaim write file sync I/O 0 0 0 0
> Kswapd reclaim write anon sync I/O 0 0 0 0
> Time stalled direct reclaim (seconds) 7.60 3.03 3.24 3.43
> Time kswapd awake (seconds) 12.46 9.46 9.56 9.40
>
> Total pages scanned 4717478 4749472 4760335 4756569
> Total pages reclaimed 2239041 2237763 2221062 2235708
> %age total pages scanned/reclaimed 47.46% 47.12% 46.66% 47.00%
> %age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
> %age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
> Percentage Time Spent Direct Reclaim 43.80% 21.38% 22.34% 23.46%
> Percentage Time kswapd Awake 79.92% 79.56% 79.20% 80.48%
There is a nitpick about the stalled reclaim time. For example, in direct
reclaim:
===
trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
gfp_mask);
nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
===
In this case, isn't this an accumulated value? My point is as follows:
Process A                          Process B
direct reclaim begin
  do_try_to_free_pages
    cond_resched
                                   direct reclaim begin
                                     do_try_to_free_pages
                                   direct reclaim end
direct reclaim end
So A's result includes B's time, and the total stall time would be bigger than the real value.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page()
2010-09-13 9:14 ` Mel Gorman
@ 2010-09-14 10:14 ` KOSAKI Motohiro
-1 siblings, 0 replies; 133+ messages in thread
From: KOSAKI Motohiro @ 2010-09-14 10:14 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, KAMEZAWA Hiroyuki, linux-mm, linux-fsdevel,
Linux Kernel List, Rik van Riel, Johannes Weiner, Minchan Kim,
Wu Fengguang, Andrea Arcangeli, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
> > example,
> >
> > __do_fault()
> > {
> > (snip)
> > if (unlikely(!(ret & VM_FAULT_LOCKED)))
> > lock_page(vmf.page);
> > else
> > VM_BUG_ON(!PageLocked(vmf.page));
> >
> > /*
> > * Should we do an early C-O-W break?
> > */
> > page = vmf.page;
> > if (flags & FAULT_FLAG_WRITE) {
> > if (!(vma->vm_flags & VM_SHARED)) {
> > anon = 1;
> > if (unlikely(anon_vma_prepare(vma))) {
> > ret = VM_FAULT_OOM;
> > goto out;
> > }
> > page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
> > vma, address);
> >
>
> Correct, this is a problem. I already had dropped the patch but thanks for
> pointing out a deadlock because I was missing this case. Nothing stops the
> page being faulted being sent to shrink_page_list() when alloc_page_vma()
> is called. The deadlock might be hard to hit, but it's there.
Yup, unfortunately.
> > Afaik, detailed rule is,
> >
> > o kswapd can call lock_page() because it never takes a page lock outside vmscan
>
> lock_page_nosync as you point out in your next mail. While it can call
> it, kswapd shouldn't because normally it avoids stalls but it would not
> deadlock as a result of calling it.
Agreed.
> > o if trylock_page() succeeds, we can call lock_page_nosync() against the page after unlock,
> > because the task is then guaranteed to hold no page lock.
> > o otherwise, a direct reclaimer can't call lock_page(); the task may hold a lock already.
> >
>
> I think the safer bet is simply to say "direct reclaimers should not
> call lock_page() because the fault path could be holding a lock on that
> page already".
Yup, agreed.
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
2010-09-06 10:47 ` Mel Gorman
@ 2010-10-28 21:50 ` Christoph Hellwig
-1 siblings, 0 replies; 133+ messages in thread
From: Christoph Hellwig @ 2010-10-28 21:50 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
Looks like this once again didn't get merged for 2.6.37. Any reason
for that?
^ permalink raw reply [flat|nested] 133+ messages in thread
* Re: [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim
2010-10-28 21:50 ` Christoph Hellwig
@ 2010-10-29 10:26 ` Mel Gorman
-1 siblings, 0 replies; 133+ messages in thread
From: Mel Gorman @ 2010-10-29 10:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-mm, linux-fsdevel, Linux Kernel List, Rik van Riel,
Johannes Weiner, Minchan Kim, Wu Fengguang, Andrea Arcangeli,
KAMEZAWA Hiroyuki, KOSAKI Motohiro, Dave Chinner, Chris Mason,
Christoph Hellwig, Andrew Morton
On Thu, Oct 28, 2010 at 05:50:46PM -0400, Christoph Hellwig wrote:
> Looks like this once again didn't get merged for 2.6.37. Any reason
> for that?
>
There are still concerns as to whether this is a good idea or whether we
are papering over the fact that there are too many dirty pages at the end
of the LRU. The tracepoints necessary to track the dirty pages encountered
went in this cycle as well as some writeback and congestion-waiting changes.
I was waiting for some of the writeback churn to die down before
revisiting this. The ideal point to reach is "we hardly ever encounter
dirty pages so disabling direct writeback has no impact".
--
Mel Gorman
Part-time PhD Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 133+ messages in thread
end of thread, other threads:[~2010-10-29 10:27 UTC | newest]
Thread overview: 133+ messages
2010-09-06 10:47 [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1 Mel Gorman
2010-09-06 10:47 ` [PATCH 01/10] tracing, vmscan: Add trace events for LRU list shrinking Mel Gorman
2010-09-06 10:47 ` [PATCH 02/10] writeback: Account for time spent congestion_waited Mel Gorman
2010-09-06 10:47 ` [PATCH 03/10] writeback: Do not congestion sleep if there are no congested BDIs or significant writeback Mel Gorman
2010-09-07 15:25 ` Minchan Kim
2010-09-08 11:04 ` Mel Gorman
2010-09-08 14:52 ` Minchan Kim
2010-09-09 8:54 ` Mel Gorman
2010-09-12 15:37 ` Minchan Kim
2010-09-13 8:55 ` Mel Gorman
2010-09-13 9:48 ` Minchan Kim
2010-09-13 10:07 ` Mel Gorman
2010-09-13 10:20 ` Minchan Kim
2010-09-13 10:30 ` Mel Gorman
2010-09-08 21:23 ` Andrew Morton
2010-09-09 10:43 ` Mel Gorman
2010-09-09 3:02 ` KAMEZAWA Hiroyuki
2010-09-09 8:58 ` Mel Gorman
2010-09-06 10:47 ` [PATCH 04/10] vmscan: Synchronous lumpy reclaim should not call congestion_wait() Mel Gorman
2010-09-07 15:26 ` Minchan Kim
2010-09-08 6:15 ` Johannes Weiner
2010-09-08 11:25 ` Wu Fengguang
2010-09-09 3:03 ` KAMEZAWA Hiroyuki
2010-09-06 10:47 ` [PATCH 05/10] vmscan: Synchrounous lumpy reclaim use lock_page() instead trylock_page() Mel Gorman
2010-09-07 15:28 ` Minchan Kim
2010-09-08 6:16 ` Johannes Weiner
2010-09-08 11:28 ` Wu Fengguang
2010-09-09 3:04 ` KAMEZAWA Hiroyuki
2010-09-09 3:15 ` KAMEZAWA Hiroyuki
2010-09-09 3:25 ` Wu Fengguang
2010-09-09 4:13 ` KOSAKI Motohiro
2010-09-09 9:22 ` Mel Gorman
2010-09-10 10:25 ` KOSAKI Motohiro
2010-09-10 10:33 ` KOSAKI Motohiro
2010-09-13 9:14 ` Mel Gorman
2010-09-14 10:14 ` KOSAKI Motohiro
2010-09-06 10:47 ` [PATCH 06/10] vmscan: Narrow the scenarios lumpy reclaim uses synchrounous reclaim Mel Gorman
2010-09-09 3:14 ` KAMEZAWA Hiroyuki
2010-09-06 10:47 ` [PATCH 07/10] vmscan: Remove dead code in shrink_inactive_list() Mel Gorman
2010-09-07 15:33 ` Minchan Kim
2010-09-06 10:47 ` [PATCH 08/10] vmscan: isolated_lru_pages() stop neighbour search if neighbour cannot be isolated Mel Gorman
2010-09-07 15:37 ` Minchan Kim
2010-09-08 11:12 ` Mel Gorman
2010-09-08 14:58 ` Minchan Kim
2010-09-08 11:37 ` Wu Fengguang
2010-09-08 12:50 ` Mel Gorman
2010-09-08 13:14 ` Wu Fengguang
2010-09-08 13:27 ` Mel Gorman
2010-09-09 3:17 ` KAMEZAWA Hiroyuki
2010-09-06 10:47 ` [PATCH 09/10] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-09-13 13:31 ` Wu Fengguang
2010-09-13 13:55 ` Mel Gorman
2010-09-13 14:33 ` Wu Fengguang
2010-10-28 21:50 ` Christoph Hellwig
2010-10-29 10:26 ` Mel Gorman
2010-09-06 10:47 ` [PATCH 10/10] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-09-09 3:22 ` KAMEZAWA Hiroyuki
2010-09-09 9:32 ` Mel Gorman
2010-09-13 0:53 ` KAMEZAWA Hiroyuki
2010-09-13 13:48 ` Wu Fengguang
2010-09-13 14:10 ` Mel Gorman
2010-09-13 14:41 ` Wu Fengguang
2010-09-06 10:49 ` [PATCH 0/9] Reduce latencies and improve overall reclaim efficiency v1 Mel Gorman
2010-09-08 3:14 ` KOSAKI Motohiro
2010-09-08 8:38 ` Mel Gorman
2010-09-13 23:10 ` Minchan Kim