* [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
@ 2011-07-13 14:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
(Revisiting this from a year ago and following on from the thread
"Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
clustering". Posting a prototype to see if anything obvious is
being missed.)
Testing from the XFS folk revealed that there is still too much
I/O from the end of the LRU in kswapd. Previously it was considered
acceptable by VM people for a small number of pages to be written
back from reclaim with testing generally showing about 0.3% of pages
reclaimed were written back (higher if memory was really low). The
claim that writing back a small number of pages is ok has been heavily
disputed for quite some time, and Dave Chinner explained it well;
It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig
xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request entirely
I think ext3 just writes back the page but I didn't double check.
Either way, each filesystem will have different performance
characteristics when under memory pressure and there are a lot of
dirty pages.
The objective of this series is for memory reclaim to play nicely
with writeback that is already in progress and to throttle reclaimers
appropriately when dirty pages are encountered. The assumption is that
the flushers will always write pages faster than reclaim can by
issuing the IO itself. The problem is that reclaim has very little
control over how long it takes before a page in a particular zone or
container is cleaned. This is a serious problem, but as the behaviour
of ->writepage is filesystem-dependent, we are already faced with a
situation where reclaim has poor control over page cleaning.
A secondary goal is to avoid the problem whereby direct reclaim
splices two potentially deep call stacks together.
Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written back.
Patch 2 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.
Patch 3 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.
Patch 4 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them.
Patch 5 tries to prioritise inodes backing dirty pages found at the end
of the LRU.
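The combined effect of patches 1-3 can be sketched as a single decision
for each dirty page found at the end of the LRU. This is a toy model of
mine, not the kernel code: the enum, the function name and the priority
threshold are all illustrative. The only kernel fact assumed is that in
mm/vmscan.c the scan priority counts down from DEF_PRIORITY (12) as
memory pressure rises.

```c
#include <stdbool.h>

/* Illustrative model of the writeback policy the series moves reclaim
 * towards; the names and the threshold are mine, not mm/vmscan.c. */
enum reclaim_action { WRITE_PAGE, SKIP_PAGE, THROTTLE };

#define DEF_PRIORITY 12	/* priority counts down as pressure rises */

static enum reclaim_action dirty_page_action(bool file_backed, bool is_kswapd,
					     int priority, bool congested)
{
	if (!file_backed)
		return WRITE_PAGE;	/* anon pages may still be swapped out */
	if (!is_kswapd)
		return SKIP_PAGE;	/* patch 1: no filesystem writeback
					   from direct reclaim */
	if (priority <= DEF_PRIORITY / 2)
		return WRITE_PAGE;	/* patch 2: kswapd is in trouble */
	return congested ? THROTTLE	/* patch 3: wait for the flushers */
			 : SKIP_PAGE;
}
```

Patches 4 and 5 then act on the pages this sketch skips: invalidate
them so they are freed quickly once cleaned, and nudge the flusher
towards the inodes backing them.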
This is a prototype only and it's probable that I forgot or omitted
some issue brought up over the last year and a bit. I have not thought
about how this affects memcg and I have some concerns about patches
4 and 5. Patch 4 may reclaim too many pages as a reclaimer will skip
the dirty page, reclaim a clean page and later the dirty page gets
reclaimed anyway when writeback completes. I don't think it matters
but it's worth thinking about. Patch 5 is potentially a problem
because move_expired_inodes() is now walking the full delayed_queue
list. Is that a problem? I also have not double checked that it's safe
to add I_DIRTY_RECLAIM or that the locking is correct. Basically,
patch 5 is a quick hack to see if the approach is worthwhile and may
be rendered unnecessary by the work of Wu Fengguang or Jan Kara.
I consider this series to be orthogonal to the writeback work going
on at the moment so shout if that assumption is in error.
I tested this on ext3, ext4, btrfs and xfs using fs_mark and a micro
benchmark that does a streaming write to a large mapping (exercises
use-once LRU logic). The command line for fs_mark looked something like
./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
this triggers the worst behaviour.
6 kernels are tested.
vanilla 3.0-rc6
nodirectwb-v1r3 patch 1
lesskswapdwb-v1r3 patches 1-2
throttle-v1r10 patches 1-3
immediate-v1r10 patches 1-4
prioinode-v1r10 patches 1-5
During testing, a number of monitors were running to gather information
from ftrace in particular. This disrupts the results of course because
recording the information generates IO itself, but I'm ignoring
that for the moment so the effect of the patches can be seen.
I've posted the raw reports for each filesystem at
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html
As it was Dave and Christoph that brought this back up, here is the
XFS report in a bit more detail;
FS-Mark
fsmark-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Files/s min 5.30 ( 0.00%) 5.10 (-3.92%) 5.40 ( 1.85%) 5.70 ( 7.02%) 5.80 ( 8.62%) 5.70 ( 7.02%)
Files/s mean 6.93 ( 0.00%) 6.96 ( 0.40%) 7.11 ( 2.53%) 7.52 ( 7.82%) 7.44 ( 6.83%) 7.48 ( 7.38%)
Files/s stddev 0.89 ( 0.00%) 0.99 (10.62%) 0.85 (-4.18%) 1.02 (13.23%) 1.08 (18.06%) 1.00 (10.72%)
Files/s max 8.10 ( 0.00%) 8.60 ( 5.81%) 8.20 ( 1.22%) 9.50 (14.74%) 9.00 (10.00%) 9.10 (10.99%)
Overhead min 6623.00 ( 0.00%) 6417.00 ( 3.21%) 6035.00 ( 9.74%) 6354.00 ( 4.23%) 6213.00 ( 6.60%) 6491.00 ( 2.03%)
Overhead mean 29678.24 ( 0.00%) 40053.96 (-25.90%) 18278.56 (62.37%) 16365.20 (81.35%) 11987.40 (147.58%) 15606.36 (90.17%)
Overhead stddev 68727.49 ( 0.00%) 116258.18 (-40.88%) 34121.42 (101.42%) 28963.27 (137.29%) 17221.33 (299.08%) 26231.50 (162.00%)
Overhead max 339993.00 ( 0.00%) 588147.00 (-42.19%) 148281.00 (129.29%) 140568.00 (141.87%) 77836.00 (336.81%) 124728.00 (172.59%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 34.97 35.31 31.16 30.47 29.85 29.66
Total Elapsed Time (seconds) 567.08 566.84 551.75 525.81 534.91 526.32
Average files per second is increased by a nice percentage, albeit
just within the standard deviation. Considering the type of test this
is, variability was inevitable, but I will double check without
monitoring.
The overhead (time spent in non-filesystem-related activities) is
reduced a *lot* and is a lot less variable. Time to completion is
improved across the board, which is always good because it implies
that IO was consistently higher; this is sort of visible 4 minutes
into the test at
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-sandy.png
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-smooth-sandy.png
kswapd CPU usage is also interesting
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/kswapdcpu-comparison-smooth-sandy.png
Note how preventing kswapd from reclaiming dirty pages pushes up its
CPU usage as it scans more pages, but the throttle brings it back down
and it is reduced further by patches 4 and 5.
MMTests Statistics: vmstat
Page Ins 189840 196608 189864 128120 126148 151888
Page Outs 38439897 38420872 38422937 38395008 38367766 38396612
Swap Ins 19468 20555 20024 4933 3799 4588
Swap Outs 10019 10388 10353 4737 3617 4084
Direct pages scanned 4865170 4903030 1359813 408460 101716 199483
Kswapd pages scanned 8202014 8146467 16980235 19428420 14269907 14103872
Kswapd pages reclaimed 4700400 4665093 8205753 9143997 9449722 9358347
Direct pages reclaimed 4864514 4901411 1359368 407711 100520 198323
Kswapd efficiency 57% 57% 48% 47% 66% 66%
Kswapd velocity 14463.592 14371.722 30775.233 36949.506 26677.211 26797.142
Direct efficiency 99% 99% 99% 99% 98% 99%
Direct velocity 8579.336 8649.760 2464.546 776.821 190.155 379.015
Percentage direct scans 37% 37% 7% 2% 0% 1%
Page writes by reclaim 14511 14721 10387 4819 3617 4084
Page writes skipped 0 30 2300502 2774735 0 0
Page reclaim invalidate 0 0 0 0 5155 3509
Page reclaim throttled 0 0 0 65112 190 190
Slabs scanned 16512 17920 18048 17536 16640 17408
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 5180 5318 5177 5178 5179 5193
Kswapd skipped wait 131 0 4 44 0 0
Compaction stalls 2 2 0 0 5 1
Compaction success 2 2 0 0 2 1
Compaction failures 0 0 0 0 3 0
Compaction pages moved 0 0 0 0 1049 0
Compaction move failure 0 0 0 0 96 0
These stats are based on information from /proc/vmstat.
"Kswapd efficiency" is the percentage of pages reclaimed to pages
scanned. The higher the percentage is the better because a low
percentage implies that kswapd is scanning uselessly. As the workload
dirties memory heavily and is a small machine, the efficiency starts
low at 57% but increases to 66% with all the patches applied.
"Kswapd velocity" is the average number of pages scanned per
second. The patches increase this as it's no longer getting blocked
on page writes so it's expected.
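For clarity, the two derived figures can be recomputed from the raw
/proc/vmstat counters quoted above. These helpers are a minimal sketch
of mine (the names are not MMTests functions), assuming efficiency is
truncated to a whole percentage as in the report:

```c
/* "Kswapd efficiency": pages reclaimed as a percentage of pages
 * scanned, truncated as in the report. */
static int efficiency_pct(long reclaimed, long scanned)
{
	return (int)(reclaimed * 100 / scanned);
}

/* "Kswapd velocity": pages scanned per second of elapsed test time. */
static double velocity(long scanned, double elapsed_secs)
{
	return scanned / elapsed_secs;
}
```

For the vanilla column, efficiency_pct(4700400, 8202014) gives 57 and
velocity(8202014, 567.08) gives roughly 14463.592, matching the table.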
Direct reclaim work is significantly reduced going from 37% of all
pages scanned to 1% with all patches applied. This implies that
processes are getting stalled less.
Page writes by reclaim is what is motivating this series. It goes
from 14511 pages to 4084 which is a big improvement. We'll see later
if these were anonymous or file-backed pages.
"Page writes skipped" are dirty pages encountered at the end of the
LRU and only exists for patches 2, 3 and 4. It shows that kswapd is
encountering very large numbers of dirty pages (debugging showed they
weren't under writeback). The number of pages that get invalidated and
freed later is a more reasonable number and "page reclaim throttled"
shows that throttling is not a major problem.
FTrace Reclaim Statistics: vmscan
fsmark-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Direct reclaims 89145 89785 24921 7546 1954 3747
Direct reclaim pages scanned 4865170 4903030 1359813 408460 101716 199483
Direct reclaim pages reclaimed 4864514 4901411 1359368 407711 100520 198323
Direct reclaim write file async I/O 0 0 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 3 1 0
Direct reclaim write file sync I/O 0 0 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0 0 0
Wake kswapd requests 11152 11021 21223 24029 26797 26672
Kswapd wakeups 421 397 761 778 776 742
Kswapd pages scanned 8202014 8146467 16980235 19428420 14269907 14103872
Kswapd pages reclaimed 4700400 4665093 8205753 9143997 9449722 9358347
Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
Kswapd reclaim write anon async I/O 10027 10435 10387 4815 3616 4084
Kswapd reclaim write file sync I/O 0 0 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0 0 0
Time stalled direct reclaim (seconds) 0.26 0.25 0.08 0.05 0.04 0.08
Time kswapd awake (seconds) 493.26 494.05 430.09 420.52 428.55 428.81
Total pages scanned 13067184 13049497 18340048 19836880 14371623 14303355
Total pages reclaimed 9564914 9566504 9565121 9551708 9550242 9556670
%age total pages scanned/reclaimed 73.20% 73.31% 52.15% 48.15% 66.45% 66.81%
%age total pages scanned/written 0.11% 0.11% 0.06% 0.02% 0.03% 0.03%
%age file pages scanned/written 0.03% 0.03% 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 0.74% 0.70% 0.26% 0.16% 0.13% 0.27%
Percentage Time kswapd Awake 86.98% 87.16% 77.95% 79.98% 80.12% 81.47%
This is based on information from the vmscan tracepoints introduced
the last time this issue came up.
Direct reclaim writes were never a problem according to this.
kswapd writes of file-backed pages on the other hand went from 4483 to
0, which is nice and part of the objective after all. The 4084 page
writes recorded in /proc/vmstat with all patches applied were clearly
due to writing anonymous pages as there is a direct correlation there.
Time spent in direct reclaim is reduced quite a bit, as is the
time kswapd spent awake.
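As a sanity check, the derived percentages in the ftrace table can be
recomputed from the raw counters. This trivial helper is mine, not an
MMTests function:

```c
/* Recompute a table percentage from two raw counters. */
static double pct(double part, double whole)
{
	return part * 100.0 / whole;
}
```

For the vanilla column, pct(9564914, 13067184) is about 73.20 ("%age
total pages scanned/reclaimed") and pct(4483 + 10027, 13067184) about
0.11 ("%age total pages scanned/written"), matching the table.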
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0 0 0
Direct number conditional waited 0 1 0 56 8 0
Direct time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0 0 0
KSwapd number congest waited 4 0 1 0 6 0
KSwapd time congest waited 400ms 0ms 100ms 0ms 501ms 0ms
KSwapd full congest waited 4 0 1 0 5 0
KSwapd number conditional waited 0 0 0 65056 189 190
KSwapd time conditional waited 0ms 0ms 0ms 1ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0 0 0
This is based on some of the writeback tracepoints. It's interesting
to note that while kswapd got throttled 190 times with all patches
applied, it spent negligible time asleep so probably just called
cond_resched(). This implies that neither the zone nor the backing
device was congested. As there is only one source of IO, this is
expected. With multiple processes, this picture might change.
MICRO
micro-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 6.95 7.2 6.84 6.33 5.97 6.13
Total Elapsed Time (seconds) 56.34 65.04 66.53 63.24 52.48 63.00
This is a test that just writes to a large mapping. Unfortunately, the
time to completion is increased by the series. Again, I'll have to run
without any monitoring to confirm it's a problem.
MMTests Statistics: vmstat
Page Ins 46928 50660 48504 42888 42648 43036
Page Outs 4990816 4994987 4987572 4999242 4981324 4990627
Swap Ins 2573 3234 2470 1396 1352 1297
Swap Outs 2316 2578 2360 937 912 873
Direct pages scanned 1834430 2016994 1623675 1843754 1922668 1941916
Kswapd pages scanned 1399007 1272637 1842874 1810867 1425366 1426536
Kswapd pages reclaimed 637708 657418 860512 884531 906608 927206
Direct pages reclaimed 536567 517876 314115 289472 272265 252361
Kswapd efficiency 45% 51% 46% 48% 63% 64%
Kswapd velocity 24831.505 19566.990 27699.895 28634.836 27160.175 22643.429
Direct efficiency 29% 25% 19% 15% 14% 12%
Direct velocity 32559.993 31011.593 24405.156 29154.870 36636.204 30824.063
Percentage direct scans 56% 61% 46% 50% 57% 57%
Page writes by reclaim 2706 2910 2416 969 912 873
Page writes skipped 0 12640 148339 166844 0 0
Page reclaim invalidate 0 0 0 0 12 58
Page reclaim throttled 0 0 0 4788 7 9
Slabs scanned 4096 5248 5120 6656 4480 16768
Direct inode steals 531 1189 348 1166 700 3783
Kswapd inode steals 164 0 349 0 0 9
Kswapd skipped wait 78 35 74 51 14 10
Compaction stalls 0 0 1 0 0 0
Compaction success 0 0 1 0 0 0
Compaction failures 0 0 0 0 0 0
Compaction pages moved 0 0 0 0 0 0
Compaction move failure 0 0 0 0 0 0
Kswapd efficiency is up, but kswapd was doing less work according to
kswapd velocity. Direct reclaim efficiency is worse as well. At least
it's writing fewer pages.
FTrace Reclaim Statistics: vmscan
micro-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Direct reclaims 9823 9477 5737 5347 5078 4720
Direct reclaim pages scanned 1834430 2016994 1623675 1843754 1922668 1941916
Direct reclaim pages reclaimed 536567 517876 314115 289472 272265 252361
Direct reclaim write file async I/O 0 0 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0 16 0
Direct reclaim write file sync I/O 0 0 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0 0 0
Wake kswapd requests 1636 1692 2177 2403 2707 2757
Kswapd wakeups 28 29 30 34 15 23
Kswapd pages scanned 1399007 1272637 1842874 1810867 1425366 1426536
Kswapd pages reclaimed 637708 657418 860512 884531 906608 927206
Kswapd reclaim write file async I/O 380 332 56 32 0 0
Kswapd reclaim write anon async I/O 2326 2578 2360 937 896 873
Kswapd reclaim write file sync I/O 0 0 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0 0 0
Time stalled direct reclaim (seconds) 2.06 2.10 1.62 2.65 2.25 1.86
Time kswapd awake (seconds) 49.44 56.39 54.31 55.45 47.00 56.74
Total pages scanned 3233437 3289631 3466549 3654621 3348034 3368452
Total pages reclaimed 1174275 1175294 1174627 1174003 1178873 1179567
%age total pages scanned/reclaimed 36.32% 35.73% 33.88% 32.12% 35.21% 35.02%
%age total pages scanned/written 0.08% 0.09% 0.07% 0.03% 0.03% 0.03%
%age file pages scanned/written 0.01% 0.01% 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 22.86% 22.58% 19.15% 29.51% 27.37% 23.28%
Percentage Time kswapd Awake 87.75% 86.70% 81.63% 87.68% 89.56% 90.06%
Again, writes of file pages are reduced but kswapd is clearly awake
for longer.
What is interesting is that the number of pages written without the
patches was already quite low. This means there is relatively little room
for improvement in this benchmark.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0 0 0
Direct number conditional waited 768 793 704 1359 608 674
Direct time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0 0 0
KSwapd number congest waited 41 22 58 43 78 92
KSwapd time congest waited 2937ms 2200ms 4543ms 4300ms 7800ms 9200ms
KSwapd full congest waited 29 22 45 43 78 92
KSwapd number conditional waited 0 0 0 4284 4 9
KSwapd time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0 0 0
Some throttling happened but little time was spent asleep.
The objective of the series - reducing writes from reclaim - is
met, with filesystem writes from reclaim reduced to 0 and reclaim
in general doing less work. ext3, ext4 and xfs all showed marked
improvements for fs_mark in this configuration. btrfs looked worse
but it's within the noise and I'd expect the patches to have little
or no impact there due to it ignoring ->writepage from reclaim.
I'm rerunning the tests without monitors at the moment to verify the
performance improvements, which will take about 6 hours to complete,
but so far it looks promising.
Comments?
fs/fs-writeback.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 5 ++-
include/linux/mmzone.h | 2 +
include/linux/writeback.h | 1 +
mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++++--
mm/vmstat.c | 2 +
6 files changed, 115 insertions(+), 6 deletions(-)
--
1.7.3.4
* [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
@ 2011-07-13 14:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: Rik van Riel, Jan Kara, LKML, XFS, Christoph Hellwig,
Minchan Kim, Wu Fengguang, Johannes Weiner, Mel Gorman
(Revisting this from a year ago and following on from the thread
"Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
clustering". Posting an prototype to see if anything obvious is
being missed)
Testing from the XFS folk revealed that there is still too much
I/O from the end of the LRU in kswapd. Previously it was considered
acceptable by VM people for a small number of pages to be written
back from reclaim with testing generally showing about 0.3% of pages
reclaimed were written back (higher if memory was really low). That
writing back a small number of pages is ok has been heavily disputed
for quite some time and Dave Chinner explained it well;
It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.
To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig
xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request entirely
I think ext3 just writes back the page but I didn't double check.
Either way, each filesystem will have different performance
characteristics when under memory pressure and there are a lot of
dirty pages.
The objective of this series to for memory reclaim to play nicely
with writeback that is already in progress and throttle reclaimers
appropriately when dirty pages are encountered. The assumption is that
the flushers will always write pages faster than if reclaim issues
the IO. The problem is that reclaim has very little control over how
long before a page in a particular zone or container is cleaned.
This is a serious problem but as the behaviour of ->writepage is
filesystem-dependant, we are already faced with a situation where
reclaim has poor control over page cleaning.
A secondary goal is to avoid the problem whereby direct reclaim
splices two potentially deep call stacks together.
Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written
Patch 2 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.
Patch 3 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.
Patch 4 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them
Patch 5 tries to prioritise inodes backing dirty pages found at the end
of the LRU.
This is a prototype only and it's probable that I forgot or omitted
some issue brought up over the last year and a bit. I have not thought
about how this affects memcg and I have some concerns about patches
4 and 5. Patch 4 may reclaim too many pages as a reclaimer will skip
the dirty page, reclaim a clean page and later the dirty page gets
reclaimed anyway when writeback completes. I don't think it matters
but it's worth thinking about. Patch 5 is potentially a problem
because move_expired_inodes() is now walking the full delayed_queue
list. Is that a problem? I also have no double checked it's safe
to add I_DIRTY_RECLAIM or that the locking is correct. Basically,
patch 5 is a quick hack to see if it's worthwhile and may be rendered
unnecessary by Wu Fengguang or Jan Kara.
I consider this series to be orthogonal to the writeback work going
on at the moment so shout if that assumption is in error.
I tested this on ext3, ext4, btrfs and xfs using fs_mark and a micro
benchmark that does a streaming write to a large mapping (exercises
use-once LRU logic). The command line for fs_mark looked something like
./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
this triggers the worst behaviour.
6 kernels are tested.
vanilla 3.0-rc6
nodirectwb-v1r3 patch 1
lesskswapdwb-v1r3p patches 1-2
throttle-v1r10 patches 1-3
immediate-v1r10 patches 1-4
prioinode-v1r10 patches 1-5
During testing, a number of monitors were running to gather information
from ftrace in particular. This disrupts the results of course because
recording the information generates IO in itself but I'm ignoring
that for the moment so the effect of the patches can be seen.
I've posted the raw reports for each filesystem at
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html
As it was Dave and Christoph that brought this back up, here is the
XFS report in a bit more detail;
FS-Mark
fsmark-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Files/s min 5.30 ( 0.00%) 5.10 (-3.92%) 5.40 ( 1.85%) 5.70 ( 7.02%) 5.80 ( 8.62%) 5.70 ( 7.02%)
Files/s mean 6.93 ( 0.00%) 6.96 ( 0.40%) 7.11 ( 2.53%) 7.52 ( 7.82%) 7.44 ( 6.83%) 7.48 ( 7.38%)
Files/s stddev 0.89 ( 0.00%) 0.99 (10.62%) 0.85 (-4.18%) 1.02 (13.23%) 1.08 (18.06%) 1.00 (10.72%)
Files/s max 8.10 ( 0.00%) 8.60 ( 5.81%) 8.20 ( 1.22%) 9.50 (14.74%) 9.00 (10.00%) 9.10 (10.99%)
Overhead min 6623.00 ( 0.00%) 6417.00 ( 3.21%) 6035.00 ( 9.74%) 6354.00 ( 4.23%) 6213.00 ( 6.60%) 6491.00 ( 2.03%)
Overhead mean 29678.24 ( 0.00%) 40053.96 (-25.90%) 18278.56 (62.37%) 16365.20 (81.35%) 11987.40 (147.58%) 15606.36 (90.17%)
Overhead stddev 68727.49 ( 0.00%) 116258.18 (-40.88%) 34121.42 (101.42%) 28963.27 (137.29%) 17221.33 (299.08%) 26231.50 (162.00%)
Overhead max 339993.00 ( 0.00%) 588147.00 (-42.19%) 148281.00 (129.29%) 140568.00 (141.87%) 77836.00 (336.81%) 124728.00 (172.59%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 34.97 35.31 31.16 30.47 29.85 29.66
Total Elapsed Time (seconds) 567.08 566.84 551.75 525.81 534.91 526.32
Average files per second is increased by a nice percentage albeit
just within the standard deviation. Consider the type of test this is,
variability was inevitable but will double check without monitoring.
The overhead (time spent in non-filesystem-related activities) is
reduced a *lot* and is a lot less variable. Time to completion is
improved across the board which is always good because it implies
that IO was consistently higher which is sortof visible 4 minutes into the test at
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-sandy.png
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/blockio-comparison-smooth-sandy.png
kswapd CPU usage is also interesting
http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/kswapdcpu-comparison-smooth-sandy.png
Note how preventing kswapd reclaiming dirty pages pushes up its CPU
usage as it scans more pages but the throttle brings it back down
and reduced further by patches 4 and 5.
MMTests Statistics: vmstat
Page Ins 189840 196608 189864 128120 126148 151888
Page Outs 38439897 38420872 38422937 38395008 38367766 38396612
Swap Ins 19468 20555 20024 4933 3799 4588
Swap Outs 10019 10388 10353 4737 3617 4084
Direct pages scanned 4865170 4903030 1359813 408460 101716 199483
Kswapd pages scanned 8202014 8146467 16980235 19428420 14269907 14103872
Kswapd pages reclaimed 4700400 4665093 8205753 9143997 9449722 9358347
Direct pages reclaimed 4864514 4901411 1359368 407711 100520 198323
Kswapd efficiency 57% 57% 48% 47% 66% 66%
Kswapd velocity 14463.592 14371.722 30775.233 36949.506 26677.211 26797.142
Direct efficiency 99% 99% 99% 99% 98% 99%
Direct velocity 8579.336 8649.760 2464.546 776.821 190.155 379.015
Percentage direct scans 37% 37% 7% 2% 0% 1%
Page writes by reclaim 14511 14721 10387 4819 3617 4084
Page writes skipped 0 30 2300502 2774735 0 0
Page reclaim invalidate 0 0 0 0 5155 3509
Page reclaim throttled 0 0 0 65112 190 190
Slabs scanned 16512 17920 18048 17536 16640 17408
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 5180 5318 5177 5178 5179 5193
Kswapd skipped wait 131 0 4 44 0 0
Compaction stalls 2 2 0 0 5 1
Compaction success 2 2 0 0 2 1
Compaction failures 0 0 0 0 3 0
Compaction pages moved 0 0 0 0 1049 0
Compaction move failure 0 0 0 0 96 0
These stats are based on information from /proc/vmstat
"Kswapd efficiency" is the percentage of pages reclaimed to pages
scanned. The higher the percentage is the better because a low
percentage implies that kswapd is scanning uselessly. As the workload
dirties memory heavily and is a small machine, the efficiency starts
low at 57% but increases to 66% with all the patches applied.
"Kswapd velocity" is the average number of pages scanned per
second. The patches increase this as it's no longer getting blocked
on page writes so it's expected.
Direct reclaim work is significantly reduced going from 37% of all
pages scanned to 1% with all patches applied. This implies that
processes are getting stalled less.
Page writes by reclaim is what is motivating this series. It goes
from 14511 pages to 4084 which is a big improvement. We'll see later
if these were anonymous or file-backed pages.
"Page writes skipped" are dirty pages encountered at the end of the
LRU and only exists for patches 2, 3 and 4. It shows that kswapd is
encountering very large numbers of dirty pages (debugging showed they
weren't under writeback). The number of pages that get invalidated and
freed later is a more reasonable number and "page reclaim throttled"
shows that throttling is not a major problem.
FTrace Reclaim Statistics: vmscan
fsmark-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Direct reclaims 89145 89785 24921 7546 1954 3747
Direct reclaim pages scanned 4865170 4903030 1359813 408460 101716 199483
Direct reclaim pages reclaimed 4864514 4901411 1359368 407711 100520 198323
Direct reclaim write file async I/O 0 0 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 3 1 0
Direct reclaim write file sync I/O 0 0 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0 0 0
Wake kswapd requests 11152 11021 21223 24029 26797 26672
Kswapd wakeups 421 397 761 778 776 742
Kswapd pages scanned 8202014 8146467 16980235 19428420 14269907 14103872
Kswapd pages reclaimed 4700400 4665093 8205753 9143997 9449722 9358347
Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
Kswapd reclaim write anon async I/O 10027 10435 10387 4815 3616 4084
Kswapd reclaim write file sync I/O 0 0 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0 0 0
Time stalled direct reclaim (seconds) 0.26 0.25 0.08 0.05 0.04 0.08
Time kswapd awake (seconds) 493.26 494.05 430.09 420.52 428.55 428.81
Total pages scanned 13067184 13049497 18340048 19836880 14371623 14303355
Total pages reclaimed 9564914 9566504 9565121 9551708 9550242 9556670
%age total pages scanned/reclaimed 73.20% 73.31% 52.15% 48.15% 66.45% 66.81%
%age total pages scanned/written 0.11% 0.11% 0.06% 0.02% 0.03% 0.03%
%age file pages scanned/written 0.03% 0.03% 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 0.74% 0.70% 0.26% 0.16% 0.13% 0.27%
Percentage Time kswapd Awake 86.98% 87.16% 77.95% 79.98% 80.12% 81.47%
This is based on information from the vmscan tracepoints introduced
the last time this issue came up.
Direct reclaim writes were never a problem according to this.
kswapd writes of file-backed pages on the other hand went from 4483 to
0 which is nice and part of the objective after all. The page writes of
4084 recorded from /proc/vmstat with all patches applied iwas clearly
due to writing anonymous pages as there is a direct correlation there.
Time spent in direct reclaim is reduced quite a bit as well as the
time kswapd spent awake.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0 0 0
Direct number conditional waited 0 1 0 56 8 0
Direct time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0 0 0
KSwapd number congest waited 4 0 1 0 6 0
KSwapd time congest waited 400ms 0ms 100ms 0ms 501ms 0ms
KSwapd full congest waited 4 0 1 0 5 0
KSwapd number conditional waited 0 0 0 65056 189 190
KSwapd time conditional waited 0ms 0ms 0ms 1ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0 0 0
This is based on some of the writeback tracepoints. It's interesting
to note that while kswapd got throttled 190 times with all patches
applied, it spent negligible time asleep so probably just called
cond_resched(). This implies that neither the zone nor the backing
device was congested. As there is only one source of IO, this is
expected. With multiple processes, this picture might change.
MICRO
micro-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 6.95 7.2 6.84 6.33 5.97 6.13
Total Elapsed Time (seconds) 56.34 65.04 66.53 63.24 52.48 63.00
This is a test that just writes a mapping. Unfortunately, the time to
completion is increased by the series. Again, I'll have to run without
any monitoring to confirm whether it's a real problem.
MMTests Statistics: vmstat
Page Ins 46928 50660 48504 42888 42648 43036
Page Outs 4990816 4994987 4987572 4999242 4981324 4990627
Swap Ins 2573 3234 2470 1396 1352 1297
Swap Outs 2316 2578 2360 937 912 873
Direct pages scanned 1834430 2016994 1623675 1843754 1922668 1941916
Kswapd pages scanned 1399007 1272637 1842874 1810867 1425366 1426536
Kswapd pages reclaimed 637708 657418 860512 884531 906608 927206
Direct pages reclaimed 536567 517876 314115 289472 272265 252361
Kswapd efficiency 45% 51% 46% 48% 63% 64%
Kswapd velocity 24831.505 19566.990 27699.895 28634.836 27160.175 22643.429
Direct efficiency 29% 25% 19% 15% 14% 12%
Direct velocity 32559.993 31011.593 24405.156 29154.870 36636.204 30824.063
Percentage direct scans 56% 61% 46% 50% 57% 57%
Page writes by reclaim 2706 2910 2416 969 912 873
Page writes skipped 0 12640 148339 166844 0 0
Page reclaim invalidate 0 0 0 0 12 58
Page reclaim throttled 0 0 0 4788 7 9
Slabs scanned 4096 5248 5120 6656 4480 16768
Direct inode steals 531 1189 348 1166 700 3783
Kswapd inode steals 164 0 349 0 0 9
Kswapd skipped wait 78 35 74 51 14 10
Compaction stalls 0 0 1 0 0 0
Compaction success 0 0 1 0 0 0
Compaction failures 0 0 0 0 0 0
Compaction pages moved 0 0 0 0 0 0
Compaction move failure 0 0 0 0 0 0
Kswapd efficiency is up, but kswapd velocity shows it was doing less
work. Direct reclaim efficiency is worse as well. At least it's
writing fewer pages.
FTrace Reclaim Statistics: vmscan
micro-3.0.0 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6 3.0.0-rc6
rc6-vanilla nodirectwb-v1r3 lesskswapdwb-v1r3 throttle-v1r10 immediate-v1r10 prioinode-v1r10
Direct reclaims 9823 9477 5737 5347 5078 4720
Direct reclaim pages scanned 1834430 2016994 1623675 1843754 1922668 1941916
Direct reclaim pages reclaimed 536567 517876 314115 289472 272265 252361
Direct reclaim write file async I/O 0 0 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0 16 0
Direct reclaim write file sync I/O 0 0 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0 0 0
Wake kswapd requests 1636 1692 2177 2403 2707 2757
Kswapd wakeups 28 29 30 34 15 23
Kswapd pages scanned 1399007 1272637 1842874 1810867 1425366 1426536
Kswapd pages reclaimed 637708 657418 860512 884531 906608 927206
Kswapd reclaim write file async I/O 380 332 56 32 0 0
Kswapd reclaim write anon async I/O 2326 2578 2360 937 896 873
Kswapd reclaim write file sync I/O 0 0 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0 0 0
Time stalled direct reclaim (seconds) 2.06 2.10 1.62 2.65 2.25 1.86
Time kswapd awake (seconds) 49.44 56.39 54.31 55.45 47.00 56.74
Total pages scanned 3233437 3289631 3466549 3654621 3348034 3368452
Total pages reclaimed 1174275 1175294 1174627 1174003 1178873 1179567
%age total pages scanned/reclaimed 36.32% 35.73% 33.88% 32.12% 35.21% 35.02%
%age total pages scanned/written 0.08% 0.09% 0.07% 0.03% 0.03% 0.03%
%age file pages scanned/written 0.01% 0.01% 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 22.86% 22.58% 19.15% 29.51% 27.37% 23.28%
Percentage Time kswapd Awake 87.75% 86.70% 81.63% 87.68% 89.56% 90.06%
Again, writes of file pages are reduced but kswapd is clearly awake
for longer.
What is interesting is that the number of pages written without the
patches was already quite low. This means there is relatively little room
for improvement in this benchmark.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0 0 0 0 0
Direct time congest waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full congest waited 0 0 0 0 0 0
Direct number conditional waited 768 793 704 1359 608 674
Direct time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0 0 0
KSwapd number congest waited 41 22 58 43 78 92
KSwapd time congest waited 2937ms 2200ms 4543ms 4300ms 7800ms 9200ms
KSwapd full congest waited 29 22 45 43 78 92
KSwapd number conditional waited 0 0 0 4284 4 9
KSwapd time conditional waited 0ms 0ms 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0 0 0
Some throttling occurred, but little time was spent asleep.
The objective of the series - reducing writes from reclaim - is
met with filesystem writes from reclaim reduced to 0 with reclaim
in general doing less work. ext3, ext4 and xfs all showed marked
improvements for fs_mark in this configuration. btrfs looked worse
but it's within the noise and I'd expect the patches to have little
or no impact there due to it ignoring ->writepage from reclaim.
I'm rerunning the tests without monitors at the moment to verify the
performance improvements which will take about 6 hours to complete
but so far it looks promising.
Comments?
fs/fs-writeback.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 5 ++-
include/linux/mmzone.h | 2 +
include/linux/writeback.h | 1 +
mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++++--
mm/vmstat.c | 2 +
6 files changed, 115 insertions(+), 6 deletions(-)
--
1.7.3.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
From: Mel Gorman <mel@csn.ul.ie>
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty
page is encountered during the scan, this page is written to backing
storage using mapping->writepage.
This causes two problems. First, it can result in very deep call
stacks, particularly if the target storage or filesystem are complex.
Some filesystems ignore write requests from direct reclaim as a result.
The second is that a single-page flush is inefficient in terms of IO.
While there is an expectation that the elevator will merge requests,
this does not always happen. Quoting Christoph Hellwig;
The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to
swap as there is not the equivalent of a flusher thread for anonymous
pages. If the dirty pages cannot be written back, they are placed
back on the LRU lists.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 1 +
mm/vmscan.c | 9 +++++++++
mm/vmstat.c | 1 +
3 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..b70a0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,6 +100,7 @@ enum zone_stat_item {
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
+ NR_VMSCAN_WRITE_SKIP,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f49535..2d3e5b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageDirty(page)) {
nr_dirty++;
+ /*
+ * Only kswapd can writeback filesystem pages to
+ * avoid risk of stack overflow
+ */
+ if (page_is_file_cache(page) && !current_is_kswapd()) {
+ inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
+ goto keep_locked;
+ }
+
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..fd109f3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,6 +702,7 @@ const char * const vmstat_text[] = {
"nr_unstable",
"nr_bounce",
"nr_vmscan_write",
+ "nr_vmscan_write_skip",
"nr_writeback_temp",
"nr_isolated_anon",
"nr_isolated_file",
--
1.7.3.4
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
@ 2011-07-13 14:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: Rik van Riel, Jan Kara, LKML, XFS, Christoph Hellwig,
Minchan Kim, Wu Fengguang, Johannes Weiner, Mel Gorman
From: Mel Gorman <mel@csn.ul.ie>
When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty
page is encountered during the scan, this page is written to backing
storage using mapping->writepage.
This causes two problems. First, it can result in very deep call
stacks, particularly if the target storage or filesystem are complex.
Some filesystems ignore write requests from direct reclaim as a result.
The second is that a single-page flush is inefficient in terms of IO.
While there is an expectation that the elevator will merge requests,
this does not always happen. Quoting Christoph Hellwig;
The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.
This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to
swap as there is not the equivalent of a flusher thread for anonymos
pages. If the dirty pages cannot be written back, they are placed
back on the LRU lists.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 1 +
mm/vmscan.c | 9 +++++++++
mm/vmstat.c | 1 +
3 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..b70a0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,6 +100,7 @@ enum zone_stat_item {
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
+ NR_VMSCAN_WRITE_SKIP,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f49535..2d3e5b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageDirty(page)) {
nr_dirty++;
+ /*
+ * Only kswapd can writeback filesystem pages to
+ * avoid risk of stack overflow
+ */
+ if (page_is_file_cache(page) && !current_is_kswapd()) {
+ inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
+ goto keep_locked;
+ }
+
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..fd109f3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,6 +702,7 @@ const char * const vmstat_text[] = {
"nr_unstable",
"nr_bounce",
"nr_vmscan_write",
+ "nr_vmscan_write_skip",
"nr_writeback_temp",
"nr_isolated_anon",
"nr_isolated_file",
--
1.7.3.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
It is preferable that no dirty pages are dispatched for cleaning from
the page reclaim path. At normal priorities, this patch prevents kswapd
writing pages.
However, page reclaim does have a requirement that pages be freed
in a particular zone. If it is failing to make sufficient progress
(reclaiming fewer than SWAP_CLUSTER_MAX pages at a given priority),
the priority is raised to scan more pages. A priority of
DEF_PRIORITY - 3 is considered to be the point where kswapd is
getting into trouble reclaiming pages. If this priority is reached,
kswapd will dispatch pages for writing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
mm/vmscan.c | 13 ++++++++-----
1 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d3e5b6..e272951 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,7 +719,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
- struct scan_control *sc)
+ struct scan_control *sc,
+ int priority)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -827,9 +828,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
/*
* Only kswapd can writeback filesystem pages to
- * avoid risk of stack overflow
+ * avoid risk of stack overflow but do not writeback
+ * unless under significant pressure.
*/
- if (page_is_file_cache(page) && !current_is_kswapd()) {
+ if (page_is_file_cache(page) &&
+ (!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
goto keep_locked;
}
@@ -1465,12 +1468,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
set_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+ nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
}
local_irq_disable();
--
1.7.3.4
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
Workloads that are allocating frequently and writing files place a
large number of dirty pages on the LRU. With use-once logic, it is
possible for them to reach the end of the LRU quickly, requiring the
reclaimer to scan more pages to find clean ones. Ordinarily, processes
that are dirtying memory will get throttled by dirty balancing, but
this is a global heuristic and does not take into account that LRUs
are maintained on a per-zone basis. This can lead to a situation
whereby reclaim scans heavily, skipping over a large number of pages
under writeback and recycling them around the LRU, consuming CPU.
This patch checks how many of the pages isolated from the LRU were
dirty. If a percentage of them are dirty, the process will be
throttled if the backing device is congested or the zone being scanned
is marked congested. The percentage that must be dirty depends on
the priority. At default priority, all of them must be dirty; at
DEF_PRIORITY-1, 50% of them must be dirty; at DEF_PRIORITY-2, 25%;
and so on. As pressure increases, the process becomes more likely to
be throttled, allowing the flusher threads to make some progress.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 1 +
mm/vmscan.c | 23 ++++++++++++++++++++---
mm/vmstat.c | 1 +
3 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b70a0c0..c4508a2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -101,6 +101,7 @@ enum zone_stat_item {
NR_BOUNCE,
NR_VMSCAN_WRITE,
NR_VMSCAN_WRITE_SKIP,
+ NR_VMSCAN_THROTTLED,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e272951..9826086 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,7 +720,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
static unsigned long shrink_page_list(struct list_head *page_list,
struct zone *zone,
struct scan_control *sc,
- int priority)
+ int priority,
+ unsigned long *ret_nr_dirty)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -971,6 +972,7 @@ keep_lumpy:
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
+ *ret_nr_dirty += nr_dirty;
return nr_reclaimed;
}
@@ -1420,6 +1422,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_taken;
unsigned long nr_anon;
unsigned long nr_file;
+ unsigned long nr_dirty = 0;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1468,12 +1471,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
spin_unlock_irq(&zone->lru_lock);
- nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+ priority, &nr_dirty);
/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
set_reclaim_mode(priority, sc, true);
- nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
+ nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+ priority, &nr_dirty);
}
local_irq_disable();
@@ -1483,6 +1488,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
+ /*
+ * If we have encountered a high number of dirty pages then they
+ * are reaching the end of the LRU too quickly and global limits are
+ * not enough to throttle processes due to the page distribution
+ * throughout zones. Scale the number of dirty pages that must be
+ * dirty before being throttled to priority.
+ */
+ if (nr_dirty && nr_dirty >= (nr_taken >> (DEF_PRIORITY-priority))) {
+ inc_zone_state(zone, NR_VMSCAN_THROTTLED);
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+ }
+
trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
zone_idx(zone),
nr_scanned, nr_reclaimed,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd109f3..59ee17c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -703,6 +703,7 @@ const char * const vmstat_text[] = {
"nr_bounce",
"nr_vmscan_write",
"nr_vmscan_write_skip",
+ "nr_vmscan_throttled",
"nr_writeback_temp",
"nr_isolated_anon",
"nr_isolated_file",
--
1.7.3.4
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
When direct reclaim encounters a dirty page, it gets recycled around
the LRU for another cycle. This patch marks the page PageReclaim using
deactivate_page() so that the page gets reclaimed almost immediately
after it is cleaned. This avoids reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU, which may
have been something like a use-once page.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 2 +-
mm/vmscan.c | 10 ++++++++--
mm/vmstat.c | 2 +-
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4508a2..bea7858 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,7 +100,7 @@ enum zone_stat_item {
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
- NR_VMSCAN_WRITE_SKIP,
+ NR_VMSCAN_INVALIDATE,
NR_VMSCAN_THROTTLED,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9826086..8e00aee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
- inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
- goto keep_locked;
+ inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
+
+ /* Immediately reclaim when written back */
+ unlock_page(page);
+ deactivate_page(page);
+
+ goto keep_dirty;
}
if (references == PAGEREF_RECLAIM_CLEAN)
@@ -956,6 +961,7 @@ keep:
reset_reclaim_mode(sc);
keep_lumpy:
list_add(&page->lru, &ret_pages);
+keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59ee17c..2c82ae5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,7 +702,7 @@ const char * const vmstat_text[] = {
"nr_unstable",
"nr_bounce",
"nr_vmscan_write",
- "nr_vmscan_write_skip",
+ "nr_vmscan_invalidate",
"nr_vmscan_throttled",
"nr_writeback_temp",
"nr_isolated_anon",
--
1.7.3.4
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
@ 2011-07-13 14:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: Rik van Riel, Jan Kara, LKML, XFS, Christoph Hellwig,
Minchan Kim, Wu Fengguang, Johannes Weiner, Mel Gorman
When direct reclaim encounters a dirty page, it gets recycled around
the LRU for another cycle. This patch marks the page PageReclaim using
deactivate_page() so that the page gets reclaimed almost immediately
after the page gets cleaned. This is to avoid reclaiming clean pages
that are younger than a dirty page encountered at the end of the LRU
that might have been something like a use-once page.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 2 +-
mm/vmscan.c | 10 ++++++++--
mm/vmstat.c | 2 +-
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4508a2..bea7858 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,7 +100,7 @@ enum zone_stat_item {
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
- NR_VMSCAN_WRITE_SKIP,
+ NR_VMSCAN_INVALIDATE,
NR_VMSCAN_THROTTLED,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9826086..8e00aee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
- inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
- goto keep_locked;
+ inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
+
+ /* Immediately reclaim when written back */
+ unlock_page(page);
+ deactivate_page(page);
+
+ goto keep_dirty;
}
if (references == PAGEREF_RECLAIM_CLEAN)
@@ -956,6 +961,7 @@ keep:
reset_reclaim_mode(sc);
keep_lumpy:
list_add(&page->lru, &ret_pages);
+keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59ee17c..2c82ae5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,7 +702,7 @@ const char * const vmstat_text[] = {
"nr_unstable",
"nr_bounce",
"nr_vmscan_write",
- "nr_vmscan_write_skip",
+ "nr_vmscan_invalidate",
"nr_vmscan_throttled",
"nr_writeback_temp",
"nr_isolated_anon",
--
1.7.3.4
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 114+ messages in thread
* [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
@ 2011-07-13 14:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
When direct reclaim encounters a dirty page, it gets recycled around
the LRU for another cycle. This patch marks the page PageReclaim using
deactivate_page() so that the page gets reclaimed almost immediately
after the page gets cleaned. This is to avoid reclaiming clean pages
that are younger than a dirty page encountered at the end of the LRU
that might have been something like a use-once page.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
include/linux/mmzone.h | 2 +-
mm/vmscan.c | 10 ++++++++--
mm/vmstat.c | 2 +-
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4508a2..bea7858 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,7 +100,7 @@ enum zone_stat_item {
NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_BOUNCE,
NR_VMSCAN_WRITE,
- NR_VMSCAN_WRITE_SKIP,
+ NR_VMSCAN_INVALIDATE,
NR_VMSCAN_THROTTLED,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9826086..8e00aee 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
- inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
- goto keep_locked;
+ inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
+
+ /* Immediately reclaim when written back */
+ unlock_page(page);
+ deactivate_page(page);
+
+ goto keep_dirty;
}
if (references == PAGEREF_RECLAIM_CLEAN)
@@ -956,6 +961,7 @@ keep:
reset_reclaim_mode(sc);
keep_lumpy:
list_add(&page->lru, &ret_pages);
+keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59ee17c..2c82ae5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,7 +702,7 @@ const char * const vmstat_text[] = {
"nr_unstable",
"nr_bounce",
"nr_vmscan_write",
- "nr_vmscan_write_skip",
+ "nr_vmscan_invalidate",
"nr_vmscan_throttled",
"nr_writeback_temp",
"nr_isolated_anon",
--
1.7.3.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
* [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 14:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 14:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman
It is preferable that no dirty pages are dispatched from the page
reclaim path. If reclaim is encountering dirty pages, it implies that
either reclaim is getting ahead of writeback or use-once logic has
prioritised for reclaim pages that are young relative to when the
inode was dirtied.
When dirty pages are encountered on the LRU, this patch marks the
inodes I_DIRTY_RECLAIM and wakes the background flusher. When the
background flusher runs, it moves such inodes immediately to the
dispatch queue regardless of inode age. There is no guarantee that
the pages reclaim cares about will be cleaned first, but the
expectation is that the flusher threads will clean those pages faster
than reclaim could by writing back one page at a time.
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
fs/fs-writeback.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 5 ++-
include/linux/writeback.h | 1 +
mm/vmscan.c | 16 ++++++++++++-
4 files changed, 74 insertions(+), 4 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0f015a0..1201052 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -257,9 +257,23 @@ static void move_expired_inodes(struct list_head *delaying_queue,
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
- struct inode *inode;
+ struct inode *inode, *tinode;
int do_sb_sort = 0;
+ /* Move inodes reclaim found at end of LRU to dispatch queue */
+ list_for_each_entry_safe(inode, tinode, delaying_queue, i_wb_list) {
+ /* Move any inode found at end of LRU to dispatch queue */
+ if (inode->i_state & I_DIRTY_RECLAIM) {
+ inode->i_state &= ~I_DIRTY_RECLAIM;
+ list_move(&inode->i_wb_list, &tmp);
+
+ if (sb && sb != inode->i_sb)
+ do_sb_sort = 1;
+ sb = inode->i_sb;
+ }
+ }
+
+ sb = NULL;
while (!list_empty(delaying_queue)) {
inode = wb_inode(delaying_queue->prev);
if (older_than_this &&
@@ -968,6 +982,46 @@ void wakeup_flusher_threads(long nr_pages)
rcu_read_unlock();
}
+/*
+ * Similar to wakeup_flusher_threads except that it prioritises
+ * inodes contained in the page_list regardless of age
+ */
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
+{
+ struct page *page;
+ struct address_space *mapping;
+ struct inode *inode;
+
+ list_for_each_entry(page, page_list, lru) {
+ if (!PageDirty(page))
+ continue;
+
+ if (PageSwapBacked(page))
+ continue;
+
+ lock_page(page);
+ mapping = page_mapping(page);
+ if (!mapping)
+ goto unlock;
+
+ /*
+ * Test outside the lock to see if it is already set. The inode
+ * should be pinned by the lock_page
+ */
+ inode = page->mapping->host;
+ if (inode->i_state & I_DIRTY_RECLAIM)
+ goto unlock;
+
+ spin_lock(&inode->i_lock);
+ inode->i_state |= I_DIRTY_RECLAIM;
+ spin_unlock(&inode->i_lock);
+unlock:
+ unlock_page(page);
+ }
+
+ wakeup_flusher_threads(nr_pages);
+}
+
static noinline void block_dump___mark_inode_dirty(struct inode *inode)
{
if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b5b9792..bb0f4c2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1650,8 +1650,8 @@ struct super_operations {
/*
* Inode state bits. Protected by inode->i_lock
*
- * Three bits determine the dirty state of the inode, I_DIRTY_SYNC,
- * I_DIRTY_DATASYNC and I_DIRTY_PAGES.
+ * Four bits determine the dirty state of the inode, I_DIRTY_SYNC,
+ * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM.
*
* Four bits define the lifetime of an inode. Initially, inodes are I_NEW,
* until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at
@@ -1706,6 +1706,7 @@ struct super_operations {
#define __I_SYNC 7
#define I_SYNC (1 << __I_SYNC)
#define I_REFERENCED (1 << 8)
+#define I_DIRTY_RECLAIM (1 << 9)
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 17e7ccc..1e77793 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -66,6 +66,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
struct writeback_control *wbc);
long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
void wakeup_flusher_threads(long nr_pages);
+void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8e00aee..db62af1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -725,8 +725,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
+ LIST_HEAD(dirty_pages);
+
int pgactivate = 0;
unsigned long nr_dirty = 0;
+ unsigned long nr_unqueued_dirty = 0;
unsigned long nr_congested = 0;
unsigned long nr_reclaimed = 0;
@@ -830,7 +833,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
/*
* Only kswapd can writeback filesystem pages to
* avoid risk of stack overflow but do not writeback
- * unless under significant pressure.
+ * unless under significant pressure. For dirty pages
+ * not under writeback, create a list and pass the
+ * inodes to the flusher threads later
*/
if (page_is_file_cache(page) &&
(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
@@ -840,6 +845,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
unlock_page(page);
deactivate_page(page);
+ /* Prioritise the backing inodes later */
+ nr_unqueued_dirty++;
+ list_add(&page->lru, &dirty_pages);
+
goto keep_dirty;
}
@@ -976,6 +985,11 @@ keep_dirty:
free_page_list(&free_pages);
+ if (!list_empty(&dirty_pages)) {
+ wakeup_flusher_threads_pages(nr_unqueued_dirty, &dirty_pages);
+ list_splice(&dirty_pages, &ret_pages);
+ }
+
list_splice(&ret_pages, page_list);
count_vm_events(PGACTIVATE, pgactivate);
*ret_nr_dirty += nr_dirty;
--
1.7.3.4
* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 15:31 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 15:31 UTC (permalink / raw)
To: Linux-MM
Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:22PM +0100, Mel Gorman wrote:
> <SNIP>
> The objective of the series - reducing writes from reclaim - is
> met with filesystem writes from reclaim reduced to 0 with reclaim
> in general doing less work. ext3, ext4 and xfs all showed marked
> improvements for fs_mark in this configuration. btrfs looked worse
> but it's within the noise and I'd expect the patches to have little
> or no impact there due to it ignoring ->writepage from reclaim.
>
My bad, I accidentally looked at an old report for btrfs based on
older patches. In the report posted with all patches applied, the
performance of btrfs does look better but as the patches should make
no difference, it's still in the noise.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 16:40 ` Johannes Weiner
0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2011-07-13 16:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:26PM +0100, Mel Gorman wrote:
> When direct reclaim encounters a dirty page, it gets recycled around
> the LRU for another cycle. This patch marks the page PageReclaim using
> deactivate_page() so that the page gets reclaimed almost immediately
> after the page gets cleaned. This is to avoid reclaiming clean pages
> that are younger than a dirty page encountered at the end of the LRU
> that might have been something like a use-once page.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> include/linux/mmzone.h | 2 +-
> mm/vmscan.c | 10 ++++++++--
> mm/vmstat.c | 2 +-
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c4508a2..bea7858 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -100,7 +100,7 @@ enum zone_stat_item {
> NR_UNSTABLE_NFS, /* NFS unstable pages */
> NR_BOUNCE,
> NR_VMSCAN_WRITE,
> - NR_VMSCAN_WRITE_SKIP,
> + NR_VMSCAN_INVALIDATE,
> NR_VMSCAN_THROTTLED,
> NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9826086..8e00aee 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> */
> if (page_is_file_cache(page) &&
> (!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> - inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> - goto keep_locked;
> + inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> +
> + /* Immediately reclaim when written back */
> + unlock_page(page);
> + deactivate_page(page);
> +
> + goto keep_dirty;
> }
>
> if (references == PAGEREF_RECLAIM_CLEAN)
> @@ -956,6 +961,7 @@ keep:
> reset_reclaim_mode(sc);
> keep_lumpy:
> list_add(&page->lru, &ret_pages);
> +keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
I really like the idea behind this patch, but I think all those pages
are lost as PageLRU is cleared on isolation and lru_deactivate_fn
bails on them in turn.
If I'm not mistaken, the reference from the isolation is also leaked.
* Re: [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
@ 2011-07-13 16:40 ` Johannes Weiner
0 siblings, 0 replies; 114+ messages in thread
From: Johannes Weiner @ 2011-07-13 16:40 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:26PM +0100, Mel Gorman wrote:
> When direct reclaim encounters a dirty page, it gets recycled around
> the LRU for another cycle. This patch marks the page PageReclaim using
> deactivate_page() so that the page gets reclaimed almost immediately
> after the page gets cleaned. This is to avoid reclaiming clean pages
> that are younger than a dirty page encountered at the end of the LRU
> that might have been something like a use-once page.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> include/linux/mmzone.h | 2 +-
> mm/vmscan.c | 10 ++++++++--
> mm/vmstat.c | 2 +-
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c4508a2..bea7858 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -100,7 +100,7 @@ enum zone_stat_item {
> NR_UNSTABLE_NFS, /* NFS unstable pages */
> NR_BOUNCE,
> NR_VMSCAN_WRITE,
> - NR_VMSCAN_WRITE_SKIP,
> + NR_VMSCAN_INVALIDATE,
> NR_VMSCAN_THROTTLED,
> NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9826086..8e00aee 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> */
> if (page_is_file_cache(page) &&
> (!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> - inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> - goto keep_locked;
> + inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> +
> + /* Immediately reclaim when written back */
> + unlock_page(page);
> + deactivate_page(page);
> +
> + goto keep_dirty;
> }
>
> if (references == PAGEREF_RECLAIM_CLEAN)
> @@ -956,6 +961,7 @@ keep:
> reset_reclaim_mode(sc);
> keep_lumpy:
> list_add(&page->lru, &ret_pages);
> +keep_dirty:
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
I really like the idea behind this patch, but I think all those pages
are lost as PageLRU is cleared on isolation and lru_deactivate_fn
bails on them in turn.
If I'm not mistaken, the reference from the isolation is also leaked.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
2011-07-13 16:40 ` Johannes Weiner
@ 2011-07-13 17:15 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-13 17:15 UTC (permalink / raw)
To: Johannes Weiner
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 06:40:40PM +0200, Johannes Weiner wrote:
> On Wed, Jul 13, 2011 at 03:31:26PM +0100, Mel Gorman wrote:
> > When direct reclaim encounters a dirty page, it gets recycled around
> > the LRU for another cycle. This patch marks the page PageReclaim using
> > deactivate_page() so that the page gets reclaimed almost immediately
> > after the page gets cleaned. This is to avoid reclaiming clean pages
> > that are younger than a dirty page encountered at the end of the LRU
> > that might have been something like a use-once page.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> > include/linux/mmzone.h | 2 +-
> > mm/vmscan.c | 10 ++++++++--
> > mm/vmstat.c | 2 +-
> > 3 files changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c4508a2..bea7858 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -100,7 +100,7 @@ enum zone_stat_item {
> > NR_UNSTABLE_NFS, /* NFS unstable pages */
> > NR_BOUNCE,
> > NR_VMSCAN_WRITE,
> > - NR_VMSCAN_WRITE_SKIP,
> > + NR_VMSCAN_INVALIDATE,
> > NR_VMSCAN_THROTTLED,
> > NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> > NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9826086..8e00aee 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -834,8 +834,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > */
> > if (page_is_file_cache(page) &&
> > (!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > - inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > - goto keep_locked;
> > + inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> > +
> > + /* Immediately reclaim when written back */
> > + unlock_page(page);
> > + deactivate_page(page);
> > +
> > + goto keep_dirty;
> > }
> >
> > if (references == PAGEREF_RECLAIM_CLEAN)
> > @@ -956,6 +961,7 @@ keep:
> > reset_reclaim_mode(sc);
> > keep_lumpy:
> > list_add(&page->lru, &ret_pages);
> > +keep_dirty:
> > VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> > }
>
> I really like the idea behind this patch, but I think all those pages
> are lost as PageLRU is cleared on isolation and lru_deactivate_fn
> bails on them in turn.
>
> If I'm not mistaken, the reference from the isolation is also leaked.
I think you're right. This patch was rushed and not thought through
properly. The surprise is that it appeared to work at all. I will
rework it. Thanks.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 21:39 ` Jan Kara
-1 siblings, 0 replies; 114+ messages in thread
From: Jan Kara @ 2011-07-13 21:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched from the page
> reclaim path. If reclaim is encountering dirty pages, it implies that
> either reclaim is getting ahead of writeback or use-once logic has
> prioritise pages for reclaiming that are young relative to when the
> inode was dirtied.
>
> When dirty pages are encounted on the LRU, this patch marks the inodes
> I_DIRTY_RECLAIM and wakes the background flusher. When the background
> flusher runs, it moves such inodes immediately to the dispatch queue
> regardless of inode age. There is no guarantee that pages reclaim
> cares about will be cleaned first but the expectation is that the
> flusher threads will clean the page quicker than if reclaim tried to
> clean a single page.
Hmm, I was looking through your numbers but I didn't see any significant
difference this patch would make. Do you?
I was thinking about the problem, and doing IO from kswapd would actually
be only a small problem if we submitted more than just a single page. Just
to give you an idea - the time to write a single page on a plain SATA drive
might be like 4 ms, while the time to write a sequential 4 MB of data is
like 80 ms (I just made up
these numbers but the orders should be right). So to write 1000 times more
data you just need like 20 times longer. That's a factor of 50 in IO
efficiency. So when reclaim/kswapd submits a single page IO once every
couple of milliseconds, your IO throughput just went close to zero...
BTW: I just checked your numbers in fsmark test with vanilla kernel. You
wrote like 14500 pages from reclaim in 567 seconds. That is about one page
per 39 ms. That is going to have noticeable impact on IO throughput (not
with XFS because it plays tricks with writing more than asked but with ext2
or ext3 you would see it I guess).
So when kswapd sees high percentage of dirty pages at the end of LRU, it
could call something like fdatawrite_range() for the range of 4 MB
(provided the file is large enough) containing that page and IO throughput
would not be hit that much and you will get reasonably bounded time when
the page gets cleaned... If you wanted to be clever, you could possibly be
more sophisticated in picking the file and range to write so that you get
rid of the most pages at the end of LRU but I'm not sure it's worth the CPU
cycles. Does this sound reasonable to you?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 23:34 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-13 23:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:23PM +0100, Mel Gorman wrote:
> From: Mel Gorman <mel@csn.ul.ie>
>
> When kswapd is failing to keep zones above the min watermark, a process
> will enter direct reclaim in the same manner kswapd does. If a dirty
> page is encountered during the scan, this page is written to backing
> storage using mapping->writepage.
>
> This causes two problems. First, it can result in very deep call
> stacks, particularly if the target storage or filesystem are complex.
> Some filesystems ignore write requests from direct reclaim as a result.
> The second is that a single-page flush is inefficient in terms of IO.
> While there is an expectation that the elevator will merge requests,
> this does not always happen. Quoting Christoph Hellwig;
>
> The elevator has a relatively small window it can operate on,
> and can never fix up a bad large scale writeback pattern.
>
> This patch prevents direct reclaim writing back filesystem pages by
> checking if current is kswapd. Anonymous pages are still written to
> swap as there is no equivalent of a flusher thread for anonymous
> pages. If the dirty pages cannot be written back, they are placed
> back on the LRU lists.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Ok, so that makes the .writepage checks in ext4, xfs and btrfs for this
condition redundant. In effect the patch should be a no-op for those
filesystems. Can you also remove the checks in the filesystems?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 23:37 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-13 23:37 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched for cleaning from
> the page reclaim path. At normal priorities, this patch prevents kswapd
> writing pages.
>
> However, page reclaim does have a requirement that pages be freed
> in a particular zone. If it is failing to make sufficient progress
> (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> considered to be the point where kswapd is getting into trouble
> reclaiming pages. If this priority is reached, kswapd will dispatch
> pages for writing.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Seems reasonable, but btrfs will still ignore this writeback from
kswapd, and it doesn't fall over. Given that data point, I'd like to
see the results when you stop kswapd from doing writeback altogether
as well.
Can you try removing it altogether and seeing what that does to your
test results? i.e
if (page_is_file_cache(page)) {
inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
goto keep_locked;
}
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 23:41 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-13 23:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:25PM +0100, Mel Gorman wrote:
> Workloads that are allocating frequently and writing files place a
> large number of dirty pages on the LRU. With use-once logic, it is
> possible for them to reach the end of the LRU quickly requiring the
> reclaimer to scan more to find clean pages. Ordinarily, processes that
> are dirtying memory will get throttled by dirty balancing but this
> is a global heuristic and does not take into account that LRUs are
> maintained on a per-zone basis. This can lead to a situation whereby
> reclaim is scanning heavily, skipping over a large number of pages
> under writeback and recycling them around the LRU consuming CPU.
>
> This patch checks how many of the number of pages isolated from the
> LRU were dirty. If a percentage of them are dirty, the process will be
> throttled if a blocking device is congested or the zone being scanned
> is marked congested. The percentage that must be dirty depends on
> the priority. At default priority, all of them must be dirty. At
> DEF_PRIORITY-1, 50% of them must be dirty, DEF_PRIORITY-2, 25%
> etc. i.e. as pressure increases the greater the likelihood the process
> will get throttled to allow the flusher threads to make some progress.
It still doesn't take into account how many pages under writeback
were skipped. If there are lots of pages that are under writeback, I
think we still want to throttle to give IO a chance to complete and
clean those pages before scanning again....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 14:31 ` Mel Gorman
@ 2011-07-13 23:56 ` Dave Chinner
0 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-13 23:56 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched from the page
> reclaim path. If reclaim is encountering dirty pages, it implies that
> either reclaim is getting ahead of writeback or use-once logic has
> prioritised pages for reclaim that are young relative to when the
> inode was dirtied.
>
> When dirty pages are encountered on the LRU, this patch marks their
> inodes I_DIRTY_RECLAIM and wakes the background flusher. When the
> background flusher runs, it moves such inodes immediately to the
> dispatch queue regardless of inode age. There is no guarantee that
> the pages reclaim cares about will be cleaned first, but the
> expectation is that the flusher threads will clean those pages more
> quickly than if reclaim tried to clean a single page at a time.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
> fs/fs-writeback.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/fs.h | 5 ++-
> include/linux/writeback.h | 1 +
> mm/vmscan.c | 16 ++++++++++++-
> 4 files changed, 74 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 0f015a0..1201052 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -257,9 +257,23 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> struct super_block *sb = NULL;
> - struct inode *inode;
> + struct inode *inode, *tinode;
> int do_sb_sort = 0;
>
> + /* Move inodes reclaim found at end of LRU to dispatch queue */
> + list_for_each_entry_safe(inode, tinode, delaying_queue, i_wb_list) {
> + /* Move any inode found at end of LRU to dispatch queue */
> + if (inode->i_state & I_DIRTY_RECLAIM) {
> + inode->i_state &= ~I_DIRTY_RECLAIM;
> + list_move(&inode->i_wb_list, &tmp);
> +
> + if (sb && sb != inode->i_sb)
> + do_sb_sort = 1;
> + sb = inode->i_sb;
> + }
> + }
This is not a good idea. move_expired_inodes() already sucks a large
amount of CPU when there are lots of dirty inodes on the list (think
hundreds of thousands), and that is when the traversal terminates at
*older_than_this. It's not uncommon in my testing to see this
one function consume 30-35% of the bdi-flusher thread CPU usage
in such conditions.
By adding an entire list traversal in addition to the aging
traversal, this is going to significantly increase the CPU overhead
of the function, and hence could significantly increase
bdi->wb_list_lock contention and decrease writeback throughput.
> +
> + sb = NULL;
> while (!list_empty(delaying_queue)) {
> inode = wb_inode(delaying_queue->prev);
> if (older_than_this &&
> @@ -968,6 +982,46 @@ void wakeup_flusher_threads(long nr_pages)
> rcu_read_unlock();
> }
>
> +/*
> + * Similar to wakeup_flusher_threads except prioritise inodes contained
> + * in the page_list regardless of age
> + */
> +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> +{
> + struct page *page;
> + struct address_space *mapping;
> + struct inode *inode;
> +
> + list_for_each_entry(page, page_list, lru) {
> + if (!PageDirty(page))
> + continue;
> +
> + if (PageSwapBacked(page))
> + continue;
> +
> + lock_page(page);
> + mapping = page_mapping(page);
> + if (!mapping)
> + goto unlock;
> +
> + /*
> +	 * Test outside the lock to see if it is already set. The inode
> +	 * should be pinned by lock_page.
> + */
> + inode = page->mapping->host;
> + if (inode->i_state & I_DIRTY_RECLAIM)
> + goto unlock;
> +
> + spin_lock(&inode->i_lock);
> + inode->i_state |= I_DIRTY_RECLAIM;
> + spin_unlock(&inode->i_lock);
Micro optimisations like this are unnecessary - the inode->i_lock is
not contended.
As it is, this code won't really work as you think it might.
There's no guarantee a dirty inode is on the dirty list - it might have
already been expired, and it might even currently be under
writeback. In that case, if it is still dirty it goes to the
b_more_io list and writeback bandwidth is shared between all the
other dirty inodes and completely ignores this flag...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 21:39 ` Jan Kara
@ 2011-07-14 0:09 ` Dave Chinner
0 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-14 0:09 UTC (permalink / raw)
To: Jan Kara
Cc: Mel Gorman, Linux-MM, LKML, XFS, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 11:39:47PM +0200, Jan Kara wrote:
> On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaim that are young relative to when the
> > inode was dirtied.
> >
> > When dirty pages are encountered on the LRU, this patch marks their
> > inodes I_DIRTY_RECLAIM and wakes the background flusher. When the
> > background flusher runs, it moves such inodes immediately to the
> > dispatch queue regardless of inode age. There is no guarantee that
> > the pages reclaim cares about will be cleaned first, but the
> > expectation is that the flusher threads will clean those pages more
> > quickly than if reclaim tried to clean a single page at a time.
> Hmm, I was looking through your numbers but I didn't see any significant
> difference this patch would make. Do you?
>
> I was thinking about the problem, and actually doing IO from kswapd
> would only be a small problem if we submitted more than just a single
> page. Just to give you an idea: the time to write a single page on a
> plain SATA drive might be around 4 ms, while writing a sequential 4 MB
> of data takes around 80 ms (I made these numbers up but the orders of
> magnitude should be right).
I'm not so concerned about single drives - the numbers look far worse
when you have a high throughput filesystem. For argument's sake, let's
call that 1GB/s (even though I know of plenty of 10+GB/s XFS
filesystems out there). That gives you 4ms for a 4k IO, and 4MB of
data in 4ms seek + 4ms data transfer time, for 8ms total IO time.
> So to write 1000 times more
> data you just need like 20 times longer. That's a factor of 50 in IO
> efficiency.
In the case I tend to care about, it's more like factor of 1000 in
IO efficiency - 3 orders of magnitude or greater difference in
performance.
> So when reclaim/kswapd submits a single page IO once every
> couple of milliseconds, your IO throughput just went close to zero...
> BTW: I just checked your numbers in the fsmark test with the vanilla
> kernel. You wrote about 14500 pages from reclaim in 567 seconds. That
> is about one page per 39 ms, which is going to have a noticeable
> impact on IO throughput (not with XFS, because it plays tricks with
> writing more than asked, but with ext2 or ext3 you would see it I
> guess).
>
> So when kswapd sees a high percentage of dirty pages at the end of the
> LRU, it could call something like fdatawrite_range() for a 4 MB range
> (provided the file is large enough) containing that page. IO throughput
> would not be hit that much, and you would get a reasonably bounded time
> until the page gets cleaned... If you wanted to be clever, you could be
> more sophisticated in picking the file and range to write so that you
> get rid of the most pages at the end of the LRU, but I'm not sure it's
> worth the CPU cycles. Does this sound reasonable to you?
That's what Wu's patch did - it pushed it off to the bdi-flusher
because you can't call iput() in memory reclaim context and you need
a reference to the inode before calling fdatawrite_range().
As I mentioned for that patch, writing 4MB instead of a single page
will cause different problems - after just 25 dirty pages, we've
queued 100MB of IO and on a typical desktop system that will take at
least a second to complete. Now we get the opposite problem of IO
latency to clean a specific page and the potential to stall normal
background expired inode writeback forever if we keep hitting dirty
pages during page reclaim.
It's just yet another reason I'd really like to see numbers showing
that not doing IO from memory reclaim causes problems in the cases
where it is said to be needed (like reclaiming memory from a
specific node) and that issuing IO is the -only- solution. If numbers
can't be produced showing that we *need* to do IO from memory
reclaim, then why jump through hoops like we currently are trying to
fix all the nasty corner cases?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
@ 2011-07-14 0:09 ` Dave Chinner
0 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-14 0:09 UTC (permalink / raw)
To: Jan Kara
Cc: Mel Gorman, Linux-MM, LKML, XFS, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 11:39:47PM +0200, Jan Kara wrote:
> On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritise pages for reclaiming that are young relative to when the
> > inode was dirtied.
> >
> > When dirty pages are encounted on the LRU, this patch marks the inodes
> > I_DIRTY_RECLAIM and wakes the background flusher. When the background
> > flusher runs, it moves such inodes immediately to the dispatch queue
> > regardless of inode age. There is no guarantee that pages reclaim
> > cares about will be cleaned first but the expectation is that the
> > flusher threads will clean the page quicker than if reclaim tried to
> > clean a single page.
> Hmm, I was looking through your numbers but I didn't see any significant
> difference this patch would make. Do you?
>
> I was thinking about the problem and actually doing IO from kswapd would be
> a small problem if we submitted more than just a single page. Just to give
> you idea - time to write a single page on plain SATA drive might be like 4
> ms. Time to write sequential 4 MB of data is like 80 ms (I just made up
> these numbers but the orders should be right).
I'm not so concerned about single drives - the numbers look far worse
when you have a high throughput filesystem. For arguments sake, lets
call that 1GB/s (even though I know of plenty of 10+GB/s XFS
filesystems out there). That gives you 4ms for a 4k IO, and 4MB of
data in 4ms seek + 4ms data transfer time, for 8ms total IO time.
> So to write 1000 times more
> data you just need like 20 times longer. That's a factor of 50 in IO
> efficiency.
In the case I tend to care about, it's more like factor of 1000 in
IO efficiency - 3 orders of magnitude or greater difference in
performance.
> So when reclaim/kswapd submits a single page IO once every
> couple of miliseconds, your IO throughput just went close to zero...
> BTW: I just checked your numbers in fsmark test with vanilla kernel. You
> wrote like 14500 pages from reclaim in 567 seconds. That is about one page
> per 39 ms. That is going to have noticeable impact on IO throughput (not
> with XFS because it plays tricks with writing more than asked but with ext2
> or ext3 you would see it I guess).
>
> So when kswapd sees high percentage of dirty pages at the end of LRU, it
> could call something like fdatawrite_range() for the range of 4 MB
> (provided the file is large enough) containing that page and IO thoughput
> would not be hit that much and you will get reasonably bounded time when
> the page gets cleaned... If you wanted to be clever, you could possibly be
> more sophisticated in picking the file and range to write so that you get
> rid of the most pages at the end of LRU but I'm not sure it's worth the CPU
> cycles. Does this sound reasonable to you?
That's what Wu's patch did - it pushed it off to the bdi-flusher
because you can't call iput() in memory reclaim context and you need
a reference to the inode before calling fdatawrite_range().
As I mentioned for that patch, writing 4MB instead of a single page
will cause different problems - after just 25 dirty pages, we've
queued 100MB of IO and on a typical desktop system that will take at
least a second to complete. Now we get the opposite problem of IO
latency to clean a specific page and the potential to stall normal
background expired inode writeback forever if we keep hitting dirty
pages during page reclaim.
It's just yet another reason I'd really like to see numbers showing
that not doing IO from memory reclaim causes problems in the cases
where it is said to be needed (like reclaiming memory from a
specific node) and that issuing IO is the -only- solution. If numbers
can't be produced showing that we *need* to do IO from memory
reclaim, then why jump through hoops like we currently are trying to
fix all the nasty corner cases?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
2011-07-13 14:31 ` Mel Gorman
(?)
@ 2011-07-14 0:33 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-14 0:33 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 03:31:22PM +0100, Mel Gorman wrote:
> (Revisiting this from a year ago and following on from the thread
> "Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
> clustering". Posting a prototype to see if anything obvious is
> being missed)
Hi Mel,
Thanks for picking this up again. The results are definitely
promising, but I'd like to see a comparison against simply not doing
IO from memory reclaim at all combined with the enhancements in this
patchset. After all, that's what I keep asking for (so we can get
rid of .writepage altogether), and if the numbers don't add up, then
I'll shut up about it. ;)
.....
> use-once LRU logic). The command line for fs_mark looked something like
>
> ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
>
> The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
> this triggers the worst behaviour.
....
> During testing, a number of monitors were running to gather information
> from ftrace in particular. This disrupts the results of course because
> recording the information generates IO in itself but I'm ignoring
> that for the moment so the effect of the patches can be seen.
>
> I've posted the raw reports for each filesystem at
>
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
> http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html
.....
> Average files per second is increased by a nice percentage, albeit
> just within the standard deviation. Considering the type of test this is,
> variability was inevitable, but I will double-check without monitoring.
>
> The overhead (time spent in non-filesystem-related activities) is
> reduced a *lot* and is a lot less variable.
Given that userspace is doing the same amount of work in all test
runs, that implies that the userspace process is retaining its
working set hot in the cache over syscalls with this patchset.
> Direct reclaim work is significantly reduced going from 37% of all
> pages scanned to 1% with all patches applied. This implies that
> processes are getting stalled less.
And that directly implicates page scanning during direct reclaim as
the prime contributor to turfing the application's working set out
of the CPU cache....
> Page writes by reclaim is what is motivating this series. It goes
> from 14511 pages to 4084 which is a big improvement. We'll see later
> if these were anonymous or file-backed pages.
Which were anon pages, so this is a major improvement. However,
given that there were no dirty pages written directly by memory
reclaim, perhaps we don't need to do IO at all from here and
throttling is all that is needed? ;)
> Direct reclaim writes were never a problem according to this.
That's true, but we disable direct reclaim for other reasons, namely
that writeback from direct reclaim blows the stack.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-13 14:31 ` Mel Gorman
(?)
@ 2011-07-14 1:38 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 114+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 1:38 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Wed, 13 Jul 2011 15:31:23 +0100
Mel Gorman <mgorman@suse.de> wrote:
> From: Mel Gorman <mel@csn.ul.ie>
>
> When kswapd is failing to keep zones above the min watermark, a process
> will enter direct reclaim in the same manner kswapd does. If a dirty
> page is encountered during the scan, this page is written to backing
> storage using mapping->writepage.
>
> This causes two problems. First, it can result in very deep call
> stacks, particularly if the target storage or filesystem are complex.
> Some filesystems ignore write requests from direct reclaim as a result.
> The second is that a single-page flush is inefficient in terms of IO.
> While there is an expectation that the elevator will merge requests,
> this does not always happen. Quoting Christoph Hellwig;
>
> The elevator has a relatively small window it can operate on,
> and can never fix up a bad large scale writeback pattern.
>
> This patch prevents direct reclaim writing back filesystem pages by
> checking if current is kswapd. Anonymous pages are still written to
> swap as there is no equivalent of a flusher thread for anonymous
> pages. If the dirty pages cannot be written back, they are placed
> back on the LRU lists.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Hm.
> ---
> include/linux/mmzone.h | 1 +
> mm/vmscan.c | 9 +++++++++
> mm/vmstat.c | 1 +
> 3 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9f7c3eb..b70a0c0 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -100,6 +100,7 @@ enum zone_stat_item {
> NR_UNSTABLE_NFS, /* NFS unstable pages */
> NR_BOUNCE,
> NR_VMSCAN_WRITE,
> + NR_VMSCAN_WRITE_SKIP,
> NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
> NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
> NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f49535..2d3e5b6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> if (PageDirty(page)) {
> nr_dirty++;
>
> + /*
> + * Only kswapd can writeback filesystem pages to
> + * avoid risk of stack overflow
> + */
> + if (page_is_file_cache(page) && !current_is_kswapd()) {
> + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> + goto keep_locked;
> + }
> +
This will cause tons of memcg OOM kills because memcg reclaim gets no help from kswapd (for now).
Could you make this
if (scanning_global_lru(sc) && page_is_file_cache(page) && !current_is_kswapd())
...
Then... sorry, please keep the filesystem hook for a while. I'll do the memcg dirty_ratio work
myself if Greg does not post a new version by next month. After that, we can
remove scanning_global_lru(sc), I think.
Thanks,
-Kame
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 4:46 ` Christoph Hellwig
(?)
@ 2011-07-14 4:46 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 114+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 4:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, 14 Jul 2011 00:46:43 -0400
Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > + /*
> > > + * Only kswapd can writeback filesystem pages to
> > > + * avoid risk of stack overflow
> > > + */
> > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > + goto keep_locked;
> > > + }
> > > +
> >
> >
> > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
>
> XFS and btrfs already disable writeback from memcg context, as does ext4
> for the typical non-overwrite workloads, and none has fallen apart.
>
> In fact there's no way we can enable them as the memcg calling contexts
> tend to have massive stack usage.
>
Hmm, do XFS/btrfs add pages to the radix-tree from a deep stack?
Thanks,
-Kame
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 1:38 ` KAMEZAWA Hiroyuki
(?)
@ 2011-07-14 4:46 ` Christoph Hellwig
-1 siblings, 0 replies; 114+ messages in thread
From: Christoph Hellwig @ 2011-07-14 4:46 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > + /*
> > + * Only kswapd can writeback filesystem pages to
> > + * avoid risk of stack overflow
> > + */
> > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > + goto keep_locked;
> > + }
> > +
>
>
> This will cause tons of memcg OOM kill because we have no help of kswapd (now).
XFS and btrfs already disable writeback from memcg context, as does ext4
for the typical non-overwrite workloads, and none has fallen apart.
In fact there's no way we can enable them as the memcg calling contexts
tend to have massive stack usage.
* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
2011-07-14 0:33 ` Dave Chinner
(?)
@ 2011-07-14 4:51 ` Christoph Hellwig
-1 siblings, 0 replies; 114+ messages in thread
From: Christoph Hellwig @ 2011-07-14 4:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Mel Gorman, Linux-MM, LKML, XFS, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, Jul 14, 2011 at 10:33:40AM +1000, Dave Chinner wrote:
> patchset. After all, that's what I keep asking for (so we can get
> rid of .writepage altogether), and if the numbers don't add up, then
> I'll shut up about it. ;)
Unfortunately there are a few more users of ->writepage in addition to
memory reclaim. The most visible one is page migration, but there's also
the write_one_page helper used by a few filesystems, which would either
need to get a writepage-like callback or a bigger rewrite.
I agree that killing off ->writepage would be a worthwhile goal, though.
* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
@ 2011-07-14 4:51 ` Christoph Hellwig
0 siblings, 0 replies; 114+ messages in thread
From: Christoph Hellwig @ 2011-07-14 4:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Mel Gorman, Linux-MM, LKML, XFS, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, Jul 14, 2011 at 10:33:40AM +1000, Dave Chinner wrote:
> patchset. After all, that's what I keep asking for (so we can get
> rid of .writepage altogether), and if the numbers don't add up, then
> I'll shut up about it. ;)
Unfortunately there's a few more users of ->writepage in addition to
memory reclaim. The most visible on is page migration, but there's also
a write_one_page helper used by a few filesystems that would either
need to get a writepage-like callback or a bigger rewrite.
I agree that killing of ->writepage would be a worthwhile goal, though.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 6:19 ` Mel Gorman
(?)
@ 2011-07-14 6:17 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 114+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 6:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, 14 Jul 2011 07:19:15 +0100
Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > @@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > if (PageDirty(page)) {
> > > nr_dirty++;
> > >
> > > + /*
> > > + * Only kswapd can writeback filesystem pages to
> > > + * avoid risk of stack overflow
> > > + */
> > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > + goto keep_locked;
> > > + }
> > > +
> >
> >
> > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> >
> > Could you make this
> >
> > if (scanning_global_lru(sc) && page_is_file_cache(page) && !current_is_kswapd())
> > ...
> >
>
> I can, but as Christoph points out, the request is already being
> ignored so how will it help?
>
Hmm, OK, please go ahead as you are now. I'll hurry up and implement
dirty_ratio myself without waiting for the original patch author. I think
his latest version was very close to being merged... I'll start rebasing
it onto mmotm at the end of this month.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-13 23:34 ` Dave Chinner
(?)
@ 2011-07-14 6:17 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 6:17 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 09:34:49AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:23PM +0100, Mel Gorman wrote:
> > From: Mel Gorman <mel@csn.ul.ie>
> >
> > When kswapd is failing to keep zones above the min watermark, a process
> > will enter direct reclaim in the same manner kswapd does. If a dirty
> > page is encountered during the scan, this page is written to backing
> > storage using mapping->writepage.
> >
> > This causes two problems. First, it can result in very deep call
> > stacks, particularly if the target storage or filesystem are complex.
> > Some filesystems ignore write requests from direct reclaim as a result.
> > The second is that a single-page flush is inefficient in terms of IO.
> > While there is an expectation that the elevator will merge requests,
> > this does not always happen. Quoting Christoph Hellwig;
> >
> > The elevator has a relatively small window it can operate on,
> > and can never fix up a bad large scale writeback pattern.
> >
> > This patch prevents direct reclaim writing back filesystem pages by
> > checking if current is kswapd. Anonymous pages are still written to
> > swap as there is no equivalent of a flusher thread for anonymous
> > pages. If the dirty pages cannot be written back, they are placed
> > back on the LRU lists.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> Ok, so that makes the .writepage checks in ext4, xfs and btrfs for this
> condition redundant. In effect the patch should be a no-op for those
> filesystems. Can you also remove the checks in the filesystems?
>
I'll convert them to warnings just in case it regresses due to an
oversight.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 1:38 ` KAMEZAWA Hiroyuki
(?)
@ 2011-07-14 6:19 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 6:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > if (PageDirty(page)) {
> > nr_dirty++;
> >
> > + /*
> > + * Only kswapd can writeback filesystem pages to
> > + * avoid risk of stack overflow
> > + */
> > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > + goto keep_locked;
> > + }
> > +
>
>
> This will cause tons of memcg OOM kill because we have no help of kswapd (now).
>
> Could you make this
>
> if (scanning_global_lru(sc) && page_is_file_cache(page) && !current_is_kswapd())
> ...
>
I can, but as Christoph points out, the request is already being
ignored so how will it help?
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
2011-07-13 23:37 ` Dave Chinner
(?)
@ 2011-07-14 6:29 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 6:29 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched for cleaning from
> > the page reclaim path. At normal priorities, this patch prevents kswapd
> > writing pages.
> >
> > However, page reclaim does have a requirement that pages be freed
> > in a particular zone. If it is failing to make sufficient progress
> > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > considered to be the point where kswapd is getting into trouble
> > reclaiming pages. If this priority is reached, kswapd will dispatch
> > pages for writing.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> Seems reasonable, but btrfs still will ignore this writeback from
> kswapd, and it doesn't fall over.
At least there are no reports of it falling over :)
> Given that data point, I'd like to
> see the results when you stop kswapd from doing writeback altogether
> as well.
>
The results for this test will be identical because the ftrace results
show that kswapd is already writing 0 filesystem pages.
Where it makes a difference is when the system is under enough
pressure that it is failing to reclaim any memory and is in danger
of prematurely triggering the OOM killer. Andrea outlined some of
the concerns before at http://lkml.org/lkml/2010/6/15/246
> Can you try removing it altogether and seeing what that does to your
> test results? i.e
>
> if (page_is_file_cache(page)) {
> inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> goto keep_locked;
> }
It won't do anything, it'll still be writing 0 filesystem-backed pages.
Because of the possibility for the OOM killer triggering prematurely due
to the inability of kswapd to write pages, I'd prefer to separate such a
change by at least one release so that if there is an increase in OOM
reports, it'll be obvious what was the culprit.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
2011-07-13 23:41 ` Dave Chinner
(?)
@ 2011-07-14 6:33 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 6:33 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 09:41:50AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:25PM +0100, Mel Gorman wrote:
> > Workloads that are allocating frequently and writing files place a
> > large number of dirty pages on the LRU. With use-once logic, it is
> > possible for them to reach the end of the LRU quickly requiring the
> > reclaimer to scan more to find clean pages. Ordinarily, processes that
> > are dirtying memory will get throttled by dirty balancing but this
> > is a global heuristic and does not take into account that LRUs are
> > maintained on a per-zone basis. This can lead to a situation whereby
> > reclaim is scanning heavily, skipping over a large number of pages
> > under writeback and recycling them around the LRU consuming CPU.
> >
> > This patch checks how many of the pages isolated from the
> > LRU were dirty. If a percentage of them are dirty, the process will be
> > throttled if a blocking device is congested or the zone being scanned
> > is marked congested. The percentage that must be dirty depends on
> > the priority. At default priority, all of them must be dirty. At
> > DEF_PRIORITY-1, 50% of them must be dirty, DEF_PRIORITY-2, 25%
> > etc. i.e. as pressure increases the greater the likelihood the process
> > will get throttled to allow the flusher threads to make some progress.
>
> It still doesn't take into account how many pages under writeback
> were skipped. If there are lots of pages that are under writeback, I
> think we still want to throttle to give IO a chance to complete and
> clean those pages before scanning again....
>
An earlier revision did take them into account but in these tests at
least, 0 pages at the end of the LRU were PageWriteback. I expect this
to change when multiple processes and CPUs are in use but am ignoring
it for the moment.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
@ 2011-07-14 6:33 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 6:33 UTC (permalink / raw)
To: Dave Chinner
Cc: Rik van Riel, Jan Kara, LKML, XFS, Christoph Hellwig, Linux-MM,
Minchan Kim, Wu Fengguang, Johannes Weiner
On Thu, Jul 14, 2011 at 09:41:50AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:25PM +0100, Mel Gorman wrote:
> > Workloads that are allocating frequently and writing files place a
> > large number of dirty pages on the LRU. With use-once logic, it is
> > possible for them to reach the end of the LRU quickly requiring the
> > reclaimer to scan more to find clean pages. Ordinarily, processes that
> > are dirtying memory will get throttled by dirty balancing but this
> > is a global heuristic and does not take into account that LRUs are
> > maintained on a per-zone basis. This can lead to a situation whereby
> > reclaim is scanning heavily, skipping over a large number of pages
> > under writeback and recycling them around the LRU consuming CPU.
> >
> > This patch checks how many of the number of pages isolated from the
> > LRU were dirty. If a percentage of them are dirty, the process will be
> > throttled if a blocking device is congested or the zone being scanned
> > is marked congested. The percentage that must be dirty depends on
> > the priority. At default priority, all of them must be dirty. At
> > DEF_PRIORITY-1, 50% of them must be dirty, DEF_PRIORITY-2, 25%
> > etc. i.e. as pressure increases the greater the likelihood the process
> > will get throttled to allow the flusher threads to make some progress.
>
> It still doesn't take into account how many pages under writeback
> were skipped. If there are lots of pages that are under writeback, I
> think we still want to throttle to give IO a chance to complete and
> clean those pages before scanning again....
>
An earlier revision did take them into account but in these tests at
least, 0 pages at the end of the LRU were PageWriteback. I expect this
to change when multiple processes and CPUs were in use but am ignoring
it for the moment.
--
Mel Gorman
SUSE Labs
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
@ 2011-07-14 6:33 ` Mel Gorman
0 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 6:33 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 09:41:50AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:25PM +0100, Mel Gorman wrote:
> > Workloads that are allocating frequently and writing files place a
> > large number of dirty pages on the LRU. With use-once logic, it is
> > possible for them to reach the end of the LRU quickly requiring the
> > reclaimer to scan more to find clean pages. Ordinarily, processes that
> > are dirtying memory will get throttled by dirty balancing but this
> > is a global heuristic and does not take into account that LRUs are
> > maintained on a per-zone basis. This can lead to a situation whereby
> > reclaim is scanning heavily, skipping over a large number of pages
> > under writeback and recycling them around the LRU consuming CPU.
> >
> > This patch checks how many of the number of pages isolated from the
> > LRU were dirty. If a percentage of them are dirty, the process will be
> > throttled if a blocking device is congested or the zone being scanned
> > is marked congested. The percentage that must be dirty depends on
> > the priority. At default priority, all of them must be dirty. At
> > DEF_PRIORITY-1, 50% of them must be dirty, DEF_PRIORITY-2, 25%
> > etc. i.e. as pressure increases the greater the likelihood the process
> > will get throttled to allow the flusher threads to make some progress.
>
> It still doesn't take into account how many pages under writeback
> were skipped. If there are lots of pages that are under writeback, I
> think we still want to throttle to give IO a chance to complete and
> clean those pages before scanning again....
>
An earlier revision did take them into account but in these tests at
least, 0 pages at the end of the LRU were PageWriteback. I expect this
to change when multiple processes and CPUs are in use but am ignoring
it for the moment.
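For reference, the priority-scaled check described in the patch leader
(all isolated pages dirty at DEF_PRIORITY, 50% at DEF_PRIORITY-1, 25% at
DEF_PRIORITY-2, ...) reduces to a shift of the isolated-page count. A
userspace sketch of that arithmetic, not the patch's literal code
(`dirty_throttle_threshold` and `should_throttle` are illustrative names;
DEF_PRIORITY is 12 in kernels of this era):

```c
#include <assert.h>

#define DEF_PRIORITY 12

/*
 * At DEF_PRIORITY every isolated page must be dirty before throttling;
 * each priority level below that halves the required count.
 */
static unsigned long dirty_throttle_threshold(unsigned long nr_taken,
					      int priority)
{
	return nr_taken >> (DEF_PRIORITY - priority);
}

static int should_throttle(unsigned long nr_dirty, unsigned long nr_taken,
			   int priority, int bdi_or_zone_congested)
{
	/* Throttle only when congested AND enough of the batch was dirty */
	return bdi_or_zone_congested &&
	       nr_dirty >= dirty_throttle_threshold(nr_taken, priority);
}
```

So as reclaim priority rises the threshold drops geometrically, making it
progressively more likely a congested scanner sleeps and lets the flusher
threads make progress.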
--
Mel Gorman
SUSE Labs
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 21:39 ` Jan Kara
@ 2011-07-14 7:03 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 7:03 UTC (permalink / raw)
To: Jan Kara
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Rik van Riel, Minchan Kim
On Wed, Jul 13, 2011 at 11:39:47PM +0200, Jan Kara wrote:
> On Wed 13-07-11 15:31:27, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaiming that are young relative to when the
> > inode was dirtied.
> >
> > When dirty pages are encountered on the LRU, this patch marks the inodes
> > I_DIRTY_RECLAIM and wakes the background flusher. When the background
> > flusher runs, it moves such inodes immediately to the dispatch queue
> > regardless of inode age. There is no guarantee that pages reclaim
> > cares about will be cleaned first but the expectation is that the
> > flusher threads will clean the page quicker than if reclaim tried to
> > clean a single page.
> Hmm, I was looking through your numbers but I didn't see any significant
> difference this patch would make. Do you?
>
Marginal and well within noise. I'm very skeptical about the patch
but the VM needs some way of prioritising what pages are getting
written back so that pages in a particular zone can be cleaned.
> I was thinking about the problem and actually doing IO from kswapd would be
> a small problem if we submitted more than just a single page. Just to give
> you idea - time to write a single page on plain SATA drive might be like 4
> ms. Time to write sequential 4 MB of data is like 80 ms (I just made up
> these numbers but the orders should be right).
It's as good a number as any for argument's sake. It's not the
first time such a patch has done the rounds. The last one I did along
similar lines was http://lkml.org/lkml/2010/6/8/85 although I mucked
it up with respect to racing with iput().
Wu posted a patch that deferred the writing of ranges to a
flusher thread http://www.spinics.net/lists/xfs/msg05659.html
which Dave has already commented on at
http://www.spinics.net/lists/xfs/msg05665.html. The clustering size
could be easily fixed but the scalability problem he pointed out is
a far greater problem.
> So to write 1000 times more
> data you just need like 20 times longer. That's a factor of 50 in IO
> efficiency. So when reclaim/kswapd submits a single page IO once every
> couple of milliseconds, your IO throughput just went close to zero...
> BTW: I just checked your numbers in fsmark test with vanilla kernel. You
> wrote like 14500 pages from reclaim in 567 seconds. That is about one page
> per 39 ms. That is going to have noticeable impact on IO throughput (not
> with XFS because it plays tricks with writing more than asked but with ext2
> or ext3 you would see it I guess).
>
> So when kswapd sees high percentage of dirty pages at the end of LRU, it
> could call something like fdatawrite_range() for the range of 4 MB
> (provided the file is large enough) containing that page and IO throughput
> would not be hit that much and you will get reasonably bounded time when
> the page gets cleaned... If you wanted to be clever, you could possibly be
> more sophisticated in picking the file and range to write so that you get
> rid of the most pages at the end of LRU but I'm not sure it's worth the CPU
> cycles. Does this sound reasonable to you?
>
Semi-reasonable and it's along the same lines as what
http://lkml.org/lkml/2010/6/8/85 tried to achieve but maybe the effort
of fixing it up with respect to racing with iput() just isn't worth it.
I think I'll leave it so that kswapd calls ->writepage if the priority is
high enough, until a good solution for how the VM can tell the flusher to
prioritise a particular page is devised.
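Jan's suggestion amounts to expanding a single-page writeback target into
an aligned range write. A hedged sketch of just the range selection (the
kernel side would then issue something like
`filemap_fdatawrite_range(mapping, start, end)` on the result; the helper
below is hypothetical userspace arithmetic, not kernel code):

```c
#include <assert.h>

/*
 * Given the byte offset of the dirty page and the inode size, pick the
 * 4 MB-aligned chunk containing it, clamped to the end of the file.
 */
static void pick_writeback_range(long long pos, long long isize,
				 long long *start, long long *end)
{
	const long long chunk = 4LL << 20;	/* 4 MB */

	*start = pos & ~(chunk - 1);		/* round down to chunk */
	*end = *start + chunk - 1;		/* inclusive end of chunk */
	if (*end >= isize)
		*end = isize - 1;		/* don't run past EOF */
}
```

The point of the alignment is that the ~1000 extra pages are sequential,
so (per Jan's made-up but order-correct numbers) they cost ~20x one seek
rather than ~1000x.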
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 23:56 ` Dave Chinner
@ 2011-07-14 7:30 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 7:30 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 09:56:06AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaiming that are young relative to when the
> > inode was dirtied.
> >
> > When dirty pages are encountered on the LRU, this patch marks the inodes
> > I_DIRTY_RECLAIM and wakes the background flusher. When the background
> > flusher runs, it moves such inodes immediately to the dispatch queue
> > regardless of inode age. There is no guarantee that pages reclaim
> > cares about will be cleaned first but the expectation is that the
> > flusher threads will clean the page quicker than if reclaim tried to
> > clean a single page.
> >
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> > fs/fs-writeback.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-
> > include/linux/fs.h | 5 ++-
> > include/linux/writeback.h | 1 +
> > mm/vmscan.c | 16 ++++++++++++-
> > 4 files changed, 74 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 0f015a0..1201052 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -257,9 +257,23 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> > LIST_HEAD(tmp);
> > struct list_head *pos, *node;
> > struct super_block *sb = NULL;
> > - struct inode *inode;
> > + struct inode *inode, *tinode;
> > int do_sb_sort = 0;
> >
> > + /* Move inodes reclaim found at end of LRU to dispatch queue */
> > + list_for_each_entry_safe(inode, tinode, delaying_queue, i_wb_list) {
> > + /* Move any inode found at end of LRU to dispatch queue */
> > + if (inode->i_state & I_DIRTY_RECLAIM) {
> > + inode->i_state &= ~I_DIRTY_RECLAIM;
> > + list_move(&inode->i_wb_list, &tmp);
> > +
> > + if (sb && sb != inode->i_sb)
> > + do_sb_sort = 1;
> > + sb = inode->i_sb;
> > + }
> > + }
>
> This is not a good idea. move_expired_inodes() already sucks a large
> amount of CPU when there are lots of dirty inodes on the list (think
> hundreds of thousands), and that is when the traversal terminates at
> *older_than_this. It's not uncommon in my testing to see this
> one function consume 30-35% of the bdi-flusher thread CPU usage
> in such conditions.
>
I thought this might be the case. I wasn't sure how bad it could be but
I mentioned in the leader it might be a problem. I'll consider other
ways that pages found at the end of the LRU could be prioritised for
writeback.
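Dave's objection is about walk length: the expiry walk can terminate at
the first not-yet-expired inode because the delaying queue is kept in
dirtied_when order, whereas a walk looking for scattered I_DIRTY_RECLAIM
flags must traverse the entire list. A toy userspace model of the two
traversals (names and structure are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
	unsigned long dirtied_when;
	int reclaim_flagged;
};

/* Queue is oldest-first, so expiry stops at the first "too new" inode. */
static size_t expiry_walk(const struct toy_inode *q, size_t n,
			  unsigned long older_than_this)
{
	size_t visited = 0;

	while (visited < n) {
		visited++;
		if (q[visited - 1].dirtied_when > older_than_this)
			break;
	}
	return visited;
}

/* Flags can be anywhere, so every inode must be examined. */
static size_t flag_walk(const struct toy_inode *q, size_t n,
			size_t *nr_flagged)
{
	size_t visited = 0, i;

	*nr_flagged = 0;
	for (i = 0; i < n; i++) {
		visited++;
		if (q[i].reclaim_flagged)
			(*nr_flagged)++;	/* would go to dispatch queue */
	}
	return visited;
}
```

With hundreds of thousands of dirty inodes and typically few flagged ones,
the full traversal is where the extra flusher-thread CPU time would go.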
> > <SNIP>
> > +
> > + sb = NULL;
> > while (!list_empty(delaying_queue)) {
> > inode = wb_inode(delaying_queue->prev);
> > if (older_than_this &&
> > @@ -968,6 +982,46 @@ void wakeup_flusher_threads(long nr_pages)
> > rcu_read_unlock();
> > }
> >
> > +/*
> > + * Similar to wakeup_flusher_threads except prioritise inodes contained
> > + * in the page_list regardless of age
> > + */
> > +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list)
> > +{
> > + struct page *page;
> > + struct address_space *mapping;
> > + struct inode *inode;
> > +
> > + list_for_each_entry(page, page_list, lru) {
> > + if (!PageDirty(page))
> > + continue;
> > +
> > + if (PageSwapBacked(page))
> > + continue;
> > +
> > + lock_page(page);
> > + mapping = page_mapping(page);
> > + if (!mapping)
> > + goto unlock;
> > +
> > + /*
> > + * Test outside the lock to see if it is already set. The inode
> > + * should be pinned by lock_page()
> > + */
> > + inode = page->mapping->host;
> > + if (inode->i_state & I_DIRTY_RECLAIM)
> > + goto unlock;
> > +
> > + spin_lock(&inode->i_lock);
> > + inode->i_state |= I_DIRTY_RECLAIM;
> > + spin_unlock(&inode->i_lock);
>
> Micro optimisations like this are unnecessary - the inode->i_lock is
> not contended.
>
This patch was brought forward from a time when it would have been
taking the global inode_lock. I wasn't sure how badly inode->i_lock
was being contended and hadn't set up lock stats. Thanks for the
clarification.
> As it is, this code won't really work as you think it might.
> There's no guarantee a dirty inode is on the dirty list - it might have
> already been expired, and it might even currently be under
> writeback. In that case, if it is still dirty it goes to the
> b_more_io list and writeback bandwidth is shared between all the
> other dirty inodes and completely ignores this flag...
>
Ok, it's a total bust. If I revisit this at all, it'll either be in
the context of Wu's approach or calling fdatawrite_range but it
might be pointless and overall it might just be better for now to
leave kswapd calling ->writepage if reclaim is failing and priority
is raised.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again)
2011-07-14 0:33 ` Dave Chinner
@ 2011-07-14 7:37 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 7:37 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 10:33:40AM +1000, Dave Chinner wrote:
> On Wed, Jul 13, 2011 at 03:31:22PM +0100, Mel Gorman wrote:
> > (Revisting this from a year ago and following on from the thread
> > "Re: [PATCH 03/27] xfs: use write_cache_pages for writeback
> > clustering". Posting an prototype to see if anything obvious is
> > being missed)
>
> Hi Mel,
>
> Thanks for picking this up again. The results are definitely
> promising, but I'd like to see a comparison against simply not doing
> IO from memory reclaim at all combined with the enhancements in this
> patchset.
Covered elsewhere. In these tests we are already writing 0 pages so it
won't make a difference, and I'm wary of eliminating writes entirely
unless kswapd has a way of prioritising the pages the flusher writes back,
because of the risk of premature OOM kill.
> After all, that's what I keep asking for (so we can get
> rid of .writepage altogether), and if the numbers don't add up, then
> I'll shut up about it. ;)
>
Christoph covered this.
> .....
>
> > use-once LRU logic). The command line for fs_mark looked something like
> >
> > ./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760
> >
> > The machine was booted with "nr_cpus=1 mem=512M" as according to Dave
> > this triggers the worst behaviour.
> ....
> > During testing, a number of monitors were running to gather information
> > from ftrace in particular. This disrupts the results of course because
> > recording the information generates IO in itself but I'm ignoring
> > that for the moment so the effect of the patches can be seen.
> >
> > I've posted the raw reports for each filesystem at
> >
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext3/sandy/comparison.html
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-ext4/sandy/comparison.html
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-btrfs/sandy/comparison.html
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110713/writeback-xfs/sandy/comparison.html
> .....
> > Average files per second is increased by a nice percentage albeit
> > just within the standard deviation. Consider the type of test this is,
> > variability was inevitable but will double check without monitoring.
> >
> > The overhead (time spent in non-filesystem-related activities) is
> > reduced a *lot* and is a lot less variable.
>
> Given that userspace is doing the same amount of work in all test
> runs, that implies that the userspace process is retaining its
> working set hot in the cache over syscalls with this patchset.
>
It's one possibility. The more likely one is that fs_mark's anonymous
pages are getting swapped out, leading to variability. If IO is less
seeky as a result of the change, the swap-ins and swap-outs would be faster.
> > Direct reclaim work is significantly reduced going from 37% of all
> > pages scanned to 1% with all patches applied. This implies that
> > processes are getting stalled less.
>
> And that directly implicates page scanning during direct reclaim as
> the prime contributor to turfing the application's working set out
> of the CPU cache....
>
It's a possibility.
> > Page writes by reclaim is what is motivating this series. It goes
> > from 14511 pages to 4084 which is a big improvement. We'll see later
> > if these were anonymous or file-backed pages.
>
> Which were anon pages, so this is a major improvement. However,
> given that there were no dirty pages writen directly by memory
> reclaim, perhaps we don't need to do IO at all from here and
> throttling is all that is needed? ;)
>
I wouldn't bet my life on it due to the potential premature OOM kill
problem if we cannot reclaim pages at all :)
> > Direct reclaim writes were never a problem according to this.
>
> That's true, but we disable direct reclaim for other reasons, namely
> that writeback from direct reclaim blows the stack.
>
Correct. I should have been clearer and said direct reclaim wasn't
a problem in terms of queueing pages for IO.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
2011-07-14 6:29 ` Mel Gorman
(?)
@ 2011-07-14 11:52 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-14 11:52 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > It is preferable that no dirty pages are dispatched for cleaning from
> > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > writing pages.
> > >
> > > However, page reclaim does have a requirement that pages be freed
> > > in a particular zone. If it is failing to make sufficient progress
> > > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > considered to be the point where kswapd is getting into trouble
> > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > pages for writing.
> > >
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> >
> > Seems reasonable, but btrfs still will ignore this writeback from
> > kswapd, and it doesn't fall over.
>
> At least there are no reports of it falling over :)
However you want to spin it.
> > Given that data point, I'd like to
> > see the results when you stop kswapd from doing writeback altogether
> > as well.
> >
>
> The results for this test will be identical because the ftrace results
> show that kswapd is already writing 0 filesystem pages.
You mean these numbers:
Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
Which shows that kswapd, under this workload, has been improved to
the point that it doesn't need to do IO. Yes, you've addressed the
one problematic workload, but the numbers do not provide the answers
to the fundamental questions that have been raised during
discussions. i.e. do we even need IO at all from reclaim?
> Where it makes a difference is when the system is under enough
> pressure that it is failing to reclaim any memory and is in danger
> of prematurely triggering the OOM killer. Andrea outlined some of
> the concerns before at http://lkml.org/lkml/2010/6/15/246
So put the system under more pressure such that with this patch
series memory reclaim still writes from kswapd. Can you even get it
to that stage, and if you can, does the system OOM more or less if
you don't do file IO from reclaim?
> > Can you try removing it altogether and seeing what that does to your
> > test results? i.e
> >
> > if (page_is_file_cache(page)) {
> > inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > goto keep_locked;
> > }
>
> It won't do anything, it'll still be writing 0 filesystem-backed pages.
>
> Because of the possibility for the OOM killer triggering prematurely due
> to the inability of kswapd to write pages, I'd prefer to separate such a
> change by at least one release so that if there is an increase in OOM
> reports, it'll be obvious what was the culprit.
I'm not asking for release-quality patches, or even about when such
fixes would roll out.
What you've shown here is that memory reclaim can be more efficient
without issuing IO itself under medium memory pressure. Now the
question is whether it can do so under heavy, sustained, near OOM
memory pressure?
IOWs, what I want to see is whether the fundamental principle of
IO-less reclaim can be validated as workable or struck down. This
patchset demonstrates that IO-less reclaim is superior for a
workload that produces medium levels of sustained IO-based memory
pressure, which leads to the conclusion that the approach has merit
and needs further investigation.
It's that next step that I'm asking you to test now. What form
potential changes take or when they are released is irrelevant to me
at this point, because we still haven't determined if the
fundamental concept is completely sound or not. If the concept is
sound I'm quite happy to wait until the implementation is fully
baked before it gets rolled out....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
2011-07-14 11:52 ` Dave Chinner
(?)
@ 2011-07-14 13:17 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 13:17 UTC (permalink / raw)
To: Dave Chinner
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 09:52:21PM +1000, Dave Chinner wrote:
> On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> > On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > > It is preferable that no dirty pages are dispatched for cleaning from
> > > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > > writing pages.
> > > >
> > > > However, page reclaim does have a requirement that pages be freed
> > > > in a particular zone. If it is failing to make sufficient progress
> > > > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > > considered to be the point where kswapd is getting into trouble
> > > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > > pages for writing.
> > > >
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > >
> > > Seems reasonable, but btrfs still will ignore this writeback from
> > > kswapd, and it doesn't fall over.
> >
> > At least there are no reports of it falling over :)
>
> However you want to spin it.
>
I regret that it is coming across as spin. My primary concern is
that if we get OOM-related bugs later due to this series, it'll be
difficult to pinpoint whether the whole series is at fault or whether
preventing kswapd from writing any pages was at fault.
> > > Given that data point, I'd like to
> > > see the results when you stop kswapd from doing writeback altogether
> > > as well.
> > >
> >
> > The results for this test will be identical because the ftrace results
> > show that kswapd is already writing 0 filesystem pages.
>
> You mean these numbers:
>
> Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
>
> Which shows that kswapd, under this workload has been improved to
> the point that it doesn't need to do IO. Yes, you've addressed the
> one problematic workload, but the numbers do not provide the answers
> to the fundamental question that have been raised during
> discussions. i.e. do we even need IO at all from reclaim?
>
I don't know, and at best I will only be able to test with a single
disk, which is why I wanted to separate this series from completely
preventing kswapd from writing pages. I may be able to get access to
a machine with more disks but it'll take time.
> > Where it makes a difference is when the system is under enough
> > pressure that it is failing to reclaim any memory and is in danger
> > of prematurely triggering the OOM killer. Andrea outlined some of
> > the concerns before at http://lkml.org/lkml/2010/6/15/246
>
> So put the system under more pressure such that with this patch
> series memory reclaim still writes from kswapd. Can you even get it
> to that stage, and if you can, does the system OOM more or less if
> you don't do file IO from reclaim?
>
I can set up such a test, but it'll be at least next week before I
configure it and get it queued. It'll probably take a few
days to run then because more iterations will be required to pinpoint
where the OOM threshold is. I know from the past that pushing a
system near OOM causes a non-deterministic number of triggers that
depend heavily on what was killed, so the only real choice is to start
light and increase the load until boom, which is time-consuming.
Even then, the test will be inconclusive because it'll be just one
or two machines that I'll have to test on. There will be important
corner cases that I won't be able to test for. For example;
o small lowest zone that is critical for operation for some reason,
where pages must be cleaned from there even though there is a large
amount of memory overall
o small highest zone causing high kswapd usage as it continually
fails to balance due to pages being dirtied constantly, with the
window between when the flushers clean a page and when kswapd
reclaims it being too big. I might be able to simulate this one
but bugs of this nature tend to be workload-specific and affect
some machines worse than others
o Machines with many nodes and dirty pages spread semi-randomly
on all nodes. If the flusher thread is not cleaning pages from
a particular node that is under memory pressure due to affinity,
processes will stall for long periods of time until the relevant
inodes expire and get cleaned. This will be particularly
problematic if zone_reclaim is enabled
Questions about scenarios like this are going to cause problems in
review because it's reasonable to ask if any of them can occur and
we can't give an iron-clad answer.
> > > Can you try removing it altogether and seeing what that does to your
> > > test results? i.e
> > >
> > > if (page_is_file_cache(page)) {
> > > inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > goto keep_locked;
> > > }
> >
> > It won't do anything, it'll still be writing 0 filesystem-backed pages.
> >
> > Because of the possibility for the OOM killer triggering prematurely due
> > to the inability of kswapd to write pages, I'd prefer to separate such a
> > change by at least one release so that if there is an increase in OOM
> > reports, it'll be obvious what was the culprit.
>
> I'm not asking for release quality patches or even when such fixes
> would roll out.
>
Very well. I was hoping to start with just this series and handle the
complete disabling of writing later, but that can wait a few weeks too.
It was always a stretch that the next merge window was going to be hit.
> What you've shown here is that memory reclaim can be more efficient
> without issuing IO itself under medium memory pressure. Now the
> question is whether it can do so under heavy, sustained, near OOM
> memory pressure?
>
> IOWs, what I want to see is whether the fundamental principle of
> IO-less reclaim can be validated as workable or struck down. This
> patchset demonstrates that IO-less reclaim is superior for a
> workload that produces medium levels of sustained IO-based memory
> pressure, which leads to the conclusion that the approach has merit
> and needs further investigation.
>
> It's that next step that I'm asking you to test now. What form
> potential changes take or when they are released is irrelevant to me
> at this point, because we still haven't determined if the
> fundamental concept is completely sound or not. If the concept is
> sound I'm quite happy to wait until the implementation is fully
> baked before it gets rolled out....
>
I'll setup a suitable test next week then.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 4:46 ` KAMEZAWA Hiroyuki
(?)
@ 2011-07-14 15:07 ` Christoph Hellwig
-1 siblings, 0 replies; 114+ messages in thread
From: Christoph Hellwig @ 2011-07-14 15:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > XFS and btrfs already disable writeback from memcg context, as does ext4
> > for the typical non-overwrite workloads, and none has fallen apart.
> >
> > In fact there's no way we can enable them as the memcg calling contexts
> > tend to have massive stack usage.
> >
>
> Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
We're using a fairly deep stack in normal buffered read/write,
which is almost 100% common code. It's not just the long callchain
(see below), but also that we put the unneeded kiocb and a vector
of iovecs on the stack:
vfs_writev
do_readv_writev
do_sync_write
generic_file_aio_write
__generic_file_aio_write
generic_file_buffered_write
generic_perform_write
block_write_begin
grab_cache_page_write_begin
add_to_page_cache_lru
add_to_page_cache
add_to_page_cache_locked
mem_cgroup_cache_charge
this might additionally come from in-kernel callers like nfsd,
which have even more stack space already in use. And at this point we
only enter the memcg/reclaim code, which, the last time I had a stack
trace, ate up about another 3k of stack space.
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-13 14:31 ` Mel Gorman
(?)
@ 2011-07-14 15:09 ` Christoph Hellwig
-1 siblings, 0 replies; 114+ messages in thread
From: Christoph Hellwig @ 2011-07-14 15:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched from the page
> reclaim path. If reclaim is encountering dirty pages, it implies that
> either reclaim is getting ahead of writeback or use-once logic has
> prioritised pages for reclaim that are young relative to when the
> inode was dirtied.
What does this buy us? If anything, we should prioritize by zone,
e.g. tell write_cache_pages only to bother with writing things out
if the dirty page is in a given zone. We'd probably still cluster
around it to make sure we get good I/O patterns, but would only start
I/O if it has a page we actually care about.
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing
2011-07-14 15:09 ` Christoph Hellwig
(?)
@ 2011-07-14 15:49 ` Mel Gorman
-1 siblings, 0 replies; 114+ messages in thread
From: Mel Gorman @ 2011-07-14 15:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Linux-MM, LKML, XFS, Dave Chinner, Johannes Weiner, Wu Fengguang,
Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 11:09:59AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 13, 2011 at 03:31:27PM +0100, Mel Gorman wrote:
> > It is preferable that no dirty pages are dispatched from the page
> > reclaim path. If reclaim is encountering dirty pages, it implies that
> > either reclaim is getting ahead of writeback or use-once logic has
> > prioritised pages for reclaim that are young relative to when the
> > inode was dirtied.
>
> what does this buy us?
Very little. The vague intention was to avoid a situation where
kswapd's priority was raised such that it had to write pages to clean
a particular zone.
> If anything, we should prioritize by zone,
> e.g. tell write_cache_pages only to bother with writing things out
> if the dirty page is in a given zone. We'd probably still cluster
> around it to make sure we get good I/O patterns, but would only start
> I/O if it has a page we actually care about.
>
That would make more sense. I've dropped this patch entirely.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 15:07 ` Christoph Hellwig
(?)
@ 2011-07-14 23:55 ` KAMEZAWA Hiroyuki
-1 siblings, 0 replies; 114+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-14 23:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, 14 Jul 2011 11:07:00 -0400
Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > for the typical non-overwrite workloads, and none has fallen apart.
> > >
> > > In fact there's no way we can enable them as the memcg calling contexts
> > > tend to have massive stack usage.
> > >
> >
> > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
>
> We're using a fairly deep stack in normal buffered read/write,
> which is almost 100% common code. It's not just the long callchain
> (see below), but also that we put the unneeded kiocb and a vector
> of iovecs on the stack:
>
> vfs_writev
> do_readv_writev
> do_sync_write
> generic_file_aio_write
> __generic_file_aio_write
> generic_file_buffered_write
> generic_perform_write
> block_write_begin
> grab_cache_page_write_begin
> add_to_page_cache_lru
> add_to_page_cache
> add_to_page_cache_locked
> mem_cgroup_cache_charge
>
> this might additionally come from in-kernel callers like nfsd,
> which has even more stack space used. And at this point we only
> enter the memcg/reclaim code, which last time I had a stack trace
> ate up another about 3k of stack space.
>
Hmm. I'll prepare two features for memcg:
1. asynchronous memory reclaim, as kswapd does
2. dirty_ratio
Please remove ->writepage first. It may break memcg, but that happens
sometimes; we'll fix it.
Thanks,
-Kame
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-14 4:46 ` KAMEZAWA Hiroyuki
@ 2011-07-15 2:22 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-15 2:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 14 Jul 2011 00:46:43 -0400
> Christoph Hellwig <hch@infradead.org> wrote:
>
> > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > + /*
> > > > + * Only kswapd can writeback filesystem pages to
> > > > + * avoid risk of stack overflow
> > > > + */
> > > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > + goto keep_locked;
> > > > + }
> > > > +
> > >
> > >
> > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> >
> > XFS and btrfs already disable writeback from memcg context, as does ext4
> > for the typical non-overwrite workloads, and none has fallen apart.
> >
> > In fact there's no way we can enable them as the memcg calling contexts
> > tend to have massive stack usage.
> >
>
> Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
Here's an example writeback stack trace. Notice how deep it is from
the __writepage() call?
$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (50 entries)
----- ---- --------
0) 5000 80 enqueue_task_fair+0x63/0x4f0
1) 4920 48 enqueue_task+0x6a/0x80
2) 4872 32 activate_task+0x2d/0x40
3) 4840 32 ttwu_activate+0x21/0x50
4) 4808 32 T.2130+0x3c/0x60
5) 4776 112 try_to_wake_up+0x25e/0x2d0
6) 4664 16 wake_up_process+0x15/0x20
7) 4648 16 wake_up_worker+0x24/0x30
8) 4632 16 insert_work+0x6f/0x80
9) 4616 96 __queue_work+0xf9/0x3f0
10) 4520 16 queue_work_on+0x25/0x40
11) 4504 16 queue_work+0x1f/0x30
12) 4488 16 queue_delayed_work+0x2d/0x40
13) 4472 32 blk_run_queue_async+0x41/0x60
14) 4440 64 queue_unplugged+0x8e/0xc0
15) 4376 112 blk_flush_plug_list+0x1f5/0x240
16) 4264 176 schedule+0x4c3/0x8b0
17) 4088 128 schedule_timeout+0x1a5/0x280
18) 3960 160 wait_for_common+0xdb/0x180
19) 3800 16 wait_for_completion+0x1d/0x20
20) 3784 48 xfs_buf_iowait+0x30/0xc0
21) 3736 32 _xfs_buf_read+0x60/0x70
22) 3704 48 xfs_buf_read+0xa2/0x100
23) 3656 80 xfs_trans_read_buf+0x1ef/0x430
24) 3576 96 xfs_btree_read_buf_block+0x5e/0xd0
25) 3480 96 xfs_btree_lookup_get_block+0x83/0xf0
26) 3384 176 xfs_btree_lookup+0xd7/0x490
27) 3208 16 xfs_alloc_lookup_eq+0x19/0x20
28) 3192 112 xfs_alloc_fixup_trees+0x2b5/0x350
29) 3080 224 xfs_alloc_ag_vextent_near+0x631/0xb60
30) 2856 32 xfs_alloc_ag_vextent+0xd5/0x100
31) 2824 96 xfs_alloc_vextent+0x2a4/0x5f0
32) 2728 256 xfs_bmap_btalloc+0x257/0x720
33) 2472 16 xfs_bmap_alloc+0x21/0x40
34) 2456 432 xfs_bmapi+0x9b7/0x1150
35) 2024 192 xfs_iomap_write_allocate+0x17d/0x350
36) 1832 144 xfs_map_blocks+0x1e2/0x270
37) 1688 208 xfs_vm_writepage+0x19f/0x500
38) 1480 32 __writepage+0x17/0x40
39) 1448 304 write_cache_pages+0x21d/0x4d0
40) 1144 96 generic_writepages+0x51/0x80
41) 1048 48 xfs_vm_writepages+0x5d/0x80
42) 1000 16 do_writepages+0x21/0x40
43) 984 96 writeback_single_inode+0x10e/0x270
44) 888 96 writeback_sb_inodes+0xdb/0x1b0
45) 792 208 wb_writeback+0x1bf/0x420
46) 584 160 wb_do_writeback+0x9f/0x270
47) 424 144 bdi_writeback_thread+0xaa/0x270
48) 280 96 kthread+0x96/0xa0
49) 184 184 kernel_thread_helper+0x4/0x10
So from ->writepage, there is about 3.5k of stack usage here. 2.5k
of that is in XFS, and the worst I've seen is around 4k before
getting to the IO subsystem, which in the worst case consumed
another 2.5k of stack. IOWs, I've seen stack usage from .writepage
down to IO take over 6k of stack space on x86_64....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
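As a rough sketch (plain Python, not kernel code, assuming the `index) depth size function` layout the ftrace stack tracer prints above), a trace like that can be parsed and sanity-checked: each frame's Depth should equal the next frame's Depth plus its own Size, and subtracting the depth below a given frame from the total depth reproduces figures like the "about 3.5k from ->writepage" quoted here.

```python
import re

# A few frames copied from the trace above.
SAMPLE = """\
 37)     1688     208   xfs_vm_writepage+0x19f/0x500
 38)     1480      32   __writepage+0x17/0x40
 39)     1448     304   write_cache_pages+0x21d/0x4d0
"""

FRAME_RE = re.compile(r"\s*(\d+)\)\s+(\d+)\s+(\d+)\s+(\S+)")

def parse_frames(text):
    """Return (index, depth, size, function) tuples for each frame line."""
    return [(int(i), int(d), int(s), f)
            for i, d, s, f in (m.groups()
                               for m in map(FRAME_RE.match, text.splitlines())
                               if m)]

def depth_consistent(frames):
    """depth[i] should equal depth[i+1] + size[i] for consecutive frames."""
    return all(d == frames[k + 1][1] + s
               for k, (_, d, s, _) in enumerate(frames[:-1]))

frames = parse_frames(SAMPLE)
print(depth_consistent(frames))   # True: 1688 == 1480 + 208, etc.
# Stack used from __writepage up to the deepest frame of the full trace
# (frame 0 had depth 5000): total minus the depth below __writepage.
print(5000 - frames[2][1])        # 3552 bytes, i.e. the "about 3.5k" above
```

This only checks the arithmetic of the dump, of course; the trace itself has to come from a kernel built with the stack tracer.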
* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
2011-07-14 13:17 ` Mel Gorman
@ 2011-07-15 3:12 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-15 3:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 02:17:45PM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2011 at 09:52:21PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> > > On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > > > It is preferable that no dirty pages are dispatched for cleaning from
> > > > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > > > writing pages.
> > > > >
> > > > > However, page reclaim does have a requirement that pages be freed
> > > > > in a particular zone. If it is failing to make sufficient progress
> > > > > (reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
> > > > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > > > considered to be the point where kswapd is getting into trouble
> > > > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > > > pages for writing.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > >
> > > > Seems reasonable, but btrfs still will ignore this writeback from
> > > > kswapd, and it doesn't fall over.
> > >
> > > At least there are no reports of it falling over :)
> >
> > However you want to spin it.
>
> I regret that it is coming across as spin.
Shit, sorry, I didn't mean it that way. I forgot to add the smiley
at the end of that comment. It was meant in jest and not to be
derogatory - I do understand your concerns.
> > > > Given that data point, I'd like to
> > > > see the results when you stop kswapd from doing writeback altogether
> > > > as well.
> > > >
> > >
> > > The results for this test will be identical because the ftrace results
> > > show that kswapd is already writing 0 filesystem pages.
> >
> > You mean these numbers:
> >
> > Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
> >
> > Which shows that kswapd, under this workload, has been improved to
> > the point that it doesn't need to do IO. Yes, you've addressed the
> > one problematic workload, but the numbers do not provide the answers
> > to the fundamental question that have been raised during
> > discussions. i.e. do we even need IO at all from reclaim?
>
> I don't know and at best will only be able to test with a single
> disk which is why I wanted to separate this series from a complete
> preventing of kswapd writing pages. I may be able to get access to
> a machine with more disks but it'll take time.
That, to me, seems like a major problem, and explains why swapping
was affecting your results - you've got your test filesystem and
your swap partition on the same spindle. In the server admin world,
that's the first thing anyone concerned with performance avoids and
as such I tend to avoid doing that, too.
The lack of spindles/bandwidth used in testing the mm code is also
potentially another reason why XFS tends to show up mm problems.
That is, most testing and production use of XFS occurs on disk
subsystems with much more bandwidth than a single spindle, and hence the
effects of bad IO show up much more obviously than for a single
spindle.
> > > Where it makes a difference is when the system is under enough
> > > pressure that it is failing to reclaim any memory and is in danger
> > > of prematurely triggering the OOM killer. Andrea outlined some of
> > > the concerns before at http://lkml.org/lkml/2010/6/15/246
> >
> > So put the system under more pressure such that with this patch
> > series memory reclaim still writes from kswapd. Can you even get it
> > to that stage, and if you can, does the system OOM more or less if
> > you don't do file IO from reclaim?
>
> I can set up such a test; it'll be at least next week before I
> configure it and get it queued. It'll probably take a few
> days to run then because more iterations will be required to pinpoint
> where the OOM threshold is. I know from the past that pushing a
> system near OOM causes a non-deterministic number of triggers that
> depend heavily on what was killed, so the only real choice is to start
> light and increase the load until boom, which is time-consuming.
>
> Even then, the test will be inconclusive because it'll be just one
> or two machines that I'll have to test on.
Which is why I have a bunch of test VMs with different
CPU/RAM/platform configs. I regularly use 1p/1GB x86-64, 1p/2GB
i686 (to stress highmem), 2p/2GB, 8p/4GB and 8p/16GB x86-64 VMs. I
have a bunch of different disk images for the VMs to work off,
located on storage from shared single SATA spindles to a 16TB volume
to a short-stroked, 1GB/s, 5kiops, 12 disk dm RAID-0 setup.
I mix and match the VMs with the disk images all the time - this is
one of the benefits of using a virtualised test environment. One
slightly beefy piece of hardware that costs $10k can be used to test
many, many different configurations. That's why I complain about
corner cases all the time ;)
> There will be important
> corner cases that I won't be able to test for. For example;
>
> o small lowest zone that is critical for operation of some reason and
> the pages must be cleaned from there even though there is a large
> amount of memory overall
That's the i686 highmem case, using a large amount of memory (e.g.
4GB or more) to make sure that the highmem zone is much larger than
the lowmem zone. inode caching uses low memory, so
directory-intensive operations on large sets of files (e.g. 10 million)
tend to stress low memory availability.
> o small highest zone causing high kswapd usage as it fails to balance
> continually due to pages being dirtied constantly and the window
> between when flushers clean the page and kswapd reclaim the page
> being too big. I might be able to simulate this one but bugs of
> this nature tend to be workload specific and affect some machines
> worse than others
And that is also testable with i686 highmem, but simply use smaller
amounts of ram (say 1.5GB). Use page cache pressure to fill and
dirty highmem, and inode cache pressure to fill lowmem.
Guess what one of my ad hoc tests for XFS shrinker balancing is. :)
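A hedged sketch of how the inode-cache half of that setup might be driven (plain Python; the directory layout, file names and counts here are illustrative, scaled far down from the 10 million files mentioned): the file data generates page-cache (highmem) pressure, while walking the namespace pulls dentries and inodes into low memory.

```python
import os
import tempfile

def create_file_set(base_dir, count, fanout=100):
    """Create `count` small files spread across subdirectories so that
    walking them later populates the dentry/inode caches (low memory on
    i686), while their data pages land in the page cache (highmem)."""
    for i in range(count):
        subdir = os.path.join(base_dir, "d%03d" % (i % fanout))
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, "f%07d" % i), "w") as f:
            f.write("x")          # one dirty page per file

def walk_file_set(base_dir):
    """Stat every file to pull its inode back into cache."""
    n = 0
    for root, _, files in os.walk(base_dir):
        for name in files:
            os.stat(os.path.join(root, name))
            n += 1
    return n

if __name__ == "__main__":
    base = tempfile.mkdtemp(prefix="inode-pressure-")
    create_file_set(base, 1000)   # scale `count` up to stress a real box
    print(walk_file_set(base))    # 1000
```

To actually reproduce the balancing problem being discussed, `count` has to be scaled until the file set dwarfs lowmem, and the dirty data has to be large relative to highmem.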
> o Machines with many nodes and dirty pages spread semi-randomly
> on all nodes. If the flusher thread is not cleaning pages from
> a particular node that is under memory pressure due to affinity,
> processes will stall for long periods of time until the relevant
> inodes expire and get cleaned. This will be particularly
> problematic if zone_reclaim is enabled
And you can create large node-count virtual machines via the kvm
-numa option. I haven't been doing this as yet because getting stuff
working well on single node SMP needs to be done first.
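For reference, such a guest might be configured with an invocation along these lines (a hypothetical example with illustrative values; the `-numa node,mem=,cpus=` syntax is from qemu of that era, so check `qemu-kvm -help` on the actual version):

```
# 4-node, 8-CPU, 16GB guest for exercising per-node reclaim
qemu-kvm -smp 8 -m 16384 \
    -numa node,cpus=0-1,mem=4096 \
    -numa node,cpus=2-3,mem=4096 \
    -numa node,cpus=4-5,mem=4096 \
    -numa node,cpus=6-7,mem=4096 \
    -drive file=test.img,if=virtio
```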
So, like you, I really only have one or two test machines available
locally, but I've been creative in working around that
limitation.... :/
> > It's that next step that I'm asking you to test now. What form
> > potential changes take or when they are released is irrelevant to me
> > at this point, because we still haven't determined if the
> > fundamental concept is completely sound or not. If the concept is
> > sound I'm quite happy to wait until the implementation is fully
> > baked before it gets rolled out....
>
> I'll setup a suitable test next week then.
Sounds great. Thanks Mel.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
@ 2011-07-15 3:12 ` Dave Chinner
0 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-15 3:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim
On Thu, Jul 14, 2011 at 02:17:45PM +0100, Mel Gorman wrote:
> On Thu, Jul 14, 2011 at 09:52:21PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 07:29:47AM +0100, Mel Gorman wrote:
> > > On Thu, Jul 14, 2011 at 09:37:43AM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 13, 2011 at 03:31:24PM +0100, Mel Gorman wrote:
> > > > > It is preferable that no dirty pages are dispatched for cleaning from
> > > > > the page reclaim path. At normal priorities, this patch prevents kswapd
> > > > > writing pages.
> > > > >
> > > > > However, page reclaim does have a requirement that pages be freed
> > > > > in a particular zone. If it is failing to make sufficient progress
> > > > > (reclaiming < SWAP_CLUSTER_MAX at any priority priority), the priority
> > > > > is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> > > > > considered to tbe the point where kswapd is getting into trouble
> > > > > reclaiming pages. If this priority is reached, kswapd will dispatch
> > > > > pages for writing.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > >
> > > > Seems reasonable, but btrfs still will ignore this writeback from
> > > > kswapd, and it doesn't fall over.
> > >
> > > At least there are no reports of it falling over :)
> >
> > However you want to spin it.
>
> I regret that it is coming across as spin.
Shit, sorry, I didn't mean it that way. I forgot to add the smiley
at the end of that comment. It was meant in jest and not to be
derogatory - I do understand your concerns.
> > > > Given that data point, I'd like to
> > > > see the results when you stop kswapd from doing writeback altogether
> > > > as well.
> > > >
> > >
> > > The results for this test will be identical because the ftrace results
> > > show that kswapd is already writing 0 filesystem pages.
> >
> > You mean these numbers:
> >
> > Kswapd reclaim write file async I/O 4483 4286 0 1 0 0
> >
> > Which shows that kswapd, under this workload has been improved to
> > the point that it doesn't need to do IO. Yes, you've addressed the
> > one problematic workload, but the numbers do not provide the answers
> > to the fundamental question that have been raised during
> > discussions. i.e. do we even need IO at all from reclaim?
>
> I don't know and at best will only be able to test with a single
> disk which is why I wanted to separate this series from a complete
> preventing of kswapd writing pages. I may be able to get access to
> a machine with more disks but it'll take time.
That, to me, seems like a major problem, and explains why swapping
was affecting your results - you've got your test filesystem and
your swap partition on the same spindle. In the server admin world,
that's the first thing anyone concerned with performance avoids and
as such I tend to avoid doing that, too.
The lack of spindles/bandwidth used in testing the mm code is also
potentially another reason why XFS tends to show up mm problems.
That is, most testing and production use of XFS occurs on disk
subsystems with much more bandwidth than a single spindle, and hence the
effects of bad IO show up much more obviously than for a single
spindle.
> > > Where it makes a difference is when the system is under enough
> > > pressure that it is failing to reclaim any memory and is in danger
> > > of prematurely triggering the OOM killer. Andrea outlined some of
> > > the concerns before at http://lkml.org/lkml/2010/6/15/246
> >
> > So put the system under more pressure such that with this patch
> > series memory reclaim still writes from kswapd. Can you even get it
> > to that stage, and if you can, does the system OOM more or less if
> > you don't do file IO from reclaim?
>
> I can set up such a test; it'll be at least next week before I
> configure such a test and get it queued. It'll probably take a few
> days to run then because more iterations will be required to pinpoint
> where the OOM threshold is. I know from the past that pushing a
> system near OOM causes a non-deterministic number of triggers that
> depend heavily on what was killed, so the only real choice is to start
> light and increase the load until boom, which is time consuming.
>
> Even then, the test will be inconclusive because it'll be just one
> or two machines that I'll have to test on.
Which is why I have a bunch of test VMs with different
CPU/RAM/platform configs. I regularly use 1p/1GB x86-64, 1p/2GB
i686 (to stress highmem), 2p/2GB, 8p/4GB and 8p/16GB x86-64 VMs. I
have a bunch of different disk images for the VMs to work off,
located on storage from shared single SATA spindles to a 16TB volume
to a short-stroked, 1GB/s, 5kiops, 12 disk dm RAID-0 setup.
I mix and match the VMs with the disk images all the time - this is
one of the benefits of using a virtualised test environment. One
slightly beefy piece of hardware that costs $10k can be used to test
many, many different configurations. That's why I complain about
corner cases all the time ;)
> There will be important
> corner cases that I won't be able to test for. For example;
>
> o small lowest zone that is critical for operation of some reason and
> the pages must be cleaned from there even though there is a large
> amount of memory overall
That's the i686 highmem case, using a large amount of memory (e.g.
4GB or more) to make sure that the highmem zone is much larger than
the lowmem zone. inode caching uses low memory, so directory
intensive operations on large sets of files (e.g. 10 million)
tend to stress low memory availability.
> o small highest zone causing high kswapd usage as it fails to balance
> continually due to pages being dirtied constantly and the window
> between when the flushers clean the page and kswapd reclaims the page
> being too big. I might be able to simulate this one but bugs of
> this nature tend to be workload specific and affect some machines
> worse than others
And that is also testable with i686 highmem, but simply use smaller
amounts of ram (say 1.5GB). Use page cache pressure to fill and
dirty highmem, and inode cache pressure to fill lowmem.
Guess what one of my ad hoc tests for XFS shrinker balancing is. :)
> o Machines with many nodes and dirty pages spread semi-randomly
> on all nodes. If the flusher thread is not cleaning pages from
> a particular node that is under memory pressure due to affinity,
> processes will stall for long periods of time until the relevant
> inodes expire and get cleaned. This will be particularly
> problematic if zone_reclaim is enabled
And you can create large node-count virtual machines via the kvm
-numa option. I haven't been doing this as yet because getting stuff
working well on single node SMP needs to be done first.
So, like you, I really only have one or two tests machine available
locally, but I've been creative in working around that
limitation.... :/
> > It's that next step that I'm asking you to test now. What form
> > potential changes take or when they are released is irrelevant to me
> > at this point, because we still haven't determined if the
> > fundamental concept is completely sound or not. If the concept is
> > sound I'm quite happy to wait until the implementation is fully
> > baked before it gets rolled out....
>
> I'll setup a suitable test next week then.
Sounds great. Thanks Mel.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-15 2:22 ` Dave Chinner
(?)
@ 2011-07-18 2:22 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-18 2:22 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 14 Jul 2011 00:46:43 -0400
> > Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > + /*
> > > > > + * Only kswapd can writeback filesystem pages to
> > > > > + * avoid risk of stack overflow
> > > > > + */
> > > > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > + goto keep_locked;
> > > > > + }
> > > > > +
> > > >
> > > >
> > > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > >
> > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > for the typical non-overwrite workloads, and none has fallen apart.
> > >
> > > In fact there's no way we can enable them as the memcg calling contexts
> > > tend to have massive stack usage.
> > >
> >
> > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
>
> Here's an example writeback stack trace. Notice how deep it is from
> the __writepage() call?
....
>
> So from ->writepage, there is about 3.5k of stack usage here. 2.5k
> of that is in XFS, and the worst I've seen is around 4k before
> getting to the IO subsystem, which in the worst case I've seen
> consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> down to IO take over 6k of stack space on x86_64....
BTW, here's a stack frame that indicates swap IO:
dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (46 entries)
----- ---- --------
0) 5080 40 zone_statistics+0xad/0xc0
1) 5040 272 get_page_from_freelist+0x2ad/0x7e0
2) 4768 288 __alloc_pages_nodemask+0x133/0x7b0
3) 4480 48 kmem_getpages+0x62/0x160
4) 4432 112 cache_grow+0x2d1/0x300
5) 4320 80 cache_alloc_refill+0x219/0x260
6) 4240 64 kmem_cache_alloc+0x182/0x190
7) 4176 16 mempool_alloc_slab+0x15/0x20
8) 4160 144 mempool_alloc+0x63/0x140
9) 4016 16 scsi_sg_alloc+0x4c/0x60
10) 4000 112 __sg_alloc_table+0x66/0x140
11) 3888 32 scsi_init_sgtable+0x33/0x90
12) 3856 48 scsi_init_io+0x31/0xc0
13) 3808 32 scsi_setup_fs_cmnd+0x79/0xe0
14) 3776 112 sd_prep_fn+0x150/0xa90
15) 3664 64 blk_peek_request+0xc7/0x230
16) 3600 96 scsi_request_fn+0x68/0x500
17) 3504 16 __blk_run_queue+0x1b/0x20
18) 3488 96 __make_request+0x2cb/0x310
19) 3392 192 generic_make_request+0x26d/0x500
20) 3200 96 submit_bio+0x64/0xe0
21) 3104 48 swap_writepage+0x83/0xd0
22) 3056 112 pageout+0x122/0x2f0
23) 2944 192 shrink_page_list+0x458/0x5f0
24) 2752 192 shrink_inactive_list+0x1ec/0x410
25) 2560 224 shrink_zone+0x468/0x500
26) 2336 144 do_try_to_free_pages+0x2b7/0x3f0
27) 2192 176 try_to_free_pages+0xa4/0x120
28) 2016 288 __alloc_pages_nodemask+0x43f/0x7b0
29) 1728 48 kmem_getpages+0x62/0x160
30) 1680 128 fallback_alloc+0x192/0x240
31) 1552 96 ____cache_alloc_node+0x9a/0x170
32) 1456 16 __kmalloc+0x17d/0x200
33) 1440 128 kmem_alloc+0x77/0xf0
34) 1312 128 xfs_log_commit_cil+0x95/0x3d0
35) 1184 96 _xfs_trans_commit+0x1e9/0x2a0
36) 1088 208 xfs_create+0x57a/0x640
37) 880 96 xfs_vn_mknod+0xa1/0x1b0
38) 784 16 xfs_vn_create+0x10/0x20
39) 768 64 vfs_create+0xb1/0xe0
40) 704 96 do_last+0x5f5/0x770
41) 608 144 path_openat+0xd5/0x400
42) 464 224 do_filp_open+0x49/0xa0
43) 240 96 do_sys_open+0x107/0x1e0
44) 144 16 sys_open+0x20/0x30
45) 128 128 system_call_fastpath+0x16/0x1b
That's pretty damn bad. From kmem_alloc to the top of the stack is
more than 3.5k through the direct reclaim swap IO path. That, to me,
kind of indicates that even doing swap IO on dirty anonymous pages
from direct reclaim risks overflowing the 8k stack on x86_64....
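The "more than 3.5k" figure can be read straight off the Depth column: each entry reports the bytes of stack in use from that frame to the top, so the consumption between two frames is the difference of their depths. A trivial sketch of that arithmetic, with the numbers copied from the trace above (the helper name is ours, not a kernel API):

```c
#include <assert.h>

/*
 * In the stack_trace output above, "Depth" is the number of bytes of
 * stack in use from that frame to the top of the stack. The stack
 * consumed between an inner frame and the top is therefore the
 * difference of the two depths.
 */
static int stack_bytes_to_top(int top_depth, int frame_depth)
{
	return top_depth - frame_depth;
}
```

With the values above, stack_bytes_to_top(5080, 1440) gives 3640 bytes from kmem_alloc (entry 33) to the top, i.e. just over 3.5KiB (3584 bytes).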
Umm, hold on a second, WTF is my standard create-lots-of-zero-length
inodes-in-parallel doing swapping? Oh, shit, it's also running about
50% slower (50-60k files/s instead of 110-120k files/s)....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
2011-07-18 2:22 ` Dave Chinner
(?)
@ 2011-07-18 3:06 ` Dave Chinner
-1 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-18 3:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Mon, Jul 18, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 00:46:43 -0400
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > + /*
> > > > > > + * Only kswapd can writeback filesystem pages to
> > > > > > + * avoid risk of stack overflow
> > > > > > + */
> > > > > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > > + goto keep_locked;
> > > > > > + }
> > > > > > +
> > > > >
> > > > >
> > > > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > > >
> > > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > > for the typical non-overwrite workloads, and none has fallen apart.
> > > >
> > > > In fact there's no way we can enable them as the memcg calling contexts
> > > > tend to have massive stack usage.
> > > >
> > >
> > > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
> >
> > Here's an example writeback stack trace. Notice how deep it is from
> > the __writepage() call?
> ....
> >
> > So from ->writepage, there is about 3.5k of stack usage here. 2.5k
> > of that is in XFS, and the worst I've seen is around 4k before
> > getting to the IO subsystem, which in the worst case I've seen
> > consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> > down to IO take over 6k of stack space on x86_64....
>
> BTW, here's a stack frame that indicates swap IO:
>
> dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
> Depth Size Location (46 entries)
> ----- ---- --------
> 0) 5080 40 zone_statistics+0xad/0xc0
> 1) 5040 272 get_page_from_freelist+0x2ad/0x7e0
> 2) 4768 288 __alloc_pages_nodemask+0x133/0x7b0
> 3) 4480 48 kmem_getpages+0x62/0x160
> 4) 4432 112 cache_grow+0x2d1/0x300
> 5) 4320 80 cache_alloc_refill+0x219/0x260
> 6) 4240 64 kmem_cache_alloc+0x182/0x190
> 7) 4176 16 mempool_alloc_slab+0x15/0x20
> 8) 4160 144 mempool_alloc+0x63/0x140
> 9) 4016 16 scsi_sg_alloc+0x4c/0x60
> 10) 4000 112 __sg_alloc_table+0x66/0x140
> 11) 3888 32 scsi_init_sgtable+0x33/0x90
> 12) 3856 48 scsi_init_io+0x31/0xc0
> 13) 3808 32 scsi_setup_fs_cmnd+0x79/0xe0
> 14) 3776 112 sd_prep_fn+0x150/0xa90
> 15) 3664 64 blk_peek_request+0xc7/0x230
> 16) 3600 96 scsi_request_fn+0x68/0x500
> 17) 3504 16 __blk_run_queue+0x1b/0x20
> 18) 3488 96 __make_request+0x2cb/0x310
> 19) 3392 192 generic_make_request+0x26d/0x500
> 20) 3200 96 submit_bio+0x64/0xe0
> 21) 3104 48 swap_writepage+0x83/0xd0
> 22) 3056 112 pageout+0x122/0x2f0
> 23) 2944 192 shrink_page_list+0x458/0x5f0
> 24) 2752 192 shrink_inactive_list+0x1ec/0x410
> 25) 2560 224 shrink_zone+0x468/0x500
> 26) 2336 144 do_try_to_free_pages+0x2b7/0x3f0
> 27) 2192 176 try_to_free_pages+0xa4/0x120
> 28) 2016 288 __alloc_pages_nodemask+0x43f/0x7b0
> 29) 1728 48 kmem_getpages+0x62/0x160
> 30) 1680 128 fallback_alloc+0x192/0x240
> 31) 1552 96 ____cache_alloc_node+0x9a/0x170
> 32) 1456 16 __kmalloc+0x17d/0x200
> 33) 1440 128 kmem_alloc+0x77/0xf0
> 34) 1312 128 xfs_log_commit_cil+0x95/0x3d0
> 35) 1184 96 _xfs_trans_commit+0x1e9/0x2a0
> 36) 1088 208 xfs_create+0x57a/0x640
> 37) 880 96 xfs_vn_mknod+0xa1/0x1b0
> 38) 784 16 xfs_vn_create+0x10/0x20
> 39) 768 64 vfs_create+0xb1/0xe0
> 40) 704 96 do_last+0x5f5/0x770
> 41) 608 144 path_openat+0xd5/0x400
> 42) 464 224 do_filp_open+0x49/0xa0
> 43) 240 96 do_sys_open+0x107/0x1e0
> 44) 144 16 sys_open+0x20/0x30
> 45) 128 128 system_call_fastpath+0x16/0x1b
>
>
> That's pretty damn bad. From kmem_alloc to the top of the stack is
> more than 3.5k through the direct reclaim swap IO path. That, to me,
> kind of indicates that even doing swap IO on dirty anonymous pages
> from direct reclaim risks overflowing the 8k stack on x86_64....
>
> Umm, hold on a second, WTF is my standard create-lots-of-zero-length
> inodes-in-parallel doing swapping? Oh, shit, it's also running about
> 50% slower (50-60k files/s instead of 110-120k files/s)....
It's the memory demand caused by the stack tracer causing the
swapping, and the slowdown is just the overhead of the tracer. 2.6.38
doesn't swap very much at all, 2.6.39 swaps a bit more and
3.0-rc7 is about the same....
IOWs the act of measuring stack usage causes the worst case stack
usage for that workload on 2.6.39 and 3.0-rc7.
Cheers,
Dave
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
@ 2011-07-18 3:06 ` Dave Chinner
0 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-18 3:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Rik van Riel, Jan Kara, LKML, XFS, Christoph Hellwig, Linux-MM,
Mel Gorman, Wu Fengguang, Johannes Weiner, Minchan Kim
On Mon, Jul 18, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 00:46:43 -0400
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > + /*
> > > > > > + * Only kswapd can writeback filesystem pages to
> > > > > > + * avoid risk of stack overflow
> > > > > > + */
> > > > > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > > + goto keep_locked;
> > > > > > + }
> > > > > > +
> > > > >
> > > > >
> > > > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > > >
> > > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > > for the typical non-overwrite workloads, and none has fallen apart.
> > > >
> > > > In fact there's no way we can enable them as the memcg calling contexts
> > > > tend to have massive stack usage.
> > > >
> > >
> > > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
> >
> > Here's an example writeback stack trace. Notice how deep it is from
> > the __writepage() call?
> ....
> >
> > So from ->writepage, there is about 3.5k of stack usage here. 2.5k
> > of that is in XFS, and the worst I've seen is around 4k before
> > getting to the IO subsystem, which in the worst case I've seen
> > consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> > down to IO take over 6k of stack space on x86_64....
>
> BTW, here's a stack frame that indicates swap IO:
>
> dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
> Depth Size Location (46 entries)
> ----- ---- --------
> 0) 5080 40 zone_statistics+0xad/0xc0
> 1) 5040 272 get_page_from_freelist+0x2ad/0x7e0
> 2) 4768 288 __alloc_pages_nodemask+0x133/0x7b0
> 3) 4480 48 kmem_getpages+0x62/0x160
> 4) 4432 112 cache_grow+0x2d1/0x300
> 5) 4320 80 cache_alloc_refill+0x219/0x260
> 6) 4240 64 kmem_cache_alloc+0x182/0x190
> 7) 4176 16 mempool_alloc_slab+0x15/0x20
> 8) 4160 144 mempool_alloc+0x63/0x140
> 9) 4016 16 scsi_sg_alloc+0x4c/0x60
> 10) 4000 112 __sg_alloc_table+0x66/0x140
> 11) 3888 32 scsi_init_sgtable+0x33/0x90
> 12) 3856 48 scsi_init_io+0x31/0xc0
> 13) 3808 32 scsi_setup_fs_cmnd+0x79/0xe0
> 14) 3776 112 sd_prep_fn+0x150/0xa90
> 15) 3664 64 blk_peek_request+0xc7/0x230
> 16) 3600 96 scsi_request_fn+0x68/0x500
> 17) 3504 16 __blk_run_queue+0x1b/0x20
> 18) 3488 96 __make_request+0x2cb/0x310
> 19) 3392 192 generic_make_request+0x26d/0x500
> 20) 3200 96 submit_bio+0x64/0xe0
> 21) 3104 48 swap_writepage+0x83/0xd0
> 22) 3056 112 pageout+0x122/0x2f0
> 23) 2944 192 shrink_page_list+0x458/0x5f0
> 24) 2752 192 shrink_inactive_list+0x1ec/0x410
> 25) 2560 224 shrink_zone+0x468/0x500
> 26) 2336 144 do_try_to_free_pages+0x2b7/0x3f0
> 27) 2192 176 try_to_free_pages+0xa4/0x120
> 28) 2016 288 __alloc_pages_nodemask+0x43f/0x7b0
> 29) 1728 48 kmem_getpages+0x62/0x160
> 30) 1680 128 fallback_alloc+0x192/0x240
> 31) 1552 96 ____cache_alloc_node+0x9a/0x170
> 32) 1456 16 __kmalloc+0x17d/0x200
> 33) 1440 128 kmem_alloc+0x77/0xf0
> 34) 1312 128 xfs_log_commit_cil+0x95/0x3d0
> 35) 1184 96 _xfs_trans_commit+0x1e9/0x2a0
> 36) 1088 208 xfs_create+0x57a/0x640
> 37) 880 96 xfs_vn_mknod+0xa1/0x1b0
> 38) 784 16 xfs_vn_create+0x10/0x20
> 39) 768 64 vfs_create+0xb1/0xe0
> 40) 704 96 do_last+0x5f5/0x770
> 41) 608 144 path_openat+0xd5/0x400
> 42) 464 224 do_filp_open+0x49/0xa0
> 43) 240 96 do_sys_open+0x107/0x1e0
> 44) 144 16 sys_open+0x20/0x30
> 45) 128 128 system_call_fastpath+0x16/0x1b
>
>
> That's pretty damn bad. From kmem_alloc to the top of the stack is
> more than 3.5k through the direct reclaim swap IO path. That, to me,
> kind of indicates that even doing swap IO on dirty anonymous pages
> from direct reclaim risks overflowing the 8k stack on x86_64....
>
> Umm, hold on a second, WTF is my standard create-lots-of-zero-length
> inodes-in-parallel doing swapping? Oh, shit, it's also running about
> 50% slower (50-60k files/s instead of 110-120l files/s)....
It's the memory demand caused by the stack tracer causing the
swapping, and the slowdown is just the overhead of tracer. 2.6.38
doesn't swap very much at all, 2.6.39 swaps a bit more more and
3.0-rc7 is about the same....
IOWs the act of measuring stack usage causes the worst case stack
usage for that workload on 2.6.39 and 3.0-rc7.
Cheers,
Dave
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 114+ messages in thread
* Re: [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim
@ 2011-07-18 3:06 ` Dave Chinner
0 siblings, 0 replies; 114+ messages in thread
From: Dave Chinner @ 2011-07-18 3:06 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Christoph Hellwig, Mel Gorman, Linux-MM, LKML, XFS,
Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
Minchan Kim
On Mon, Jul 18, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> On Fri, Jul 15, 2011 at 12:22:26PM +1000, Dave Chinner wrote:
> > On Thu, Jul 14, 2011 at 01:46:34PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 14 Jul 2011 00:46:43 -0400
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > On Thu, Jul 14, 2011 at 10:38:01AM +0900, KAMEZAWA Hiroyuki wrote:
> > > > > > + /*
> > > > > > + * Only kswapd can writeback filesystem pages to
> > > > > > + * avoid risk of stack overflow
> > > > > > + */
> > > > > > + if (page_is_file_cache(page) && !current_is_kswapd()) {
> > > > > > + inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > > > + goto keep_locked;
> > > > > > + }
> > > > > > +
> > > > >
> > > > >
> > > > > This will cause tons of memcg OOM kill because we have no help of kswapd (now).
> > > >
> > > > XFS and btrfs already disable writeback from memcg context, as does ext4
> > > > for the typical non-overwrite workloads, and none has fallen apart.
> > > >
> > > > In fact there's no way we can enable them as the memcg calling contexts
> > > > tend to have massive stack usage.
> > > >
> > >
> > > Hmm, XFS/btrfs adds pages to radix-tree in deep stack ?
> >
> > Here's an example writeback stack trace. Notice how deep it is from
> > the __writepage() call?
> ....
> >
> > So from ->writepage, there is about 3.5k of stack usage here. 2.5k
> > of that is in XFS, and the worst I've seen is around 4k before
> > getting to the IO subsystem, which in the worst case I've seen
> > consumed 2.5k of stack. IOWs, I've seen stack usage from .writepage
> > down to IO take over 6k of stack space on x86_64....
>
> BTW, here's a stack frame that indicates swap IO:
>
> dave@test-4:~$ cat /sys/kernel/debug/tracing/stack_trace
> Depth Size Location (46 entries)
> ----- ---- --------
> 0) 5080 40 zone_statistics+0xad/0xc0
> 1) 5040 272 get_page_from_freelist+0x2ad/0x7e0
> 2) 4768 288 __alloc_pages_nodemask+0x133/0x7b0
> 3) 4480 48 kmem_getpages+0x62/0x160
> 4) 4432 112 cache_grow+0x2d1/0x300
> 5) 4320 80 cache_alloc_refill+0x219/0x260
> 6) 4240 64 kmem_cache_alloc+0x182/0x190
> 7) 4176 16 mempool_alloc_slab+0x15/0x20
> 8) 4160 144 mempool_alloc+0x63/0x140
> 9) 4016 16 scsi_sg_alloc+0x4c/0x60
> 10) 4000 112 __sg_alloc_table+0x66/0x140
> 11) 3888 32 scsi_init_sgtable+0x33/0x90
> 12) 3856 48 scsi_init_io+0x31/0xc0
> 13) 3808 32 scsi_setup_fs_cmnd+0x79/0xe0
> 14) 3776 112 sd_prep_fn+0x150/0xa90
> 15) 3664 64 blk_peek_request+0xc7/0x230
> 16) 3600 96 scsi_request_fn+0x68/0x500
> 17) 3504 16 __blk_run_queue+0x1b/0x20
> 18) 3488 96 __make_request+0x2cb/0x310
> 19) 3392 192 generic_make_request+0x26d/0x500
> 20) 3200 96 submit_bio+0x64/0xe0
> 21) 3104 48 swap_writepage+0x83/0xd0
> 22) 3056 112 pageout+0x122/0x2f0
> 23) 2944 192 shrink_page_list+0x458/0x5f0
> 24) 2752 192 shrink_inactive_list+0x1ec/0x410
> 25) 2560 224 shrink_zone+0x468/0x500
> 26) 2336 144 do_try_to_free_pages+0x2b7/0x3f0
> 27) 2192 176 try_to_free_pages+0xa4/0x120
> 28) 2016 288 __alloc_pages_nodemask+0x43f/0x7b0
> 29) 1728 48 kmem_getpages+0x62/0x160
> 30) 1680 128 fallback_alloc+0x192/0x240
> 31) 1552 96 ____cache_alloc_node+0x9a/0x170
> 32) 1456 16 __kmalloc+0x17d/0x200
> 33) 1440 128 kmem_alloc+0x77/0xf0
> 34) 1312 128 xfs_log_commit_cil+0x95/0x3d0
> 35) 1184 96 _xfs_trans_commit+0x1e9/0x2a0
> 36) 1088 208 xfs_create+0x57a/0x640
> 37) 880 96 xfs_vn_mknod+0xa1/0x1b0
> 38) 784 16 xfs_vn_create+0x10/0x20
> 39) 768 64 vfs_create+0xb1/0xe0
> 40) 704 96 do_last+0x5f5/0x770
> 41) 608 144 path_openat+0xd5/0x400
> 42) 464 224 do_filp_open+0x49/0xa0
> 43) 240 96 do_sys_open+0x107/0x1e0
> 44) 144 16 sys_open+0x20/0x30
> 45) 128 128 system_call_fastpath+0x16/0x1b
>
>
> That's pretty damn bad. From kmem_alloc to the top of the stack is
> more than 3.5k through the direct reclaim swap IO path. That, to me,
> kind of indicates that even doing swap IO on dirty anonymous pages
> from direct reclaim risks overflowing the 8k stack on x86_64....
>
> Umm, hold on a second, WTF is my standard create-lots-of-zero-length
> inodes-in-parallel doing swapping? Oh, shit, it's also running about
> 50% slower (50-60k files/s instead of 110-120k files/s)....
It's the memory demand caused by the stack tracer that is causing the
swapping, and the slowdown is just the overhead of the tracer. 2.6.38
doesn't swap very much at all, 2.6.39 swaps a bit more, and
3.0-rc7 is about the same....
IOWs the act of measuring stack usage causes the worst case stack
usage for that workload on 2.6.39 and 3.0-rc7.
Cheers,
Dave
--
Dave Chinner
david@fromorbit.com
Thread overview: 114+ messages
2011-07-13 14:31 [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
2011-07-13 14:31 ` [PATCH 1/5] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2011-07-13 23:34 ` Dave Chinner
2011-07-14 6:17 ` Mel Gorman
2011-07-14 1:38 ` KAMEZAWA Hiroyuki
2011-07-14 4:46 ` Christoph Hellwig
2011-07-14 4:46 ` KAMEZAWA Hiroyuki
2011-07-14 15:07 ` Christoph Hellwig
2011-07-14 23:55 ` KAMEZAWA Hiroyuki
2011-07-15 2:22 ` Dave Chinner
2011-07-18 2:22 ` Dave Chinner
2011-07-18 3:06 ` Dave Chinner
2011-07-14 6:19 ` Mel Gorman
2011-07-14 6:17 ` KAMEZAWA Hiroyuki
2011-07-13 14:31 ` [PATCH 2/5] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
2011-07-13 23:37 ` Dave Chinner
2011-07-14 6:29 ` Mel Gorman
2011-07-14 11:52 ` Dave Chinner
2011-07-14 13:17 ` Mel Gorman
2011-07-15 3:12 ` Dave Chinner
2011-07-13 14:31 ` [PATCH 3/5] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
2011-07-13 23:41 ` Dave Chinner
2011-07-14 6:33 ` Mel Gorman
2011-07-13 14:31 ` [PATCH 4/5] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
2011-07-13 16:40 ` Johannes Weiner
2011-07-13 17:15 ` Mel Gorman
2011-07-13 14:31 ` [PATCH 5/5] mm: writeback: Prioritise dirty inodes encountered by direct reclaim for background flushing Mel Gorman
2011-07-13 21:39 ` Jan Kara
2011-07-14 0:09 ` Dave Chinner
2011-07-14 7:03 ` Mel Gorman
2011-07-13 23:56 ` Dave Chinner
2011-07-14 7:30 ` Mel Gorman
2011-07-14 15:09 ` Christoph Hellwig
2011-07-14 15:49 ` Mel Gorman
2011-07-13 15:31 ` [RFC PATCH 0/5] Reduce filesystem writeback from page reclaim (again) Mel Gorman
2011-07-14 0:33 ` Dave Chinner
2011-07-14 4:51 ` Christoph Hellwig
2011-07-14 7:37 ` Mel Gorman