* [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
From: Mel Gorman @ 2010-08-26 15:14 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
Wu Fengguang, Jan Kara, linux-kernel
congestion_wait() is a bit stupid in that it goes to sleep even when there
is no congestion. This causes stalls in a number of situations and may be
partially responsible for bug reports about desktop interactivity.
This patch series aims to account for these unnecessary congestion_wait()
calls and to avoid going to sleep when there is no congestion. Patches
1 and 2 add instrumentation related to congestion which should be reusable
by alternative solutions to congestion_wait(). Patch 3 calls cond_resched()
instead of going to sleep if there is no congestion.
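As a userspace sketch of the patch 3 idea (the patch itself is posted separately; the counter name here is illustrative, standing in for the congested-BDI count that patch 2 introduces): when nothing is congested, yield the CPU instead of sleeping for the full timeout.

```c
#include <sched.h>
#include <stdatomic.h>
#include <unistd.h>

/* Illustrative stand-in for patch 2's congested-BDI counter */
static atomic_int nr_congested;

/*
 * Sketch only: congestion_wait() normally sleeps for the full timeout.
 * If nothing is congested, yield instead (analogous to cond_resched()).
 * Returns 1 if we slept, 0 if we only yielded.
 */
static int congestion_wait_sketch(long timeout_ms)
{
	if (atomic_load(&nr_congested) == 0) {
		sched_yield();			/* cond_resched() analogue */
		return 0;
	}
	usleep(timeout_ms * 1000);		/* io_schedule_timeout() analogue */
	return 1;
}
```

The real kernel path still sleeps on the congestion waitqueue when a BDI is congested; only the uncongested case changes behaviour.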
Once again, I shoved this through a performance test. Unlike previous tests,
I ran this on a ported version of my usual test suite that should be suitable
for release soon. It's not quite as good as my old set but it's sufficient
for this and related series. The tests I ran were kernbench, vmr-stream,
iozone, hackbench-sockets, hackbench-pipes, netperf-udp, netperf-tcp, sysbench
and stress-highalloc. Sysbench was a read/write test and stress-highalloc is
the usual test that stresses how many high-order allocations can be made while
the system is under severe stress. The suite contains the necessary analysis
scripts as well and I'd release it now except the documentation blows.
x86: Intel Pentium D 3GHz with 3GB RAM (no-brand machine)
x86-64: AMD Phenom 9950 1.3GHz with 3GB RAM (no-brand machine)
ppc64: PPC970MP 2.5GHz with 3GB RAM (a Terrasoft PowerStation)
The disks on all of them were single disks and not particularly fast.
Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
instrumentation and a second test with patch 3 applied.
In all cases, kernbench, hackbench, STREAM and iozone did not show any
performance difference because none of them were pressuring the system
enough to be calling congestion_wait() so I won't post the results.
About all worth noting for them is that nothing horrible appeared to break.
In the analysis scripts, I define an unnecessary sleep as a sleep that
occurred when there was no congestion. The post-processing scripts for
cond_resched() will only count an uncongested call to congestion_wait() as
unnecessary if the process actually gets scheduled away. Ordinarily, we'd
expect it to continue uninterrupted.
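The post-processing scripts themselves aren't shown here, but the classification rule can be sketched against the tracepoint format patch 2 emits ("usec_delayed=%u unnecessary=%d"); the rule below is a simplification of the one described above, counting an event only when it was both uncongested and actually delayed.

```c
#include <stdio.h>

/*
 * Sketch of the post-processing rule, assuming trace lines carry the
 * patch 2 format "usec_delayed=<n> unnecessary=<0|1>". An uncongested
 * congestion_wait() only counts as an unnecessary sleep if the caller
 * was actually delayed (i.e. got scheduled away).
 */
static int count_unnecessary(const char *lines[], int n)
{
	int count = 0;

	for (int i = 0; i < n; i++) {
		unsigned int usec;
		int unnecessary;

		if (sscanf(lines[i], "usec_delayed=%u unnecessary=%d",
			   &usec, &unnecessary) != 2)
			continue;	/* not a congest_waited event */
		if (unnecessary && usec > 0)
			count++;
	}
	return count;
}
```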
One vague concern I have is the case where too many pages are isolated and
we call congestion_wait(). That path could now actively spin in its loop for
a full quantum before calling cond_resched(). If it's being called with no
congestion, it's hard to know what the proper thing to do there is.
X86
Sysbench on this machine was not stressed enough to call congestion_wait(),
so I'll just discuss the stress-highalloc test. This is the full report
from the test suite:
STRESS-HIGHALLOC
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Pass 1 70.00 ( 0.00%) 72.00 ( 2.00%)
Pass 2 72.00 ( 0.00%) 72.00 ( 0.00%)
At Rest 74.00 ( 0.00%) 73.00 (-1.00%)
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Direct reclaims 409 755
Direct reclaim pages scanned 185585 212524
Direct reclaim write file async I/O 442 554
Direct reclaim write anon async I/O 31789 27074
Direct reclaim write file sync I/O 17 23
Direct reclaim write anon sync I/O 17825 15013
Wake kswapd requests 895 1274
Kswapd wakeups 387 432
Kswapd pages scanned 16373859 12892992
Kswapd reclaim write file async I/O 29267 18188
Kswapd reclaim write anon async I/O 1243386 1080234
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 4479.04 3446.81
Time kswapd awake (seconds) 2229.99 1218.52
Total pages scanned 16559444 13105516
%age total pages scanned/written 7.99% 8.71%
%age file pages scanned/written 0.18% 0.14%
Percentage Time Spent Direct Reclaim 74.99% 69.54%
Percentage Time kswapd Awake 41.78% 28.57%
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 474 38
Direct number schedule waited 0 9478
Direct time congest waited 21564ms 3732ms
Direct time schedule waited 0ms 4ms
Direct unnecessary wait 434 1
KSwapd number congest waited 68 0
KSwapd number schedule waited 0 0
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 5424ms 0ms
Kswapd unnecessary wait 44 0
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1493.97 1509.88
Total Elapsed Time (seconds) 5337.71 4265.07
Allocations under stress were slightly better but by and large there is no
significant difference in success rates. The test completed 1072 seconds
faster, which is a pretty decent speedup.
Scanning rates in reclaim were higher, but that is somewhat expected because
we weren't going to sleep as much. Time stalled in reclaim for both direct
reclaim and kswapd was reduced, which is pretty significant.
In terms of congestion_wait(), the time spent asleep was massively reduced:
by 17 seconds for direct reclaim and 5 seconds for kswapd. cond_resched()
is called a number of times instead, of course, but the time spent waiting
on the scheduler was a mere 4ms. Overall, this looked positive.
X86-64
Sysbench again wasn't under enough pressure, so here is the high-order allocation test.
STRESS-HIGHALLOC
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Pass 1 69.00 ( 0.00%) 73.00 ( 4.00%)
Pass 2 71.00 ( 0.00%) 74.00 ( 3.00%)
At Rest 72.00 ( 0.00%) 75.00 ( 3.00%)
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Direct reclaims 646 1091
Direct reclaim pages scanned 94779 102392
Direct reclaim write file async I/O 164 216
Direct reclaim write anon async I/O 12162 15413
Direct reclaim write file sync I/O 64 45
Direct reclaim write anon sync I/O 5366 6987
Wake kswapd requests 3950 3912
Kswapd wakeups 613 579
Kswapd pages scanned 7544412 7267203
Kswapd reclaim write file async I/O 14660 16256
Kswapd reclaim write anon async I/O 964824 1065445
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 3279.00 3564.59
Time kswapd awake (seconds) 1445.70 1870.70
Total pages scanned 7639191 7369595
%age total pages scanned/written 13.05% 14.99%
%age file pages scanned/written 0.19% 0.22%
Percentage Time Spent Direct Reclaim 70.48% 72.04%
Percentage Time kswapd Awake 35.62% 42.94%
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 801 97
Direct number schedule waited 0 16079
Direct time congest waited 37448ms 9004ms
Direct time schedule waited 0ms 0ms
Direct unnecessary wait 696 0
KSwapd number congest waited 10 1
KSwapd number schedule waited 0 0
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 900ms 100ms
Kswapd unnecessary wait 6 0
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1373.11 1383.7
Total Elapsed Time (seconds) 4058.33 4356.47
Success rates were slightly higher again; not by a massive amount, but some.
Time to complete the test was unfortunately increased slightly though and
I'm not sure where that is coming from. The increased number of successful
allocations would account for some of it because the system is under
greater memory pressure as a result of those allocations.
Scanning rates are comparable. Writing back files from reclaim was slightly
increased, which I believe is due to less time being spent asleep, so there
was a smaller window for the flusher threads to do their work. Reducing
that is the responsibility of another series.
Again, the time spent asleep in congestion_wait() is reduced by a large
amount - 28 seconds for direct reclaim - and none of the cond_resched()
calls resulted in measurable sleep time.
Overall, this seems reasonable.
PPC64
Unlike the other two machines, sysbench called congestion_wait() a few times,
so here are the full results for sysbench.
SYSBENCH
sysbench-traceonly-v1r1 sysbench-nocongest-v1r1
traceonly-v1r1 nocongest-v1r1
1 5307.36 ( 0.00%) 5349.58 ( 0.79%)
2 9886.45 ( 0.00%) 10274.78 ( 3.78%)
3 14165.01 ( 0.00%) 14210.64 ( 0.32%)
4 16239.12 ( 0.00%) 16201.46 (-0.23%)
5 15337.09 ( 0.00%) 15541.56 ( 1.32%)
6 14763.64 ( 0.00%) 15805.80 ( 6.59%)
7 14216.69 ( 0.00%) 15023.57 ( 5.37%)
8 13749.62 ( 0.00%) 14492.34 ( 5.12%)
9 13647.75 ( 0.00%) 13969.77 ( 2.31%)
10 13275.70 ( 0.00%) 13495.08 ( 1.63%)
11 13324.91 ( 0.00%) 12879.81 (-3.46%)
12 13169.23 ( 0.00%) 12967.36 (-1.56%)
13 12896.20 ( 0.00%) 12981.43 ( 0.66%)
14 12793.44 ( 0.00%) 12768.26 (-0.20%)
15 12627.98 ( 0.00%) 12522.86 (-0.84%)
16 12228.54 ( 0.00%) 12352.07 ( 1.00%)
FTrace Reclaim Statistics: vmscan
sysbench-traceonly-v1r1 sysbench-nocongest-v1r1
traceonly-v1r1 nocongest-v1r1
Direct reclaims 0 0
Direct reclaim pages scanned 0 0
Direct reclaim write file async I/O 0 0
Direct reclaim write anon async I/O 0 0
Direct reclaim write file sync I/O 0 0
Direct reclaim write anon sync I/O 0 0
Wake kswapd requests 0 0
Kswapd wakeups 202 194
Kswapd pages scanned 5990987 5618709
Kswapd reclaim write file async I/O 24 16
Kswapd reclaim write anon async I/O 1509 1564
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 0.00 0.00
Time kswapd awake (seconds) 174.23 152.17
Total pages scanned 5990987 5618709
%age total pages scanned/written 0.03% 0.03%
%age file pages scanned/written 0.00% 0.00%
Percentage Time Spent Direct Reclaim 0.00% 0.00%
Percentage Time kswapd Awake 2.80% 2.60%
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 0 0
Direct number schedule waited 0 0
Direct time congest waited 0ms 0ms
Direct time schedule waited 0ms 0ms
Direct unnecessary wait 0 0
KSwapd number congest waited 10 3
KSwapd number schedule waited 0 0
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 800ms 300ms
Kswapd unnecessary wait 6 0
Performance is improved by a decent margin, although I didn't check whether
it was statistically significant. The time kswapd spent asleep was
slightly reduced.
STRESS-HIGHALLOC
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Pass 1 40.00 ( 0.00%) 35.00 (-5.00%)
Pass 2 50.00 ( 0.00%) 45.00 (-5.00%)
At Rest 61.00 ( 0.00%) 64.00 ( 3.00%)
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc
traceonly-v1r1 nocongest-v1r1
Direct reclaims 166 926
Direct reclaim pages scanned 167920 183644
Direct reclaim write file async I/O 391 412
Direct reclaim write anon async I/O 31563 31986
Direct reclaim write file sync I/O 54 52
Direct reclaim write anon sync I/O 21696 17087
Wake kswapd requests 123 128
Kswapd wakeups 143 143
Kswapd pages scanned 3899414 4229450
Kswapd reclaim write file async I/O 12392 13098
Kswapd reclaim write anon async I/O 673260 709817
Kswapd reclaim write file sync I/O 0 0
Kswapd reclaim write anon sync I/O 0 0
Time stalled direct reclaim (seconds) 1595.13 1692.18
Time kswapd awake (seconds) 1114.00 1210.48
Total pages scanned 4067334 4413094
%age total pages scanned/written 18.18% 17.50%
%age file pages scanned/written 0.32% 0.31%
Percentage Time Spent Direct Reclaim 45.89% 47.50%
Percentage Time kswapd Awake 46.09% 48.04%
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 233 16
Direct number schedule waited 0 1323
Direct time congest waited 10164ms 1600ms
Direct time schedule waited 0ms 0ms
Direct unnecessary wait 218 0
KSwapd number congest waited 11 13
KSwapd number schedule waited 0 3
KSwapd time schedule waited 0ms 0ms
KSwapd time congest waited 1100ms 1244ms
Kswapd unnecessary wait 0 0
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1880.56 1870.36
Total Elapsed Time (seconds) 2417.17 2519.51
Allocation success rates are slightly down, but on PPC64 high-order
allocations are always very difficult and I have other ideas on how the
success rates could be improved.
What is more important is that, again, the time spent asleep due to
congestion_wait() was reduced for direct reclaimers.
The results here aren't as positive as on the other two machines, but they
still seem acceptable.
Broadly speaking, I think sleeping in congestion_wait() has been responsible
for some bugs related to stalls under heavy IO, particularly read IO,
so we need to do something about it. These tests seem overall positive, but
it'd be interesting to hear whether someone with a workload that stalls
unnecessarily in congestion_wait() is helped by this patch. Desktop
interactivity would be harder to test because I think it has multiple root
causes, of which congestion_wait() is just one. I've included Christian
Ehrhardt in the cc because he had a bug back in April that was rooted in
congestion_wait() that I think this might help, and hopefully he can provide
hard data for a workload with lots of IO but constrained memory. I cc'd
Johannes because we were discussing congestion_wait() at LSF/MM and he might
have some thoughts, and I think I was talking to Jan briefly about
congestion_wait() as well. As this affects writeback, Wu and fsdevel might
have some opinions.
include/trace/events/writeback.h | 22 ++++++++++++++++++++++
mm/backing-dev.c | 31 ++++++++++++++++++++++++++-----
2 files changed, 48 insertions(+), 5 deletions(-)
* [PATCH 1/3] writeback: Account for time spent congestion_waited
From: Mel Gorman @ 2010-08-26 15:14 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
Wu Fengguang, Jan Kara, linux-kernel
There is strong evidence that a lot of time is being spent in
congestion_wait(), some of it unnecessarily. This patch adds a
tracepoint for congestion_wait() to record when a wait occurred
and how long was spent in it.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/writeback.h | 17 +++++++++++++++++
mm/backing-dev.c | 4 ++++
2 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index f345f66..e3bee61 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -153,6 +153,23 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_written);
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+TRACE_EVENT(writeback_congest_waited,
+
+ TP_PROTO(unsigned int usec_delayed),
+
+ TP_ARGS(usec_delayed),
+
+ TP_STRUCT__entry(
+ __field( unsigned int, usec_delayed )
+ ),
+
+ TP_fast_assign(
+ __entry->usec_delayed = usec_delayed;
+ ),
+
+ TP_printk("usec_delayed=%u", __entry->usec_delayed)
+);
+
#endif /* _TRACE_WRITEBACK_H */
/* This part must be outside protection */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index eaa4a5b..7ae33e2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -759,12 +759,16 @@ EXPORT_SYMBOL(set_bdi_congested);
long congestion_wait(int sync, long timeout)
{
long ret;
+ unsigned long start = jiffies;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
+
+ trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+
return ret;
}
EXPORT_SYMBOL(congestion_wait);
--
1.7.1
* [PATCH 2/3] writeback: Record if the congestion was unnecessary
From: Mel Gorman @ 2010-08-26 15:14 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
Wu Fengguang, Jan Kara, linux-kernel
If congestion_wait() is called when there is no congestion, the caller
waits for the full timeout. This can cause unreasonable and
unnecessary stalls. There are a number of potential modifications that
could be made to wake sleepers, but this patch first measures how serious
the problem is. It keeps a count of how many congested BDIs there are. If
congestion_wait() is called with no BDIs congested, the tracepoint
records that the wait was unnecessary.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/writeback.h | 11 ++++++++---
mm/backing-dev.c | 15 ++++++++++++---
2 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e3bee61..03bb04b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
TRACE_EVENT(writeback_congest_waited,
- TP_PROTO(unsigned int usec_delayed),
+ TP_PROTO(unsigned int usec_delayed, bool unnecessary),
- TP_ARGS(usec_delayed),
+ TP_ARGS(usec_delayed, unnecessary),
TP_STRUCT__entry(
__field( unsigned int, usec_delayed )
+ __field( unsigned int, unnecessary )
),
TP_fast_assign(
__entry->usec_delayed = usec_delayed;
+ __entry->unnecessary = unnecessary;
),
- TP_printk("usec_delayed=%u", __entry->usec_delayed)
+ TP_printk("usec_delayed=%u unnecessary=%d",
+ __entry->usec_delayed,
+ __entry->unnecessary
+ )
);
#endif /* _TRACE_WRITEBACK_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7ae33e2..a49167f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+static atomic_t nr_bdi_congested[2];
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
wait_queue_head_t *wqh = &congestion_wqh[sync];
bit = sync ? BDI_sync_congested : BDI_async_congested;
- clear_bit(bit, &bdi->state);
+ if (test_and_clear_bit(bit, &bdi->state))
+ atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_clear_bit();
if (waitqueue_active(wqh))
wake_up(wqh);
@@ -743,7 +745,8 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
enum bdi_state bit;
bit = sync ? BDI_sync_congested : BDI_async_congested;
- set_bit(bit, &bdi->state);
+ if (!test_and_set_bit(bit, &bdi->state))
+ atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
@@ -760,14 +763,20 @@ long congestion_wait(int sync, long timeout)
{
long ret;
unsigned long start = jiffies;
+ bool unnecessary = false;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
+ /* Check if this call to congestion_wait was necessary */
+ if (atomic_read(&nr_bdi_congested[sync]) == 0)
+ unnecessary = true;
+
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
- trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start));
+ trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
+ unnecessary);
return ret;
}
--
1.7.1
^ permalink raw reply related [flat|nested] 76+ messages in thread
* [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 15:14 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 15:14 UTC (permalink / raw)
To: linux-mm, linux-fsdevel
Cc: Mel Gorman, Andrew Morton, Christian Ehrhardt, Johannes Weiner,
Wu Fengguang, Jan Kara, linux-kernel
If congestion_wait() is called with no BDIs congested, the caller will
sleep for the full timeout, which is an unnecessary stall. This patch
checks whether any BDIs are congested. If so, it goes to sleep as normal.
If not, it calls cond_resched() to ensure the caller is not hogging the
CPU longer than its quota, but otherwise it will not sleep.
This is aimed at reducing some of the major desktop stalls reported during
IO. For example, while kswapd is operating, it calls congestion_wait()
but it could just have been reclaiming clean page cache pages with no
congestion. Without this patch, it would sleep for a full timeout but after
this patch, it'll just call schedule() if it has been on the CPU too long.
Similar logic applies to direct reclaimers that are not making enough
progress.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/backing-dev.c | 20 ++++++++++++++------
1 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index a49167f..6abe860 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
- /* Check if this call to congestion_wait was necessary */
- if (atomic_read(&nr_bdi_congested[sync]) == 0)
+ /*
+ * If there is no congestion, there is no point sleeping on the queue.
+ * This call was unnecessary but in case we are spinning due to a bad
+ * caller, at least call cond_resched() and sleep if our CPU quota
+ * has expired
+ */
+ if (atomic_read(&nr_bdi_congested[sync]) == 0) {
unnecessary = true;
-
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- ret = io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
+ cond_resched();
+ ret = 0;
+ } else {
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = io_schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+ }
trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
unnecessary);
--
1.7.1
^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 17:20 ` Minchan Kim
0 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-26 17:20 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki
On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> congestion_wait() is a bit stupid in that it goes to sleep even when there
> is no congestion. This causes stalls in a number of situations and may be
> partially responsible for bug reports about desktop interactivity.
>
> This patch series aims to account for these unnecessary congestion_waits()
> and to avoid going to sleep when there is no congestion available. Patches
> 1 and 2 add instrumentation related to congestion which should be reuable
> by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> instead of going to sleep if there is no congestion.
>
> Once again, I shoved this through performance test. Unlike previous tests,
> I ran this on a ported version of my usual test-suite that should be suitable
> for release soon. It's not quite as good as my old set but it's sufficient
> for this and related series. The tests I ran were kernbench vmr-stream
> iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> stress-highalloc. Sysbench was a read/write tests and stress-highalloc is
> the usual stress the number of high order allocations that can be made while
> the system is under severe stress. The suite contains the necessary analysis
> scripts as well and I'd release it now except the documentation blows.
>
> x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
>
> The disks on all of them were single disks and not particularly fast.
>
> Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> instrumentation and a second test with patch 3 applied.
>
> In all cases, kernbench, hackbench, STREAM and iozone did not show any
> performance difference because none of them were pressuring the system
> enough to be calling congestion_wait() so I won't post the results.
> About all worth noting for them is that nothing horrible appeared to break.
>
> In the analysis scripts, I record unnecessary sleeps to be a sleep that
> had no congestion. The post-processing scripts for cond_resched() will only
> count an uncongested call to congestion_wait() as unnecessary if the process
> actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
>
> One vague concern I have is when too many pages are isolated, we call
> congestion_wait(). This could now actively spin in the loop for its quanta
> before calling cond_resched(). If it's calling with no congestion, it's
> hard to know what the proper thing to do there is.
Suddenly, many processes could enter the direct reclaim path for another
reason (e.g. a fork bomb) regardless of congestion; backing dev congestion
is just one of them.
I think that if congestion_wait() returns without calling
io_schedule_timeout() because of your patch, too_many_isolated() can call
schedule_timeout() to wait for the system to calm down, preventing OOM
killing.
How about this?
If you don't mind, I will send the patch based on this patch series after
your patches settle down, or could you add it to your series?
I admit, though, that this hardly affects your experiment.
From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan.kim@gmail.com>
Date: Fri, 27 Aug 2010 02:06:45 +0900
Subject: [PATCH] Wait regardless of congestion if too many pages are isolated
Suddenly, many processes could enter the direct reclaim path regardless
of congestion; backing dev congestion is just one possible cause. But the
current implementation calls congestion_wait() if too many pages are
isolated.
If congestion_wait() returns without calling io_schedule_timeout(),
too_many_isolated() can call schedule_timeout() to wait for the system to
calm down, preventing OOM killing.
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
---
mm/backing-dev.c | 5 ++---
mm/compaction.c | 6 +++++-
mm/vmscan.c | 6 +++++-
3 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6abe860..9431bca 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
- * write congestion. If no backing_devs are congested then just wait for the
- * next write to be completed.
+ * write congestion. If no backing_devs are congested then just returns.
*/
long congestion_wait(int sync, long timeout)
{
@@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
if (atomic_read(&nr_bdi_congested[sync]) == 0) {
unnecessary = true;
cond_resched();
- ret = 0;
+ ret = timeout;
} else {
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/compaction.c b/mm/compaction.c
index 94cce51..7370683 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
* delay for some time until fewer pages are isolated
*/
while (unlikely(too_many_isolated(zone))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ long timeout = HZ/10;
+ if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(timeout);
+ }
if (fatal_signal_pending(current))
return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3109ff7..f5e3e28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_dirty;
while (unlikely(too_many_isolated(zone, file, sc))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ long timeout = HZ/10;
+ if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(timeout);
+ }
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
1.7.0.5
--
Kind regards,
Minchan Kim
^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: [PATCH 1/3] writeback: Account for time spent congestion_waited
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 17:23 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-26 17:23 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
I think adding the tracepoint is enough, at least until this issue is solved.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-26 17:20 ` Minchan Kim
@ 2010-08-26 17:31 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 17:31 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki
On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> > congestion_wait() is a bit stupid in that it goes to sleep even when there
> > is no congestion. This causes stalls in a number of situations and may be
> > partially responsible for bug reports about desktop interactivity.
> >
> > This patch series aims to account for these unnecessary congestion_waits()
> > and to avoid going to sleep when there is no congestion available. Patches
> > 1 and 2 add instrumentation related to congestion which should be reuable
> > by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> > instead of going to sleep if there is no congestion.
> >
> > Once again, I shoved this through performance test. Unlike previous tests,
> > I ran this on a ported version of my usual test-suite that should be suitable
> > for release soon. It's not quite as good as my old set but it's sufficient
> > for this and related series. The tests I ran were kernbench vmr-stream
> > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is
> > the usual stress the number of high order allocations that can be made while
> > the system is under severe stress. The suite contains the necessary analysis
> > scripts as well and I'd release it now except the documentation blows.
> >
> > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
> >
> > The disks on all of them were single disks and not particularly fast.
> >
> > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> > instrumentation and a second test with patch 3 applied.
> >
> > In all cases, kernbench, hackbench, STREAM and iozone did not show any
> > performance difference because none of them were pressuring the system
> > enough to be calling congestion_wait() so I won't post the results.
> > About all worth noting for them is that nothing horrible appeared to break.
> >
> > In the analysis scripts, I record unnecessary sleeps to be a sleep that
> > had no congestion. The post-processing scripts for cond_resched() will only
> > count an uncongested call to congestion_wait() as unnecessary if the process
> > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
> >
> > One vague concern I have is when too many pages are isolated, we call
> > congestion_wait(). This could now actively spin in the loop for its quanta
> > before calling cond_resched(). If it's calling with no congestion, it's
> > hard to know what the proper thing to do there is.
>
> Suddenly, many processes could enter into the direct reclaim path by another
> reason(ex, fork bomb) regradless of congestion. backing dev congestion is
> just one of them.
>
This situation applies with or without this series, right?
> I think if congestion_wait returns without calling io_schedule_timeout
> by your patch, too_many_isolated can schedule_timeout to wait for the system's
> calm to preventing OOM killing.
>
More likely, to stop a loop in too_many_isolated() consuming CPU time it
can do nothing with.
> How about this?
>
> If you don't mind, I will send the patch based on this patch series
> after your patch settle down or Could you add this to your patch series?
> But I admit this doesn't almost affect your experiment.
>
I think it's a related topic, so it could belong with the series.
> From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Fri, 27 Aug 2010 02:06:45 +0900
> Subject: [PATCH] Wait regardless of congestion if too many pages are isolated
>
> Suddenly, many processes could enter into the direct reclaim path
> regradless of congestion. backing dev congestion is just one of them.
> But current implementation calls congestion_wait if too many pages are isolated.
>
> if congestion_wait returns without calling io_schedule_timeout,
> too_many_isolated can schedule_timeout to wait for the system's calm
> to preventing OOM killing.
>
I think the reasoning here might be a little off. How about:
If many processes enter direct reclaim or memory compaction, too many pages
can get isolated. In this situation, too_many_isolated() can call
congestion_wait() but if there is no congestion, it fails to go to sleep
and instead spins until its quota expires.
This patch checks if congestion_wait() returned without sleeping. If it
did because there was no congestion, it unconditionally goes to sleep
instead of hogging the CPU.
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> ---
> mm/backing-dev.c | 5 ++---
> mm/compaction.c | 6 +++++-
> mm/vmscan.c | 6 +++++-
> 3 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 6abe860..9431bca 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
> * @timeout: timeout in jiffies
> *
> * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> - * write congestion. If no backing_devs are congested then just wait for the
> - * next write to be completed.
> + * write congestion. If no backing_devs are congested then just returns.
> */
> long congestion_wait(int sync, long timeout)
> {
> @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
> if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> unnecessary = true;
> cond_resched();
> - ret = 0;
> + ret = timeout;
> } else {
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = io_schedule_timeout(timeout);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 94cce51..7370683 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
> * delay for some time until fewer pages are isolated
> */
> while (unlikely(too_many_isolated(zone))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
We don't really need the timeout variable here but I see what you are
getting at. It's unfortunate to just go to sleep for HZ/10 but if it's not
congestion, we do not have any other event to wake up on at the moment.
We'd have to introduce a too_many_isolated waitqueue that is kicked if
pages are put back on the LRU.
This is better than spinning though.
> if (fatal_signal_pending(current))
> return 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3109ff7..f5e3e28 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> unsigned long nr_dirty;
> while (unlikely(too_many_isolated(zone, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
This seems very reasonable. I'll review it more carefully tomorrow and if I
spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting
the too_many_isolated() case but I cannot think of a better alternative
without adding more waitqueues.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
@ 2010-08-26 17:31 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 17:31 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki
On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> > congestion_wait() is a bit stupid in that it goes to sleep even when there
> > is no congestion. This causes stalls in a number of situations and may be
> > partially responsible for bug reports about desktop interactivity.
> >
> > This patch series aims to account for these unnecessary congestion_waits()
> > and to avoid going to sleep when there is no congestion available. Patches
> > 1 and 2 add instrumentation related to congestion which should be reuable
> > by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> > instead of going to sleep if there is no congestion.
> >
> > Once again, I shoved this through performance test. Unlike previous tests,
> > I ran this on a ported version of my usual test-suite that should be suitable
> > for release soon. It's not quite as good as my old set but it's sufficient
> > for this and related series. The tests I ran were kernbench vmr-stream
> > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> > stress-highalloc. Sysbench was a read/write tests and stress-highalloc is
> > the usual stress the number of high order allocations that can be made while
> > the system is under severe stress. The suite contains the necessary analysis
> > scripts as well and I'd release it now except the documentation blows.
> >
> > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
> >
> > The disks on all of them were single disks and not particularly fast.
> >
> > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> > instrumentation and a second test with patch 3 applied.
> >
> > In all cases, kernbench, hackbench, STREAM and iozone did not show any
> > performance difference because none of them were pressuring the system
> > enough to be calling congestion_wait() so I won't post the results.
> > About all worth noting for them is that nothing horrible appeared to break.
> >
> > In the analysis scripts, I record unnecessary sleeps to be a sleep that
> > had no congestion. The post-processing scripts for cond_resched() will only
> > count an uncongested call to congestion_wait() as unnecessary if the process
> > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
> >
> > One vague concern I have is when too many pages are isolated, we call
> > congestion_wait(). This could now actively spin in the loop for its quanta
> > before calling cond_resched(). If it's calling with no congestion, it's
> > hard to know what the proper thing to do there is.
>
> Suddenly, many processes could enter into the direct reclaim path by another
> reason(ex, fork bomb) regradless of congestion. backing dev congestion is
> just one of them.
>
This situation applies with or without this series, right?
> I think if congestion_wait returns without calling io_schedule_timeout
> by your patch, too_many_isolated can schedule_timeout to wait for the system's
> calm to preventing OOM killing.
>
More likely, it is to stop a loop in too_many_isolated() from consuming
CPU time it can do nothing useful with.
> How about this?
>
> If you don't mind, I will send the patch based on this patch series
> after your patch settle down or Could you add this to your patch series?
> But I admit this doesn't almost affect your experiment.
>
I think it's a related topic so could belong with the series.
> From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan.kim@gmail.com>
> Date: Fri, 27 Aug 2010 02:06:45 +0900
> Subject: [PATCH] Wait regardless of congestion if too many pages are isolated
>
> Suddenly, many processes could enter into the direct reclaim path
> regradless of congestion. backing dev congestion is just one of them.
> But current implementation calls congestion_wait if too many pages are isolated.
>
> if congestion_wait returns without calling io_schedule_timeout,
> too_many_isolated can schedule_timeout to wait for the system's calm
> to preventing OOM killing.
>
I think the reasoning here might be a little off. How about:
If many processes enter direct reclaim or memory compaction, too many pages
can get isolated. In this situation, too_many_isolated() can call
congestion_wait() but if there is no congestion, it fails to go to sleep
and instead spins until its quota expires.
This patch checks if congestion_wait() returned without sleeping. If it
did because there was no congestion, it unconditionally goes to sleep
instead of hogging the CPU.
> Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> ---
> mm/backing-dev.c | 5 ++---
> mm/compaction.c | 6 +++++-
> mm/vmscan.c | 6 +++++-
> 3 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 6abe860..9431bca 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
> * @timeout: timeout in jiffies
> *
> * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> - * write congestion. If no backing_devs are congested then just wait for the
> - * next write to be completed.
> + * write congestion. If no backing_devs are congested then just returns.
> */
> long congestion_wait(int sync, long timeout)
> {
> @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
> if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> unnecessary = true;
> cond_resched();
> - ret = 0;
> + ret = timeout;
> } else {
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = io_schedule_timeout(timeout);
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 94cce51..7370683 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
> * delay for some time until fewer pages are isolated
> */
> while (unlikely(too_many_isolated(zone))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
We don't really need the timeout variable here but I see what you are
getting at. It's unfortunate to just go to sleep for HZ/10 but if there is
no congestion, we do not have any other event to wake up on at the moment.
We'd have to introduce a too_many_isolated waitqueue that is kicked if
pages are put back on the LRU.
This is better than spinning though.
> if (fatal_signal_pending(current))
> return 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3109ff7..f5e3e28 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> unsigned long nr_dirty;
> while (unlikely(too_many_isolated(zone, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
This seems very reasonable. I'll review it more carefully tomorrow and if I
spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting
the too_many_isolated() case but I cannot think of a better alternative
without adding more waitqueues.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 17:35 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-26 17:35 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> include/trace/events/writeback.h | 11 ++++++++---
> mm/backing-dev.c | 15 ++++++++++++---
> 2 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index e3bee61..03bb04b 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
>
> TRACE_EVENT(writeback_congest_waited,
>
> - TP_PROTO(unsigned int usec_delayed),
> + TP_PROTO(unsigned int usec_delayed, bool unnecessary),
>
> - TP_ARGS(usec_delayed),
> + TP_ARGS(usec_delayed, unnecessary),
>
> TP_STRUCT__entry(
> __field( unsigned int, usec_delayed )
> + __field( unsigned int, unnecessary )
> ),
>
> TP_fast_assign(
> __entry->usec_delayed = usec_delayed;
> + __entry->unnecessary = unnecessary;
> ),
>
> - TP_printk("usec_delayed=%u", __entry->usec_delayed)
> + TP_printk("usec_delayed=%u unnecessary=%d",
> + __entry->usec_delayed,
> + __entry->unnecessary
> + )
> );
>
> #endif /* _TRACE_WRITEBACK_H */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7ae33e2..a49167f 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> +static atomic_t nr_bdi_congested[2];
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? BDI_sync_congested : BDI_async_congested;
> - clear_bit(bit, &bdi->state);
> + if (test_and_clear_bit(bit, &bdi->state))
> + atomic_dec(&nr_bdi_congested[sync]);
Hmm. congestion_wait()'s semantics are "wait for _a_ backing_dev to become
uncongested", but this seems to consider all backing devs. Is that your
intention, or am I missing something?
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 17:38 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-26 17:38 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will
> sleep for the full timeout and this is an unnecessary sleep. This patch
> checks if there are BDIs congested. If so, it goes to sleep as normal.
> If not, it calls cond_resched() to ensure the caller is not hogging the
> CPU longer than its quota but otherwise will not sleep.
>
> This is aimed at reducing some of the major desktop stalls reported during
> IO. For example, while kswapd is operating, it calls congestion_wait()
> but it could just have been reclaiming clean page cache pages with no
> congestion. Without this patch, it would sleep for a full timeout but after
> this patch, it'll just call schedule() if it has been on the CPU too long.
> Similar logic applies to direct reclaimers that are not making enough
> progress.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> mm/backing-dev.c | 20 ++++++++++++++------
> 1 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index a49167f..6abe860 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
The function's description should be changed since we no longer wait for the next write.
> @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
> DEFINE_WAIT(wait);
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> - /* Check if this call to congestion_wait was necessary */
> - if (atomic_read(&nr_bdi_congested[sync]) == 0)
> + /*
> + * If there is no congestion, there is no point sleeping on the queue.
> + * This call was unecessary but in case we are spinning due to a bad
> + * caller, at least call cond_reched() and sleep if our CPU quota
> + * has expired
> + */
> + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> unnecessary = true;
> -
> - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> - ret = io_schedule_timeout(timeout);
> - finish_wait(wqh, &wait);
> + cond_resched();
> + ret = 0;
"ret = timeout" is more proper as considering io_schedule_timeout's return value.
> + } else {
> + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + ret = io_schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> + }
>
> trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> unnecessary);
> --
> 1.7.1
>
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 17:35 ` Minchan Kim
@ 2010-08-26 17:41 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 17:41 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Fri, Aug 27, 2010 at 02:35:34AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > include/trace/events/writeback.h | 11 ++++++++---
> > mm/backing-dev.c | 15 ++++++++++++---
> > 2 files changed, 20 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> > index e3bee61..03bb04b 100644
> > --- a/include/trace/events/writeback.h
> > +++ b/include/trace/events/writeback.h
> > @@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
> >
> > TRACE_EVENT(writeback_congest_waited,
> >
> > - TP_PROTO(unsigned int usec_delayed),
> > + TP_PROTO(unsigned int usec_delayed, bool unnecessary),
> >
> > - TP_ARGS(usec_delayed),
> > + TP_ARGS(usec_delayed, unnecessary),
> >
> > TP_STRUCT__entry(
> > __field( unsigned int, usec_delayed )
> > + __field( unsigned int, unnecessary )
> > ),
> >
> > TP_fast_assign(
> > __entry->usec_delayed = usec_delayed;
> > + __entry->unnecessary = unnecessary;
> > ),
> >
> > - TP_printk("usec_delayed=%u", __entry->usec_delayed)
> > + TP_printk("usec_delayed=%u unnecessary=%d",
> > + __entry->usec_delayed,
> > + __entry->unnecessary
> > + )
> > );
> >
> > #endif /* _TRACE_WRITEBACK_H */
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 7ae33e2..a49167f 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> > };
> > +static atomic_t nr_bdi_congested[2];
> >
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > @@ -731,7 +732,8 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - clear_bit(bit, &bdi->state);
> > + if (test_and_clear_bit(bit, &bdi->state))
> > + atomic_dec(&nr_bdi_congested[sync]);
>
> Hmm.. Now congestion_wait's semantics "wait for _a_ backing_dev to become uncongested"
> But this seems to consider whole backing dev. Is your intention? or Am I missing now?
>
Not the whole backing dev, but all backing devs. This is intentional.
If congestion_wait() is called with 0 BDIs congested, we sleep the full timeout
because a wakeup event will not occur - this is a bad scenario. To know if
0 BDIs were congested, one could either walk all the BDIs checking their
status or maintain a counter like nr_bdi_congested which is what I decided on.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 17:38 ` Minchan Kim
@ 2010-08-26 17:42 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 17:42 UTC (permalink / raw)
To: Minchan Kim
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will
> > sleep for the full timeout and this is an unnecessary sleep. This patch
> > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > If not, it calls cond_resched() to ensure the caller is not hogging the
> > CPU longer than its quota but otherwise will not sleep.
> >
> > This is aimed at reducing some of the major desktop stalls reported during
> > IO. For example, while kswapd is operating, it calls congestion_wait()
> > but it could just have been reclaiming clean page cache pages with no
> > congestion. Without this patch, it would sleep for a full timeout but after
> > this patch, it'll just call schedule() if it has been on the CPU too long.
> > Similar logic applies to direct reclaimers that are not making enough
> > progress.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/backing-dev.c | 20 ++++++++++++++------
> > 1 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index a49167f..6abe860 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
>
> Function's decripton should be changed since we don't wait next write any more.
>
My bad. I need to check that "next write" thing. It doesn't appear to be
happening but maybe that side of things just broke somewhere in the
distant past. I lack context of how this is meant to work so maybe
someone will educate me.
> > @@ -767,13 +767,21 @@ long congestion_wait(int sync, long timeout)
> > DEFINE_WAIT(wait);
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > - /* Check if this call to congestion_wait was necessary */
> > - if (atomic_read(&nr_bdi_congested[sync]) == 0)
> > + /*
> > + * If there is no congestion, there is no point sleeping on the queue.
> > + * This call was unecessary but in case we are spinning due to a bad
> > + * caller, at least call cond_reched() and sleep if our CPU quota
> > + * has expired
> > + */
> > + if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > unnecessary = true;
> > -
> > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > - ret = io_schedule_timeout(timeout);
> > - finish_wait(wqh, &wait);
> > + cond_resched();
> > + ret = 0;
>
> "ret = timeout" is more proper as considering io_schedule_timeout's return value.
>
Good point, will fix.
> > + } else {
> > + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > + ret = io_schedule_timeout(timeout);
> > + finish_wait(wqh, &wait);
> > + }
> >
> > trace_writeback_congest_waited(jiffies_to_usecs(jiffies - start),
> > unnecessary);
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-26 17:31 ` Mel Gorman
@ 2010-08-26 17:50 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-26 17:50 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki
On Thu, Aug 26, 2010 at 06:31:47PM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 02:20:38AM +0900, Minchan Kim wrote:
> > On Thu, Aug 26, 2010 at 04:14:13PM +0100, Mel Gorman wrote:
> > > congestion_wait() is a bit stupid in that it goes to sleep even when there
> > > is no congestion. This causes stalls in a number of situations and may be
> > > partially responsible for bug reports about desktop interactivity.
> > >
> > > This patch series aims to account for these unnecessary congestion_waits()
> > > and to avoid going to sleep when there is no congestion. Patches
> > > 1 and 2 add instrumentation related to congestion which should be reusable
> > > by alternative solutions to congestion_wait. Patch 3 calls cond_resched()
> > > instead of going to sleep if there is no congestion.
> > >
> > > Once again, I shoved this through performance tests. Unlike previous tests,
> > > I ran this on a ported version of my usual test-suite that should be suitable
> > > for release soon. It's not quite as good as my old set but it's sufficient
> > > for this and related series. The tests I ran were kernbench vmr-stream
> > > iozone hackbench-sockets hackbench-pipes netperf-udp netperf-tcp sysbench
> > > stress-highalloc. Sysbench was a read/write test and stress-highalloc is
> > > the usual test stressing the number of high-order allocations that can be made
> > > while the system is under severe stress. The suite contains the necessary analysis
> > > scripts as well and I'd release it now except the documentation blows.
> > >
> > > x86: Intel Pentium D 3GHz with 3G RAM (no-brand machine)
> > > x86-64: AMD Phenom 9950 1.3GHz with 3G RAM (no-brand machine)
> > > ppc64: PPC970MP 2.5GHz with 3GB RAM (it's a terrasoft powerstation)
> > >
> > > The disks on all of them were single disks and not particularly fast.
> > >
> > > Comparison was between a 2.6.36-rc1 with patches 1 and 2 applied for
> > > instrumentation and a second test with patch 3 applied.
> > >
> > > In all cases, kernbench, hackbench, STREAM and iozone did not show any
> > > performance difference because none of them were pressuring the system
> > > enough to be calling congestion_wait() so I won't post the results.
> > > About all worth noting for them is that nothing horrible appeared to break.
> > >
> > > In the analysis scripts, I record an unnecessary sleep as one that
> > > had no congestion. The post-processing scripts for cond_resched() will only
> > > count an uncongested call to congestion_wait() as unnecessary if the process
> > > actually gets scheduled. Ordinarily, we'd expect it to continue uninterrupted.
> > >
> > > One vague concern I have is when too many pages are isolated, we call
> > > congestion_wait(). This could now actively spin in the loop for its quantum
> > > before calling cond_resched(). If it's called with no congestion, it's
> > > hard to know what the proper thing to do there is.
> >
> > Suddenly, many processes could enter the direct reclaim path for another
> > reason (e.g., a fork bomb) regardless of congestion. Backing dev congestion is
> > just one of them.
> >
>
> This situation applies with or without this series, right?
I think the situation applies only with this series. The old behavior called
schedule regardless of I/O congestion, via io_schedule_timeout.
But you are now changing that to call schedule only conditionally.
>
> > I think if congestion_wait returns without calling io_schedule_timeout
> > due to your patch, too_many_isolated can call schedule_timeout to wait for
> > the system to calm down, preventing OOM killing.
> >
>
> More likely, to stop a loop in too_many_isolated() consuming CPU time it
> can do nothing with.
>
> > How about this?
> >
> > If you don't mind, I will send the patch based on this patch series
> > after your patches settle down, or could you add this to your patch series?
> > But I admit this hardly affects your experiment.
> >
>
> I think it's a related topic so could belong with the series.
>
> > From 70d6584e125c3954d74a69bfcb72de17244635d2 Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <minchan.kim@gmail.com>
> > Date: Fri, 27 Aug 2010 02:06:45 +0900
> > Subject: [PATCH] Wait regardless of congestion if too many pages are isolated
> >
> > Suddenly, many processes could enter the direct reclaim path
> > regardless of congestion; backing dev congestion is just one of the reasons.
> > But the current implementation calls congestion_wait if too many pages are isolated.
> >
> > If congestion_wait returns without calling io_schedule_timeout,
> > too_many_isolated can call schedule_timeout to wait for the system to
> > calm down, preventing OOM killing.
> >
>
> I think the reasoning here might be a little off. How about:
>
> If many processes enter direct reclaim or memory compaction, too many pages
> can get isolated. In this situation, too_many_isolated() can call
> congestion_wait() but if there is no congestion, it fails to go to sleep
> and instead spins until its quota expires.
>
> This patch checks if congestion_wait() returned without sleeping. If it
> did because there was no congestion, it unconditionally goes to sleep
> instead of hogging the CPU.
That's good to me. :)
>
> > Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
> > ---
> > mm/backing-dev.c | 5 ++---
> > mm/compaction.c | 6 +++++-
> > mm/vmscan.c | 6 +++++-
> > 3 files changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 6abe860..9431bca 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -756,8 +756,7 @@ EXPORT_SYMBOL(set_bdi_congested);
> > * @timeout: timeout in jiffies
> > *
> > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > - * write congestion. If no backing_devs are congested then just wait for the
> > - * next write to be completed.
> > + * write congestion. If no backing_devs are congested then just returns.
> > */
> > long congestion_wait(int sync, long timeout)
> > {
> > @@ -776,7 +775,7 @@ long congestion_wait(int sync, long timeout)
> > if (atomic_read(&nr_bdi_congested[sync]) == 0) {
> > unnecessary = true;
> > cond_resched();
> > - ret = 0;
> > + ret = timeout;
> > } else {
> > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > ret = io_schedule_timeout(timeout);
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 94cce51..7370683 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
> > * delay for some time until fewer pages are isolated
> > */
> > while (unlikely(too_many_isolated(zone))) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + long timeout = HZ/10;
> > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> > + set_current_state(TASK_INTERRUPTIBLE);
> > + schedule_timeout(timeout);
> > + }
> >
>
> We don't really need the timeout variable here but I see what you are
> getting at. It's unfortunate to just go to sleep for HZ/10 but if there is
> no congestion, we do not have any other event to wake up on at the moment.
> We'd have to introduce a too_many_isolated waitqueue that is kicked if
> pages are put back on the LRU.
I thought of that at first but, before anything else, let's make sure how often
this situation happens and whether it's really a serious problem. I mean, it
may be rather overkill otherwise.
>
> This is better than spinning though.
>
> > if (fatal_signal_pending(current))
> > return 0;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3109ff7..f5e3e28 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> > unsigned long nr_dirty;
> > while (unlikely(too_many_isolated(zone, file, sc))) {
> > - congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + long timeout = HZ/10;
> > + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> > + set_current_state(TASK_INTERRUPTIBLE);
> > + schedule_timeout(timeout);
> > + }
> >
> > /* We are about to die and free our memory. Return now. */
> > if (fatal_signal_pending(current))
>
> This seems very reasonable. I'll review it more carefully tomorrow and if I
> spot nothing horrible, I'll add it onto the series. I'm not sure I'm hitting
> the too_many_isolated() case but I cannot think of a better alternative
> without adding more waitqueues.
Thanks, Mel.
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 1/3] writeback: Account for time spent congestion_waited
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 18:10 ` Johannes Weiner
-1 siblings, 0 replies; 76+ messages in thread
From: Johannes Weiner @ 2010-08-26 18:10 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 04:14:14PM +0100, Mel Gorman wrote:
> There is strong evidence to indicate a lot of time is being spent in
> congestion_wait(), some of it unnecessarily. This patch adds a
> tracepoint for congestion_wait to record when congestion_wait() occurred
> and how long was spent.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 17:42 ` Mel Gorman
@ 2010-08-26 18:17 ` Johannes Weiner
-1 siblings, 0 replies; 76+ messages in thread
From: Johannes Weiner @ 2010-08-26 18:17 UTC (permalink / raw)
To: Mel Gorman
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called with no BDIs congested, the caller will
> > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > CPU longer than its quota but otherwise will not sleep.
> > >
> > > This is aimed at reducing some of the major desktop stalls reported during
> > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > but it could just have been reclaiming clean page cache pages with no
> > > congestion. Without this patch, it would sleep for a full timeout but after
> > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > Similar logic applies to direct reclaimers that are not making enough
> > > progress.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > ---
> > > mm/backing-dev.c | 20 ++++++++++++++------
> > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > index a49167f..6abe860 100644
> > > --- a/mm/backing-dev.c
> > > +++ b/mm/backing-dev.c
> >
> > Function's description should be changed since we don't wait for the next write any more.
> >
>
> My bad. I need to check that "next write" thing. It doesn't appear to be
> happening but maybe that side of things just broke somewhere in the
> distant past. I lack context of how this is meant to work so maybe
> someone will educate me.
On every retired I/O request, the congestion state on the bdi is checked
and the congestion waitqueue is woken up.
So without congestion, we still only wait until the next write
retires, but without any I/O, we sleep the full timeout.
Check __freed_requests() in block/blk-core.c.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-26 18:29 ` Johannes Weiner
-1 siblings, 0 replies; 76+ messages in thread
From: Johannes Weiner @ 2010-08-26 18:29 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> If congestion_wait() is called when there is no congestion, the caller
> will wait for the full timeout. This can cause unreasonable and
> unnecessary stalls. There are a number of potential modifications that
> could be made to wake sleepers but this patch measures how serious the
> problem is. It keeps count of how many congested BDIs there are. If
> congestion_wait() is called with no BDIs congested, the tracepoint will
> record that the wait was unnecessary.
I am not convinced that unnecessary is the right word. On a workload
without any IO (i.e. no congestion_wait() necessary, ever), I noticed
the VM regressing both in time and in reclaiming the right pages when
simply removing congestion_wait() from the direct reclaim paths (the
one in __alloc_pages_slowpath and the other one in
do_try_to_free_pages).
So just being stupid and waiting for the timeout in direct reclaim
while kswapd can make progress seemed to do a better job for that
load.
I cannot exactly pinpoint the reason for that behaviour; it would be
nice if somebody had an idea.
So personally I think it's a good idea to get an insight on the use of
congestion_wait() [patch 1] but I don't agree with changing its
behaviour just yet, or judging its usefulness solely on whether it
correctly waits for bdi congestion.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 18:17 ` Johannes Weiner
@ 2010-08-26 20:23 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 20:23 UTC (permalink / raw)
To: Johannes Weiner
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > CPU longer than its quota but otherwise will not sleep.
> > > >
> > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > but it could just have been reclaiming clean page cache pages with no
> > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > Similar logic applies to direct reclaimers that are not making enough
> > > > progress.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > index a49167f..6abe860 100644
> > > > --- a/mm/backing-dev.c
> > > > +++ b/mm/backing-dev.c
> > >
> > > Function's description should be changed since we no longer wait for the next write.
> > >
> >
> > My bad. I need to check that "next write" thing. It doesn't appear to be
> > happening but maybe that side of things just broke somewhere in the
> > distant past. I lack context of how this is meant to work so maybe
> > someone will educate me.
>
> On every retired io request the congestion state on the bdi is checked
> and the congestion waitqueue woken up.
>
> So without congestion, we still only wait until the next write
> retires, but without any IO, we sleep the full timeout.
>
> Check __freed_requests() in block/blk-core.c.
>
Seems reasonable. Still, if there is no write IO going on and no
congestion, there seems to be no point in sleeping for the full
timeout. It still feels wrong.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
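[Editorial note: the decision patch 3 makes can be sketched as a small userspace C model. This is a sketch under assumptions, not the kernel code: `nr_congested` stands in for the congested-BDI counter the instrumentation patches keep, and the sleep/resched outcomes are represented as return values.]

```c
#include <assert.h>

/* Stand-in for the congested-BDI counter; in the kernel this would be
 * maintained per sync/async queue by the instrumentation patches. */
static int nr_congested;

enum wait_outcome {
	WAIT_SLEPT,	/* io_schedule_timeout(timeout) in the kernel */
	WAIT_RESCHED,	/* cond_resched(): yield only if over CPU quota */
};

/* Model of the patched congestion_wait(): sleep only when at least one
 * BDI is actually congested, otherwise just offer to reschedule. */
static enum wait_outcome congestion_wait_model(void)
{
	if (nr_congested > 0)
		return WAIT_SLEPT;
	return WAIT_RESCHED;
}
```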
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 18:29 ` Johannes Weiner
@ 2010-08-26 20:31 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-26 20:31 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).
>
> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour, it would be
> nice if somebody had an idea.
>
There is a possibility that the behaviour in that case was due to flusher
threads doing the writes rather than direct reclaim queueing pages for IO
in an inefficient manner. So the stall is stupid but happens to work out
well because flusher threads get the chance to do work.
> So personally I think it's a good idea to get an insight on the use of
> congestion_wait() [patch 1] but I don't agree with changing its
> behaviour just yet, or judging its usefulness solely on whether it
> correctly waits for bdi congestion.
>
Unfortunately, I strongly suspect that some of the desktop stalls seen during
IO (one of which involved no writes) were due to calling congestion_wait()
and waiting the full timeout when no writes were going on.
It gets potentially worse too. Let's say we have a system with many BDIs of
different speeds - e.g. an SSD on one end of the spectrum and a USB flash
drive on the other. The congestion for writes could be on the USB flash
drive, but due to low memory, the allocator, direct reclaimers and kswapd
periodically go to sleep in congestion_wait() on behalf of the USB device
even though the bulk of the pages that need reclaiming are backed by the SSD.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
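[Editorial note: the SSD/USB scenario can be made concrete with a toy userspace model. Struct and function names here are illustrative, not kernel API: congestion is only known globally, so reclaim targeting the fast device still sleeps while only the slow device is congested.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: congestion_wait() at this point in time only knows that
 * "some BDI is congested", not which one, so reclaim aimed at pages on
 * a fast, idle device still stalls when a slow device is congested. */
struct bdi_model {
	bool write_congested;
};

static bool any_bdi_congested(const struct bdi_model *bdis, int n)
{
	for (int i = 0; i < n; i++)
		if (bdis[i].write_congested)
			return true;
	return false;
}

/* True when a reclaimer would sleep the full timeout even though the
 * pages it is after are backed by an uncongested device. */
static bool stalls_on_wrong_bdi(const struct bdi_model *bdis, int n,
				int target)
{
	return any_bdi_congested(bdis, n) && !bdis[target].write_congested;
}
```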
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 20:23 ` Mel Gorman
@ 2010-08-27 1:11 ` Wu Fengguang
-1 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2010-08-27 1:11 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, Minchan Kim, linux-mm, linux-fsdevel,
Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel,
Li Shaohua
On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > CPU longer than its quota but otherwise will not sleep.
> > > > >
> > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > progress.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > ---
> > > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > index a49167f..6abe860 100644
> > > > > --- a/mm/backing-dev.c
> > > > > +++ b/mm/backing-dev.c
> > > >
> > > > Function's description should be changed since we no longer wait for the next write.
> > > >
> > >
> > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > happening but maybe that side of things just broke somewhere in the
> > > distant past. I lack context of how this is meant to work so maybe
> > > someone will educate me.
> >
> > On every retired io request the congestion state on the bdi is checked
> > and the congestion waitqueue woken up.
> >
> > So without congestion, we still only wait until the next write
> > retires, but without any IO, we sleep the full timeout.
> >
> > Check __freed_requests() in block/blk-core.c.
> >
>
> Seems reasonable. Still, if there is no write IO going on and no
> congestion there seems to be no point going to sleep for the full
> timeout. It still feels wrong.
Yeah, the stupid sleeping feels wrong. However, there are ~20
congestion_wait() callers spread randomly across the VM, FS and block
drivers. Many of them may have been added by rule of thumb, but what if
some of them happen to depend on the old stupid sleeping behaviour?
Obviously you've done extensive tests on the page reclaim paths, but
that's far from enough to cover the wider changes made by this patch.
We may have to do the conversions case by case, converting to
congestion_wait_check() (see http://lkml.org/lkml/2010/8/18/292) or
other waiting schemes.
Thanks,
Fengguang
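[Editorial note: as a rough illustration of the congestion_wait_check() idea (the name comes from the lkml posting above; its exact semantics are an assumption here), the caller can detect that no sleeping happened because the full timeout is handed back, and then decide for itself whether to sleep anyway.]

```c
#include <assert.h>

static int nr_congested_bdis;	/* assumed congested-BDI count */

/* Sketch: like congestion_wait(), returns the jiffies remaining when
 * the wait finished.  When nothing is congested it returns at once with
 * the full timeout remaining, so "ret == timeout" tells the caller that
 * no sleep took place. */
static long congestion_wait_check(long timeout)
{
	if (nr_congested_bdis == 0)
		return timeout;	/* did not sleep at all */
	return 0;		/* modelled as sleeping the whole timeout */
}

/* Caller-side pattern: report whether an explicit fallback sleep
 * (schedule_timeout() in the kernel) would be needed. */
static int needed_fallback_sleep(long timeout)
{
	return congestion_wait_check(timeout) == timeout;
}
```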
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-26 17:20 ` Minchan Kim
@ 2010-08-27 1:21 ` Wu Fengguang
-1 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2010-08-27 1:21 UTC (permalink / raw)
To: Minchan Kim
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua
Minchan,
It's much cleaner to keep congestion_wait() unchanged and add a
congestion_wait_check() for converting problematic wait sites. The
too_many_isolated() wait is merely a protective mechanism, so I won't
bother improving it at the cost of more code.
Thanks,
Fengguang
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 94cce51..7370683 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
> * delay for some time until fewer pages are isolated
> */
> while (unlikely(too_many_isolated(zone))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
> if (fatal_signal_pending(current))
> return 0;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3109ff7..f5e3e28 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> unsigned long nr_dirty;
> while (unlikely(too_many_isolated(zone, file, sc))) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + long timeout = HZ/10;
> + if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(timeout);
> + }
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
> --
> 1.7.0.5
>
>
> --
> Kind regards,
> Minchan Kim
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-27 1:21 ` Wu Fengguang
@ 2010-08-27 1:41 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-27 1:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua
Hi, Wu.
On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.
You mean something like the following?
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait_check(BLK_RW_ASYNC, HZ/10);
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
return SWAP_CLUSTER_MAX;
}
>
> Thanks,
> Fengguang
>
--
Kind regards,
Minchan Kim
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 18:17 ` Johannes Weiner
@ 2010-08-27 1:42 ` Wu Fengguang
-1 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2010-08-27 1:42 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Jan Kara, linux-kernel, Li Shaohua,
Rik van Riel
On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > CPU longer than its quota but otherwise will not sleep.
> > > >
> > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > but it could just have been reclaiming clean page cache pages with no
> > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > Similar logic applies to direct reclaimers that are not making enough
> > > > progress.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > ---
> > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > index a49167f..6abe860 100644
> > > > --- a/mm/backing-dev.c
> > > > +++ b/mm/backing-dev.c
> > >
> > > Function's description should be changed since we no longer wait for the next write.
> > >
> >
> > My bad. I need to check that "next write" thing. It doesn't appear to be
> > happening but maybe that side of things just broke somewhere in the
> > distant past. I lack context of how this is meant to work so maybe
> > someone will educate me.
>
> On every retired io request the congestion state on the bdi is checked
> and the congestion waitqueue woken up.
>
> So without congestion, we still only wait until the next write
> retires, but without any IO, we sleep the full timeout.
>
> Check __freed_requests() in block/blk-core.c.
congestion_wait() is tightly related to pageout() and writeback;
however, it may also have some intention for the no-IO case:
- if writes are congested, maybe we are doing too much pageout(), so wait.
It might also reduce some get_request_wait() stalls (the normal way
is to explicitly check for congestion before writing out).
- if any write completes, it may free some PG_reclaim pages, so proceed
(when not congested).
- if there is no IO at all, the 100ms sleep might still prevent a page
reclaimer from stealing lots of scheduler slices from a busy computing
program that involves no page allocation at all.
Thanks,
Fengguang
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-27 1:41 ` Minchan Kim
@ 2010-08-27 1:50 ` Wu Fengguang
-1 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2010-08-27 1:50 UTC (permalink / raw)
To: Minchan Kim
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua
On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
> Hi, Wu.
>
> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > Minchan,
> >
> > It's much cleaner to keep the unchanged congestion_wait() and add a
> > congestion_wait_check() for converting problematic wait sites. The
> > too_many_isolated() wait is merely a protective mechanism, I won't
> > bother to improve it at the cost of more code.
>
> Do you mean something like the following?
No, I mean do not change the too_many_isolated() related code at all :)
Use congestion_wait_check() only in other places where we can prove
there is a problem that is rightly fixed by converting to
congestion_wait_check().
> while (unlikely(too_many_isolated(zone, file, sc))) {
> congestion_wait_check(BLK_RW_ASYNC, HZ/10);
>
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
> return SWAP_CLUSTER_MAX;
> }
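For illustration, the return-value convention such a helper would rely on can be sketched like this. This is hypothetical: congestion_wait_check() exists only as a proposal in this thread, and the model below just captures the idea that returning the untouched timeout lets callers detect that no sleep happened.

```c
#include <assert.h>

/* Hypothetical model of the proposed congestion_wait_check(): it
 * returns the remaining timeout. When no BDI is congested it returns
 * immediately with the full timeout, so a caller can detect "no sleep
 * occurred" by testing timeout == congestion_wait_check(...) and fall
 * back to an explicit sleep if it really wants to back off. */
static long congestion_wait_check_model(int nr_congested_bdis, long timeout)
{
	if (nr_congested_bdis == 0)
		return timeout;	/* no congestion: do not sleep at all */
	/* congested: would sleep on the congestion waitqueue for up to
	 * 'timeout'; model the case of sleeping the whole period */
	return 0;
}
```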
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-27 1:50 ` Wu Fengguang
@ 2010-08-27 2:02 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-27 2:02 UTC (permalink / raw)
To: Wu Fengguang
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua
On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
>> Hi, Wu.
>>
>> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
>> > Minchan,
>> >
>> > It's much cleaner to keep the unchanged congestion_wait() and add a
>> > congestion_wait_check() for converting problematic wait sites. The
>> > too_many_isolated() wait is merely a protective mechanism, I won't
>> > bother to improve it at the cost of more code.
>>
>> Do you mean something like the following?
>
> No, I mean do not change the too_many_isolated() related code at all :)
> Use congestion_wait_check() only in other places where we can prove
> there is a problem that is rightly fixed by converting to
> congestion_wait_check().
I always have trouble understanding your comments.
Apparently, my eyes have a problem. ;(
This patch depends on Mel's series.
If congestion_wait() is changed to just return when there is no
congestion, the too_many_isolated() loop would hog the CPU. I think
the same would apply with Li's congestion_wait_check(), too.
If congestion_wait() stays unchanged, we don't need this patch.
Still, maybe I can't understand your comment. Sorry.
--
Kind regards,
Minchan Kim
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 20:31 ` Mel Gorman
@ 2010-08-27 2:12 ` Shaohua Li
-1 siblings, 0 replies; 76+ messages in thread
From: Shaohua Li @ 2010-08-27 2:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Johannes Weiner, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Wu, Fengguang, Jan Kara, linux-kernel
On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour, it would be
> > nice if somebody had an idea.
> >
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.
If this is the case, the queue is already congested. Removing
congestion_wait() entirely might cause a regression, but neither your
change nor congestion_wait_check() should regress, as both check
whether the bdi is congested.
Thanks,
Shaohua
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-27 2:02 ` Minchan Kim
@ 2010-08-27 4:34 ` Wu Fengguang
-1 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2010-08-27 4:34 UTC (permalink / raw)
To: Minchan Kim
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li, Shaohua
On Fri, Aug 27, 2010 at 10:02:52AM +0800, Minchan Kim wrote:
> On Fri, Aug 27, 2010 at 10:50 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Fri, Aug 27, 2010 at 09:41:48AM +0800, Minchan Kim wrote:
> >> Hi, Wu.
> >>
> >> On Fri, Aug 27, 2010 at 10:21 AM, Wu Fengguang <fengguang.wu@intel.com> wrote:
> >> > Minchan,
> >> >
> >> > It's much cleaner to keep the unchanged congestion_wait() and add a
> >> > congestion_wait_check() for converting problematic wait sites. The
> >> > too_many_isolated() wait is merely a protective mechanism, I won't
> >> > bother to improve it at the cost of more code.
> >>
> >> Do you mean something like the following?
> >
> > No, I mean do not change the too_many_isolated() related code at all :)
> > Use congestion_wait_check() only in other places where we can prove
> > there is a problem that is rightly fixed by converting to
> > congestion_wait_check().
>
> I always have trouble understanding your comments.
> Apparently, my eyes have a problem. ;(
> This patch depends on Mel's series.
> If congestion_wait() is changed to just return when there is no
> congestion, the too_many_isolated() loop would hog the CPU. I think
> the same would apply with Li's congestion_wait_check(), too.
> If congestion_wait() stays unchanged, we don't need this patch.
>
> Still, maybe I can't understand your comment. Sorry.
Sorry! The confusion must come from Mel's modified congestion_wait().
My proposal is to _not_ modify congestion_wait(), but to add a separate
congestion_wait_check() which won't sleep 100ms when there is no IO.
In this way, the following chunks become unnecessary.
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -253,7 +253,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
* delay for some time until fewer pages are isolated
*/
while (unlikely(too_many_isolated(zone))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ long timeout = HZ/10;
+ if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(timeout);
+ }
if (fatal_signal_pending(current))
return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3109ff7..f5e3e28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1337,7 +1337,11 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_dirty;
while (unlikely(too_many_isolated(zone, file, sc))) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ long timeout = HZ/10;
+ if (timeout == congestion_wait(BLK_RW_ASYNC, timeout)) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(timeout);
+ }
/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
Thanks,
Fengguang
^ permalink raw reply related [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-26 15:14 ` Mel Gorman
@ 2010-08-27 5:13 ` Dave Chinner
-1 siblings, 0 replies; 76+ messages in thread
From: Dave Chinner @ 2010-08-27 5:13 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> If congestion_wait() is called with no BDIs congested, the caller will
> sleep for the full timeout and this is an unnecessary sleep.
That, I think, is an invalid assumption. congestion_wait is used in
some places as a backoff mechanism that waits for some IO work to be
done, with congestion disappearing being an indication that progress
has been made and so we can retry sooner than the entire timeout.
For example, if _xfs_buf_lookup_pages() fails to allocate page cache
pages for a buffer, it will kick the xfsbufd to writeback dirty
buffers (so they can be freed) and immediately enter
congestion_wait(). If there isn't congestion when we enter
congestion_wait(), we still want to give the xfsbufds a chance to
clean some pages before we retry the allocation for the new buffer.
Removing the congestion_wait() sleep behaviour will effectively
_increase_ memory pressure with XFS on fast disk subsystems because
it now won't backoff between failed allocation attempts...
Perhaps a congestion_wait_iff_congested() variant is needed for the
VM? I can certainly see how it benefits the VM from a latency
perspective, but it is the opposite behaviour that is expected in
other places...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 20:31 ` Mel Gorman
@ 2010-08-27 8:16 ` Johannes Weiner
-1 siblings, 0 replies; 76+ messages in thread
From: Johannes Weiner @ 2010-08-27 8:16 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > If congestion_wait() is called when there is no congestion, the caller
> > > will wait for the full timeout. This can cause unreasonable and
> > > unnecessary stalls. There are a number of potential modifications that
> > > could be made to wake sleepers but this patch measures how serious the
> > > problem is. It keeps count of how many congested BDIs there are. If
> > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > record that the wait was unnecessary.
> >
> > I am not convinced that unnecessary is the right word. On a workload
> > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > the VM regressing both in time and in reclaiming the right pages when
> > simply removing congestion_wait() from the direct reclaim paths (the
> > one in __alloc_pages_slowpath and the other one in
> > do_try_to_free_pages).
> >
> > So just being stupid and waiting for the timeout in direct reclaim
> > while kswapd can make progress seemed to do a better job for that
> > load.
> >
> > I can not exactly pinpoint the reason for that behaviour, it would be
> > nice if somebody had an idea.
> >
>
> There is a possibility that the behaviour in that case was due to flusher
> threads doing the writes rather than direct reclaim queueing pages for IO
> in an inefficient manner. So the stall is stupid but happens to work out
> well because flusher threads get the chance to do work.
The workload was accessing a large sparse file through mmap, so there
wasn't much IO in the first place.
And I experimented on the latest -mmotm where direct reclaim wouldn't
do writeback by itself anymore, but kick the flushers.
> > So personally I think it's a good idea to get an insight on the use of
> > congestion_wait() [patch 1] but I don't agree with changing its
> > behaviour just yet, or judging its usefulness solely on whether it
> > correctly waits for bdi congestion.
> >
>
> Unfortunately, I strongly suspect that some of the desktop stalls seen during
> IO (one of which involved no writes) were due to calling congestion_wait
> and waiting the full timeout where no writes are going on.
Oh, I am in full agreement here! Removing those congestion_wait() calls
as described above showed a reduction in peak latency. The dilemma is
only that it increased the overall walltime of the load.
And the scanning behaviour deteriorated, putting more scanning
pressure on other zones than the unpatched kernel did.
So I think very much that we need a fix. congestion_wait() causes
stalls and relying on random sleeps for the current reclaim behaviour
can not be the solution, at all.
I just don't think we can remove it based on the argument that it
doesn't do what it is supposed to do, when it does other things right
that it is not supposed to do ;-)
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-27 2:12 ` Shaohua Li
@ 2010-08-27 9:20 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-27 9:20 UTC (permalink / raw)
To: Shaohua Li
Cc: Johannes Weiner, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Wu, Fengguang, Jan Kara, linux-kernel
On Fri, Aug 27, 2010 at 10:12:10AM +0800, Shaohua Li wrote:
> On Fri, 2010-08-27 at 04:31 +0800, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > nice if somebody had an idea.
> > >
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> If this is the case, we already have queue congested.
Not necessarily. With the full series applied, we sometimes call
cond_resched(), which shows that congestion_wait() was being called with
no congestion present. We might have some IO on the queue but it's not
congested. Also, there is no guarantee that the congested queue is one we
care about. If we are reclaiming main memory and the congested queue is a
USB stick, we do not necessarily need to stall.
> removing
> congestion_wait() might cause regression but either your change or the
> congestion_wait_check() should not have the regression, as we do check
> if the bdi is congested.
>
What congestion_wait_check()? If there is no congestion and no writes,
congestion is the wrong event to sleep on.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-27 8:16 ` Johannes Weiner
@ 2010-08-27 9:24 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-27 9:24 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > If congestion_wait() is called when there is no congestion, the caller
> > > > will wait for the full timeout. This can cause unreasonable and
> > > > unnecessary stalls. There are a number of potential modifications that
> > > > could be made to wake sleepers but this patch measures how serious the
> > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > record that the wait was unnecessary.
> > >
> > > I am not convinced that unnecessary is the right word. On a workload
> > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > the VM regressing both in time and in reclaiming the right pages when
> > > simply removing congestion_wait() from the direct reclaim paths (the
> > > one in __alloc_pages_slowpath and the other one in
> > > do_try_to_free_pages).
> > >
> > > So just being stupid and waiting for the timeout in direct reclaim
> > > while kswapd can make progress seemed to do a better job for that
> > > load.
> > >
> > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > nice if somebody had an idea.
> > >
> >
> > There is a possibility that the behaviour in that case was due to flusher
> > threads doing the writes rather than direct reclaim queueing pages for IO
> > in an inefficient manner. So the stall is stupid but happens to work out
> > well because flusher threads get the chance to do work.
>
> The workload was accessing a large sparse-file through mmap, so there
> wasn't much IO in the first place.
>
Then waiting on congestion was the totally wrong thing to do. We were
effectively calling sleep(HZ/10) and magically this was helping in some
undefined manner. Do you know *which* caller of congestion_wait() was
the most important to you?
> And I experimented on the latest -mmotm where direct reclaim wouldn't
> do writeback by itself anymore, but kick the flushers.
>
What were the results? I'm preparing a full series incorporating a
number of patches in this area to see how they behave in aggregate.
> > > So personally I think it's a good idea to get an insight on the use of
> > > congestion_wait() [patch 1] but I don't agree with changing its
> > > behaviour just yet, or judging its usefulness solely on whether it
> > > correctly waits for bdi congestion.
> > >
> >
> > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > IO (one of which involved no writes) were due to calling congestion_wait
> > and waiting the full timeout where no writes are going on.
>
> Oh, I am in full agreement here! Removing those congestion_wait() as
> described above showed a reduction in peak latency. The dilemma is
> only that it increased the overall walltime of the load.
>
Do you know why? Leaving in random sleeps hardly seems to be the right
approach.
> And the scanning behaviour deteriorated, as in having increased
> scanning pressure on other zones than the unpatched kernel did.
>
Probably because it was scanning more but not finding what it needed.
There is a condition other than congestion it is having trouble with. In
some respects, I think if we change congestion_wait() as I propose,
we may see a case where CPU usage is higher because it's now
encountering the unspecified reclaim problem we have.
> So I think very much that we need a fix. congestion_wait() causes
> stalls and relying on random sleeps for the current reclaim behaviour
> can not be the solution, at all.
>
> I just don't think we can remove it based on the argument that it
> doesn't do what it is supposed to do, when it does other things right
> that it is not supposed to do ;-)
>
We are not removing it, we are just stopping it going to sleep for
stupid reasons. If we find that wall time is increasing as a result, we
have a path to figuring out what the real underlying problem is instead
of sweeping it under the rug.
congestion_wait() is causing other problems such as Christian's bug of
massive IO regressions because it was sleeping when it shouldn't.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-27 5:13 ` Dave Chinner
@ 2010-08-27 9:33 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-27 9:33 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Johannes Weiner, Wu Fengguang, Jan Kara, linux-kernel
On Fri, Aug 27, 2010 at 03:13:16PM +1000, Dave Chinner wrote:
> On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called with no BDIs congested, the caller will
> > sleep for the full timeout and this is an unnecessary sleep.
>
> That, I think, is an invalid assumption. congestion_wait is used in
> some places as a backoff mechanism that waits for some IO work to be
> done, with congestion disappearing being a indication that progress
> has been made and so we can retry sooner than the entire timeout.
>
As it's write IO rather than some IO, I wonder if that's really the
right thing to do. However, I accept your (and others) point that
converting all congestion_wait() callers may be too much of a change.
> For example, if _xfs_buf_lookup_pages() fails to allocate page cache
> pages for a buffer, it will kick the xfsbufd to writeback dirty
> buffers (so they can be freed) and immediately enter
> congestion_wait(). If there isn't congestion when we enter
> congestion_wait(), we still want to give the xfsbufds a chance to
> clean some pages before we retry the allocation for the new buffer.
> Removing the congestion_wait() sleep behaviour will effectively
> _increase_ memory pressure with XFS on fast disk subsystems because
> it now won't backoff between failed allocation attempts...
>
> Perhaps a congestion_wait_iff_congested() variant is needed for the
> VM? I can certainly see how it benefits the VM from a latency
> perspective, but it is the opposite behaviour that is expected in
> other places...
>
I've added a wait_iff_congested() and updated a few of the VM callers. I
changed a fairly minimal number, only the ones that appeared obvious to change.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-27 1:11 ` Wu Fengguang
@ 2010-08-27 9:34 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-27 9:34 UTC (permalink / raw)
To: Wu Fengguang
Cc: Johannes Weiner, Minchan Kim, linux-mm, linux-fsdevel,
Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel,
Li Shaohua
On Fri, Aug 27, 2010 at 09:11:06AM +0800, Wu Fengguang wrote:
> On Fri, Aug 27, 2010 at 04:23:24AM +0800, Mel Gorman wrote:
> > On Thu, Aug 26, 2010 at 08:17:35PM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > > CPU longer than its quota but otherwise will not sleep.
> > > > > >
> > > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > > progress.
> > > > > >
> > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > > ---
> > > > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > > index a49167f..6abe860 100644
> > > > > > --- a/mm/backing-dev.c
> > > > > > +++ b/mm/backing-dev.c
> > > > >
> > > > > Function's description should be changed since we don't wait for the next write any more.
> > > > >
> > > >
> > > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > > happening but maybe that side of things just broke somewhere in the
> > > > distant past. I lack context of how this is meant to work so maybe
> > > > someone will educate me.
> > >
> > > On every retired io request the congestion state on the bdi is checked
> > > and the congestion waitqueue woken up.
> > >
> > > So without congestion, we still only wait until the next write
> > > retires, but without any IO, we sleep the full timeout.
> > >
> > > Check __freed_requests() in block/blk-core.c.
> > >
> >
> > Seems reasonable. Still, if there is no write IO going on and no
> > congestion there seems to be no point going to sleep for the full
> > timeout. It still feels wrong.
>
> Yeah the stupid sleeping feels wrong. However there are ~20
> congestion_wait() callers spread randomly in VM, FS and block drivers.
> Many of them may be added by rule of thumb, however what if some of
> them happen to depend on the old stupid sleeping behavior? Obviously
> you've done extensive tests on the page reclaim paths, however that's
> far from enough to cover the wider changes made by this patch.
>
> We may have to do the conversions case by case. Converting to
> congestion_wait_check() (see http://lkml.org/lkml/2010/8/18/292) or
> other waiting schemes.
>
I am taking this direction now.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs
2010-08-27 1:42 ` Wu Fengguang
@ 2010-08-27 9:37 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-27 9:37 UTC (permalink / raw)
To: Wu Fengguang
Cc: Johannes Weiner, Minchan Kim, linux-mm, linux-fsdevel,
Andrew Morton, Christian Ehrhardt, Jan Kara, linux-kernel,
Li Shaohua, Rik van Riel
On Fri, Aug 27, 2010 at 09:42:54AM +0800, Wu Fengguang wrote:
> On Fri, Aug 27, 2010 at 02:17:35AM +0800, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 06:42:45PM +0100, Mel Gorman wrote:
> > > On Fri, Aug 27, 2010 at 02:38:43AM +0900, Minchan Kim wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:16PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called with no BDIs congested, the caller will
> > > > > sleep for the full timeout and this is an unnecessary sleep. This patch
> > > > > checks if there are BDIs congested. If so, it goes to sleep as normal.
> > > > > If not, it calls cond_resched() to ensure the caller is not hogging the
> > > > > CPU longer than its quota but otherwise will not sleep.
> > > > >
> > > > > This is aimed at reducing some of the major desktop stalls reported during
> > > > > IO. For example, while kswapd is operating, it calls congestion_wait()
> > > > > but it could just have been reclaiming clean page cache pages with no
> > > > > congestion. Without this patch, it would sleep for a full timeout but after
> > > > > this patch, it'll just call schedule() if it has been on the CPU too long.
> > > > > Similar logic applies to direct reclaimers that are not making enough
> > > > > progress.
> > > > >
> > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > > > ---
> > > > > mm/backing-dev.c | 20 ++++++++++++++------
> > > > > 1 files changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > > > index a49167f..6abe860 100644
> > > > > --- a/mm/backing-dev.c
> > > > > +++ b/mm/backing-dev.c
> > > >
> > > > Function's description should be changed since we don't wait for the next write any more.
> > > >
> > >
> > > My bad. I need to check that "next write" thing. It doesn't appear to be
> > > happening but maybe that side of things just broke somewhere in the
> > > distant past. I lack context of how this is meant to work so maybe
> > > someone will educate me.
> >
> > On every retired io request the congestion state on the bdi is checked
> > and the congestion waitqueue woken up.
> >
> > So without congestion, we still only wait until the next write
> > retires, but without any IO, we sleep the full timeout.
> >
> > Check __freed_requests() in block/blk-core.c.
>
> congestion_wait() is tightly related with pageout() and writeback,
> however it may have some intention for the no-IO case as well.
>
> - if write congested, maybe we are doing too much pageout(), so wait.
> it might also reduce some get_request_wait() stalls (the normal way
> is to explicitly check for congestion before doing write out).
>
> - if any write completes, it may free some PG_reclaim pages, so proceed.
> (when not congested)
>
For these cases, would it make sense for wait_iff_congested() to compare
nr_writeback to nr_inactive and decide to wait on congestion if more
than half the inactive list is in writeback?
> - if no IO at all, the 100ms sleep might still prevent a page reclaimer
> from stealing lots of slices from a busy computing program that
> involves no page allocation at all.
>
I don't think this is a very strong argument because cond_resched() is
being called.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion
2010-08-27 1:21 ` Wu Fengguang
@ 2010-08-27 9:38 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-27 9:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: Minchan Kim, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Johannes Weiner, Jan Kara, linux-kernel,
Rik van Riel, KOSAKI Motohiro, KAMEZAWA Hiroyuki, Li Shaohua
On Fri, Aug 27, 2010 at 09:21:47AM +0800, Wu Fengguang wrote:
> Minchan,
>
> It's much cleaner to keep the unchanged congestion_wait() and add a
> congestion_wait_check() for converting problematic wait sites. The
> too_many_isolated() wait is merely a protective mechanism, I won't
> bother to improve it at the cost of more code.
>
This is what I've done. I dropped the patch again and am using
wait_iff_congested(). I left the too_many_isolated() callers as
congestion_wait().
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-26 18:29 ` Johannes Weiner
@ 2010-08-29 16:03 ` Minchan Kim
-1 siblings, 0 replies; 76+ messages in thread
From: Minchan Kim @ 2010-08-29 16:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: Mel Gorman, linux-mm, linux-fsdevel, Andrew Morton,
Christian Ehrhardt, Wu Fengguang, Jan Kara, linux-kernel
Hi, Hannes.
On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > If congestion_wait() is called when there is no congestion, the caller
> > will wait for the full timeout. This can cause unreasonable and
> > unnecessary stalls. There are a number of potential modifications that
> > could be made to wake sleepers but this patch measures how serious the
> > problem is. It keeps count of how many congested BDIs there are. If
> > congestion_wait() is called with no BDIs congested, the tracepoint will
> > record that the wait was unnecessary.
>
> I am not convinced that unnecessary is the right word. On a workload
> without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> the VM regressing both in time and in reclaiming the right pages when
> simply removing congestion_wait() from the direct reclaim paths (the
> one in __alloc_pages_slowpath and the other one in
> do_try_to_free_pages).
Not exactly the same as your experiment, but I had a similar experience.
I ran an experiment on swapout. The system had lots of anon pages but
almost no file pages and had already started to swap out, meaning the
system had no free memory. In this situation, I forked a new process which
mmapped some MB of pages and touched them, so the VM had to swap out some
MB of pages for the process. I measured the time until the touching of the
pages completed. Sometimes it was fast, sometimes slow; the gap was almost
a factor of two. The interesting thing is that when it was fast, many of
the pages were reclaimed by kswapd.
Ah.. I used swap on a ramdisk and reserved the swap pages by touching them
before starting the experiment, so I would say it's not a _flusher_ effect.
>
> So just being stupid and waiting for the timeout in direct reclaim
> while kswapd can make progress seemed to do a better job for that
> load.
>
> I can not exactly pinpoint the reason for that behaviour, it would be
> nice if somebody had an idea.
I just thought the cause is that direct reclaim only reclaims 32 pages
at a time while kswapd can reclaim many pages in a batch. But I didn't
look at it any more because I've been busy. Does that make sense?
--
Kind regards,
Minchan Kim
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-27 9:24 ` Mel Gorman
@ 2010-08-30 13:19 ` Johannes Weiner
2010-08-31 15:02 ` Mel Gorman
-1 siblings, 1 reply; 76+ messages in thread
From: Johannes Weiner @ 2010-08-30 13:19 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 5340 bytes --]
On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > record that the wait was unnecessary.
> > > >
> > > > I am not convinced that unnecessary is the right word. On a workload
> > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > the VM regressing both in time and in reclaiming the right pages when
> > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > one in __alloc_pages_slowpath and the other one in
> > > > do_try_to_free_pages).
> > > >
> > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > while kswapd can make progress seemed to do a better job for that
> > > > load.
> > > >
> > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > nice if somebody had an idea.
> > > >
> > >
> > > There is a possibility that the behaviour in that case was due to flusher
> > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > in an inefficient manner. So the stall is stupid but happens to work out
> > > well because flusher threads get the chance to do work.
> >
> > The workload was accessing a large sparse-file through mmap, so there
> > wasn't much IO in the first place.
> >
>
> Then waiting on congestion was the totally wrong thing to do. We were
> effectively calling sleep(HZ/10) and magically this was helping in some
> undefined manner. Do you know *which* caller of congestion_wait() was
> the most important to you?
Removing congestion_wait() in do_try_to_free_pages() definitely
worsens reclaim behaviour for this workload:
1. wallclock time of the testrun increases by 11%
2. the scanners do a worse job and go for the wrong zone:
-pgalloc_dma 79597
-pgalloc_dma32 134465902
+pgalloc_dma 297089
+pgalloc_dma32 134247237
-pgsteal_dma 77501
-pgsteal_dma32 133939446
+pgsteal_dma 294998
+pgsteal_dma32 133722312
-pgscan_kswapd_dma 145897
-pgscan_kswapd_dma32 266141381
+pgscan_kswapd_dma 287981
+pgscan_kswapd_dma32 186647637
-pgscan_direct_dma 9666
-pgscan_direct_dma32 1758655
+pgscan_direct_dma 302495
+pgscan_direct_dma32 80947179
-pageoutrun 1768531
-allocstall 614
+pageoutrun 1927451
+allocstall 8566
I attached the full vmstat contents below. Also the test program,
which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
> > > > So personally I think it's a good idea to get an insight on the use of
> > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > correctly waits for bdi congestion.
> > > >
> > >
> > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > IO (one of which involved no writes) were due to calling congestion_wait
> > > and waiting the full timeout where no writes are going on.
> >
> > Oh, I am in full agreement here! Removing those congestion_wait() as
> > described above showed a reduction in peak latency. The dilemma is
> > only that it increased the overall walltime of the load.
> >
>
> Do you know why? Leaving in random sleeps hardly seems to be
> the right approach.
I am still trying to find out what's going wrong.
> > And the scanning behaviour deteriorated, as in having increased
> > scanning pressure on other zones than the unpatched kernel did.
> >
>
> Probably because it was scanning more but not finding what it needed.
> There is a condition other than congestion it is having trouble with. In
> some respects, I think if we change congestion_wait() as I propose,
> we may see a case where CPU usage is higher because it's now
> encountering the unspecified reclaim problem we have.
Exactly.
> > So I think very much that we need a fix. congestion_wait() causes
> > stalls and relying on random sleeps for the current reclaim behaviour
> > can not be the solution, at all.
> >
> > I just don't think we can remove it based on the argument that it
> > doesn't do what it is supposed to do, when it does other things right
> > that it is not supposed to do ;-)
> >
>
> We are not removing it, we are just stopping it going to sleep for
> stupid reasons. If we find that wall time is increasing as a result, we
> have a path to figuring out what the real underlying problem is instead
> of sweeping it under the rug.
Well, for that testcase it is in effect the same as a removal as
there's never congestion.
But again: I agree with your changes per-se, I just don't think they
should get merged as long as they knowingly catalyze a problem that
has yet to be identified.
[-- Attachment #2: mapped-file-stream.c --]
[-- Type: text/plain, Size: 1885 bytes --]
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <stdio.h>
static int start_process(unsigned long nr_bytes)
{
	char filename[] = "/tmp/clog-XXXXXX";
	unsigned long i;
	char *map;
	int fd;

	fd = mkstemp(filename);
	if (fd == -1) {
		perror("mkstemp()");
		return -1;
	}
	/* delete the name now that the fd is open; the file lives on */
	unlink(filename);

	if (ftruncate(fd, nr_bytes)) {
		perror("ftruncate()");
		return -1;
	}

	map = mmap(NULL, nr_bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap()");
		return -1;
	}

	if (madvise(map, nr_bytes, MADV_RANDOM)) {
		perror("madvise()");
		return -1;
	}

	/* wait for the parent to start all children, then fault in
	 * one byte per page */
	kill(getpid(), SIGSTOP);

	for (i = 0; i < nr_bytes; i += 4096)
		((volatile char *)map)[i];

	close(fd);
	return 0;
}

static int do_test(unsigned long nr_procs, unsigned long nr_bytes)
{
	pid_t procs[nr_procs];
	unsigned long i;
	int dummy;

	for (i = 0; i < nr_procs; i++) {
		switch ((procs[i] = fork())) {
		case -1:
			kill(0, SIGKILL);
			perror("fork()");
			return -1;
		case 0:
			return start_process(nr_bytes);
		default:
			/* wait until the child has stopped itself */
			waitpid(procs[i], &dummy, WUNTRACED);
			break;
		}
	}

	/* release all children at once */
	kill(0, SIGCONT);

	for (i = 0; i < nr_procs; i++)
		waitpid(procs[i], &dummy, 0);

	return 0;
}

static int xstrtoul(const char *str, unsigned long *valuep)
{
	unsigned long value;
	char *endp;

	value = strtoul(str, &endp, 0);
	if (*endp || (value == ULONG_MAX && errno == ERANGE))
		return -1;
	*valuep = value;
	return 0;
}

int main(int ac, char **av)
{
	unsigned long nr_procs, nr_bytes;

	if (ac != 3)
		goto usage;
	if (xstrtoul(av[1], &nr_procs))
		goto usage;
	if (xstrtoul(av[2], &nr_bytes))
		goto usage;

	setbuf(stdout, NULL);
	setbuf(stderr, NULL);

	return !!do_test(nr_procs, nr_bytes);
usage:
	fprintf(stderr, "usage: %s nr_procs nr_bytes\n", av[0]);
	return 1;
}
[-- Attachment #3: vmstat.a.2 --]
[-- Type: application/x-troff-man, Size: 1794 bytes --]
[-- Attachment #4: vmstat.b.2 --]
[-- Type: application/x-troff-man, Size: 1803 bytes --]
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-30 13:19 ` Johannes Weiner
@ 2010-08-31 15:02 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-08-31 15:02 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> On Fri, Aug 27, 2010 at 10:24:16AM +0100, Mel Gorman wrote:
> > On Fri, Aug 27, 2010 at 10:16:48AM +0200, Johannes Weiner wrote:
> > > On Thu, Aug 26, 2010 at 09:31:30PM +0100, Mel Gorman wrote:
> > > > On Thu, Aug 26, 2010 at 08:29:04PM +0200, Johannes Weiner wrote:
> > > > > On Thu, Aug 26, 2010 at 04:14:15PM +0100, Mel Gorman wrote:
> > > > > > If congestion_wait() is called when there is no congestion, the caller
> > > > > > will wait for the full timeout. This can cause unreasonable and
> > > > > > unnecessary stalls. There are a number of potential modifications that
> > > > > > could be made to wake sleepers but this patch measures how serious the
> > > > > > problem is. It keeps count of how many congested BDIs there are. If
> > > > > > congestion_wait() is called with no BDIs congested, the tracepoint will
> > > > > > record that the wait was unnecessary.
> > > > >
> > > > > I am not convinced that unnecessary is the right word. On a workload
> > > > > without any IO (i.e. no congestion_wait() necessary, ever), I noticed
> > > > > the VM regressing both in time and in reclaiming the right pages when
> > > > > simply removing congestion_wait() from the direct reclaim paths (the
> > > > > one in __alloc_pages_slowpath and the other one in
> > > > > do_try_to_free_pages).
> > > > >
> > > > > So just being stupid and waiting for the timeout in direct reclaim
> > > > > while kswapd can make progress seemed to do a better job for that
> > > > > load.
> > > > >
> > > > > I can not exactly pinpoint the reason for that behaviour, it would be
> > > > > nice if somebody had an idea.
> > > > >
> > > >
> > > > There is a possibility that the behaviour in that case was due to flusher
> > > > threads doing the writes rather than direct reclaim queueing pages for IO
> > > > in an inefficient manner. So the stall is stupid but happens to work out
> > > > well because flusher threads get the chance to do work.
> > >
> > > The workload was accessing a large sparse-file through mmap, so there
> > > wasn't much IO in the first place.
> > >
> >
> > Then waiting on congestion was the totally wrong thing to do. We were
> > effectively calling sleep(HZ/10) and magically this was helping in some
> > undefined manner. Do you know *which* caller of congestion_wait() was
> > the most important to you?
>
> Removing congestion_wait() in do_try_to_free_pages() definitely
> worsens reclaim behaviour for this workload:
>
> 1. wallclock time of the testrun increases by 11%
>
> 2. the scanners do a worse job and go for the wrong zone:
>
> -pgalloc_dma 79597
> -pgalloc_dma32 134465902
> +pgalloc_dma 297089
> +pgalloc_dma32 134247237
>
> -pgsteal_dma 77501
> -pgsteal_dma32 133939446
> +pgsteal_dma 294998
> +pgsteal_dma32 133722312
>
> -pgscan_kswapd_dma 145897
> -pgscan_kswapd_dma32 266141381
> +pgscan_kswapd_dma 287981
> +pgscan_kswapd_dma32 186647637
>
> -pgscan_direct_dma 9666
> -pgscan_direct_dma32 1758655
> +pgscan_direct_dma 302495
> +pgscan_direct_dma32 80947179
>
> -pageoutrun 1768531
> -allocstall 614
> +pageoutrun 1927451
> +allocstall 8566
>
> I attached the full vmstat contents below. Also the test program,
> which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>
Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is
interesting is your workload.
> > > > > So personally I think it's a good idea to get an insight on the use of
> > > > > congestion_wait() [patch 1] but I don't agree with changing its
> > > > > behaviour just yet, or judging its usefulness solely on whether it
> > > > > correctly waits for bdi congestion.
> > > > >
> > > >
> > > > Unfortunately, I strongly suspect that some of the desktop stalls seen during
> > > > IO (one of which involved no writes) were due to calling congestion_wait
> > > > and waiting the full timeout where no writes are going on.
> > >
> > > Oh, I am in full agreement here! Removing those congestion_wait() as
> > > described above showed a reduction in peak latency. The dilemma is
> > > only that it increased the overall walltime of the load.
> > >
> >
> > Do you know why? Leaving in random sleeps hardly seems to be
> > the right approach.
>
> I am still trying to find out what's going wrong.
>
> > > And the scanning behaviour deteriorated, as in having increased
> > > scanning pressure on other zones than the unpatched kernel did.
> > >
> >
> > Probably because it was scanning more but not finding what it needed.
> > There is a condition other than congestion it is having trouble with. In
> > some respects, I think if we change congestion_wait() as I propose,
> > we may see a case where CPU usage is higher because it's now
> > encountering the unspecified reclaim problem we have.
>
> Exactly.
>
> > > So I think very much that we need a fix. congestion_wait() causes
> > > stalls and relying on random sleeps for the current reclaim behaviour
> > > can not be the solution, at all.
> > >
> > > I just don't think we can remove it based on the argument that it
> > > doesn't do what it is supposed to do, when it does other things right
> > > that it is not supposed to do ;-)
> > >
> >
> > We are not removing it, we are just stopping it going to sleep for
> > stupid reasons. If we find that wall time is increasing as a result, we
> > have a path to figuring out what the real underlying problem is instead
> > of sweeping it under the rug.
>
> Well, for that testcase it is in effect the same as a removal as
> there's never congestion.
>
> But again: I agree with your changes per-se, I just don't think they
> should get merged as long as they knowingly catalyze a problem that
> has yet to be identified.
Ok, well there was some significant feedback on why wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback. I
also integrated your workload into my testsuite (btw, because there is no
license, the script has to download it from a Google archive. I might get
back to you about licensing this so it can be made a permanent part of the suite).
These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of different series posted recently. The nocongest column is an
unreleased patch that has congestion_wait() and wait_iff_congested() that
only goes to sleep if there is real congestion or a lot of writeback going
on. Rather than worrying about the patch contents for now, lets consider
the results for just your workload.
The report is in 4 parts. The first is the vmstat counter differences as
a result of running your test; whether these are good or bad is open to
interpretation. The second part is based on the vmscan tracepoints. The
third part is based on the congestion tracepoints and the final part
reports CPU usage and elapsed time.
MICRO
traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%)
pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%)
pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%)
pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%)
allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%)
So, the allocstall count goes up of course because it is incremented
every time direct reclaim is entered and nocongest only goes to
sleep when there is congestion or significant writeback. I don't see
this as being necessarily bad.
Direct scanning rates go up a bit as you'd expect - again because we are
sleeping less. It's interesting that the number of pages reclaimed is
reduced, implying that despite higher scanning rates there is less reclaim
activity. It's debatable whether this is good or not because higher
scanning rates are not bad in themselves, but fewer pages reclaimed seems
positive, so let's see what the rest of the reports look like.
FTrace Reclaim Statistics: vmscan
traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
Direct reclaims 2722 70156 67554 87691
Direct reclaim pages scanned 23955333 40161426 40034132 45080524
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 0 0 0
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 2718040 17801688 17622777 18379572
Kswapd wakeups 24 1 1 1
Kswapd pages scanned 177738381 161597313 161729224 156704078
Kswapd reclaim write file async I/O 0 0 0 0
Kswapd reclaim write anon async I/O 0 0 0 0
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63
Time kswapd awake (seconds) 489.17 400.20 403.19 390.08
Total pages scanned 201693714 201758739 201763356 201784602
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32%
Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87%
Interestingly, kswapd is now staying awake (it woke up only once) even
though its total time awake was reduced; it looks like the far higher
number of wakeup requests is what keeps it awake. Despite the higher scan
rates from direct reclaim, the time actually spent direct reclaiming is
significantly reduced.
Scanning rates and the number of times we enter direct reclaim go up,
but as we finish the work a lot faster, it would seem that we are doing
less work overall.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 3664 0 0 0
Direct time congest waited 247636ms 0ms 0ms 0ms
Direct full congest waited 3081 0 0 0
Direct number conditional waited 0 47587 45659 58779
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 3081 0 0 0
KSwapd number congest waited 1448 949 909 981
KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms
KSwapd full congest waited 1056 90 115 147
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 1056 90 115 147
congest waited is congestion_wait() and conditional waited is
wait_iff_congested(). Look at what happens to the congest wait times
for direct reclaim - they disappeared, and despite the number of times
wait_iff_congested() was called, it never actually decided it needed to
sleep. kswapd is still waiting on congestion but the time it spent is
reduced.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02
Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53
This is plain old time. The same test completes 91 seconds faster.
Ordinarily at this point I would be preparing to do a full series report
including the other benchmarks but I'm interested in seeing if there is
a significantly different reading of the above report as to whether it
is a "good" or "bad" result?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-08-31 15:02 ` Mel Gorman
@ 2010-09-02 15:49 ` Johannes Weiner
-1 siblings, 0 replies; 76+ messages in thread
From: Johannes Weiner @ 2010-09-02 15:49 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > Removing congestion_wait() in do_try_to_free_pages() definitely
> > worsens reclaim behaviour for this workload:
> >
> > 1. wallclock time of the testrun increases by 11%
> >
> > 2. the scanners do a worse job and go for the wrong zone:
> >
> > -pgalloc_dma 79597
> > -pgalloc_dma32 134465902
> > +pgalloc_dma 297089
> > +pgalloc_dma32 134247237
> >
> > -pgsteal_dma 77501
> > -pgsteal_dma32 133939446
> > +pgsteal_dma 294998
> > +pgsteal_dma32 133722312
> >
> > -pgscan_kswapd_dma 145897
> > -pgscan_kswapd_dma32 266141381
> > +pgscan_kswapd_dma 287981
> > +pgscan_kswapd_dma32 186647637
> >
> > -pgscan_direct_dma 9666
> > -pgscan_direct_dma32 1758655
> > +pgscan_direct_dma 302495
> > +pgscan_direct_dma32 80947179
> >
> > -pageoutrun 1768531
> > -allocstall 614
> > +pageoutrun 1927451
> > +allocstall 8566
> >
> > I attached the full vmstat contents below. Also the test program,
> > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>
> Excellent stuff. I didn't look at your vmstat output because it was for
> an old patch and you have already highlighted the problems related to
> the workload. Chances are, I'd just reach the same conclusions. What is
> interesting is your workload.
[...]
> Ok, well there was some significant feedback on why wholesale changing of
> congestion_wait() reached too far and I've incorporated that feedback. I
> also integrated your workload into my testsuite (btw, because there is no
> license the script has to download it from a google archive. I might get
> back to you on licensing this so it can be made a permanent part of the suite).
Oh, certainly, feel free to add the following file header:
/*
* Copyright (c) 2010 Johannes Weiner
* Code released under the GNU GPLv2.
*/
> These are the results just for your workload on the only machine I had
> available with a lot of disk. There are a bunch of kernels because I'm testing
> a superset of different series posted recently. The nocongest column is an
> unreleased patch that has congestion_wait() and wait_iff_congested() that
> only goes to sleep if there is real congestion or a lot of writeback going
> on. Rather than worrying about the patch contents for now, lets consider
> the results for just your workload.
>
> The report is in 4 parts. The first is the vmstat counter differences as
> a result of running your test. The exact interpretation of good and bad
> here is open to interpretation. The second part is based on the vmscan
> tracepoints. The third part is based on the congestion tracepoints and
> the final part reports CPU usage and elapsed time.
>
> MICRO
> traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
> pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
> pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%)
> pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%)
> pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%)
> pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%)
> pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%)
> pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%)
> pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%)
> allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%)
>
>
> So, the allocstall counts go up of course because it is incremented
> every time direct reclaim is entered and nocongest is only going to
> sleep when there is congestion or significant writeback. I don't see
> this as being necessarily bad.
Agreed. Also, the dma zone is allocated from less often; I suppose it is
only the second zone in the zonelist, after dma32, so allocations
succeed more often from the first-choice zone with your patches.
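The fallback behaviour behind that observation can be sketched as follows
(a simplified model, not the allocator: watermarks, migratetypes and
zone protection are all ignored): the allocator walks the zonelist in
order and only falls back to the next zone when the preferred one cannot
satisfy the request, so fewer pgalloc_dma events means more allocations
were satisfied from the first-choice dma32 zone.

```python
# Minimal zonelist-fallback sketch. On the test machine the zonelist is
# effectively ["dma32", "dma"], with no populated Normal zone.

def alloc_page(zonelist, free_pages):
    for zone in zonelist:
        if free_pages[zone] > 0:
            free_pages[zone] -= 1
            return zone        # vmstat would bump pgalloc_<zone> here
    return None                # allocation fails; direct reclaim (allocstall)

free = {"dma32": 1, "dma": 1}
assert alloc_page(["dma32", "dma"], free) == "dma32"  # first choice
assert alloc_page(["dma32", "dma"], free) == "dma"    # fallback only when dma32 is exhausted
```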
> Direct scanning rates go up a bit as you'd expect - again because we
> are sleeping less. It's interesting that the pages reclaimed is
> reduced implying that despite higher scanning rates, there is less
> reclaim activity.
>
> It's debatable if this is good or not because higher scanning rates in
> themselves are not bad but fewer pages reclaimed seems positive so lets
> see what the rest of the reports look like.
>
> FTrace Reclaim Statistics: vmscan
> micro-traceonly-v1r4-micromicro-nocongest-v1r4-micromicro-lowlumpy-v1r4-micromicro-nodirect-v1r4-micro
> traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> Direct reclaims 2722 70156 67554 87691
> Direct reclaim pages scanned 23955333 40161426 40034132 45080524
> Direct reclaim write file async I/O 0 0 0 0
> Direct reclaim write anon async I/O 0 0 0 0
> Direct reclaim write file sync I/O 0 0 0 0
> Direct reclaim write anon sync I/O 0 0 0 0
> Wake kswapd requests 2718040 17801688 17622777 18379572
> Kswapd wakeups 24 1 1 1
> Kswapd pages scanned 177738381 161597313 161729224 156704078
> Kswapd reclaim write file async I/O 0 0 0 0
> Kswapd reclaim write anon async I/O 0 0 0 0
> Kswapd reclaim write file sync I/O 0 0 0 0
> Kswapd reclaim write anon sync I/O 0 0 0 0
> Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63
> Time kswapd awake (seconds) 489.17 400.20 403.19 390.08
>
> Total pages scanned 201693714 201758739 201763356 201784602
> %age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
> %age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
> Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32%
> Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87%
>
> Interesting, kswapd is now staying awake (woke up only once) even though
> the total time awake was reduced and it looks like because it was requested
> to wake up a lot more that was keeping it awake. Despite the higher scan
> rates from direct reclaim, the time actually spent direct reclaiming is
> significantly reduced.
>
> Scanning rates and times we direct reclaim go up but as we finish work a
> lot faster, it would seem that we are doing less work overall.
I do not reach the same conclusion here. More pages are scanned
overall on the same workload, so we _are_ doing more work.
The result for this single-threaded workload improves because CPU-time
is not the issue when the only runnable process needs memory.
But we are in fact becoming less efficient at reclaim, so it would
make sense to also test how this interacts with other processes that
do need the CPU concurrently.
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited 3664 0 0 0
> Direct time congest waited 247636ms 0ms 0ms 0ms
> Direct full congest waited 3081 0 0 0
> Direct number conditional waited 0 47587 45659 58779
> Direct time conditional waited 0ms 0ms 0ms 0ms
> Direct full conditional waited 3081 0 0 0
> KSwapd number congest waited 1448 949 909 981
> KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms
> KSwapd full congest waited 1056 90 115 147
> KSwapd number conditional waited 0 0 0 0
> KSwapd time conditional waited 0ms 0ms 0ms 0ms
> KSwapd full conditional waited 1056 90 115 147
>
> congest waited is congestion_wait() and conditional waited is
> wait_iff_congested(). Look at what happens to the congest waited times
> for direct reclaim - it disappeared and despite the number of times
> wait_iff_congested() was called, it never actually decided it needed to
> sleep. kswapd is still congestion waiting but the time it spent is
> reduced.
>
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02
> Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53
>
> This is plain old time. The same test completes 91 seconds faster.
> Ordinarily at this point I would be preparing to do a full series report
> including the other benchmarks but I'm interested in seeing if there is
> a significantly different reading of the above report as to whether it
> is a "good" or "bad" result?
I think one interesting piece that is missing is whether the
scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
value still available to calculate that ratio?
A "good" result would be, IMO, if that ratio did not get worse, while
at the same time having reclaim perform better due to reduced sleeps.
Another aspect to look out for is increased overreclaim: the total
number of allocations went up (I suppose the sum of reclaimed pages as
well), which means reclaim became more eager and created more
throwout-refault churn. Those were refaults from a sparse file, but a
slow backing device will have a bigger impact on wall clock time.
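The overall scanned-to-reclaimed ratio can already be approximated from the
figures posted above, even without the kswapd_steal counter (so this is the
combined kswapd-plus-direct ratio only, not the per-path split Johannes
asks for). The numbers below are copied from the vmstat and vmscan tables
earlier in the thread.

```python
# Rough reclaim-efficiency check from the posted report:
# (pgsteal_dma + pgsteal_dma32) / total pages scanned.
report = {
    # kernel:     (pgsteal_dma, pgsteal_dma32, total_pages_scanned)
    "traceonly": (74529, 100666955, 201693714),
    "nocongest": (43386, 100712596, 201758739),
}

for kernel, (steal_dma, steal_dma32, scanned) in report.items():
    efficiency = (steal_dma + steal_dma32) / scanned
    print(f"{kernel}: {efficiency:.2%} of scanned pages reclaimed")
```

Both kernels come out just under 50%, so the combined ratio barely moves;
whether the kswapd/direct split shifted would still need the steal counters
broken out per path.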
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
@ 2010-09-02 15:49 ` Johannes Weiner
0 siblings, 0 replies; 76+ messages in thread
From: Johannes Weiner @ 2010-09-02 15:49 UTC (permalink / raw)
To: Mel Gorman
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > Removing congestion_wait() in do_try_to_free_pages() definitely
> > worsens reclaim behaviour for this workload:
> >
> > 1. wallclock time of the testrun increases by 11%
> >
> > 2. the scanners do a worse job and go for the wrong zone:
> >
> > -pgalloc_dma 79597
> > -pgalloc_dma32 134465902
> > +pgalloc_dma 297089
> > +pgalloc_dma32 134247237
> >
> > -pgsteal_dma 77501
> > -pgsteal_dma32 133939446
> > +pgsteal_dma 294998
> > +pgsteal_dma32 133722312
> >
> > -pgscan_kswapd_dma 145897
> > -pgscan_kswapd_dma32 266141381
> > +pgscan_kswapd_dma 287981
> > +pgscan_kswapd_dma32 186647637
> >
> > -pgscan_direct_dma 9666
> > -pgscan_direct_dma32 1758655
> > +pgscan_direct_dma 302495
> > +pgscan_direct_dma32 80947179
> >
> > -pageoutrun 1768531
> > -allocstall 614
> > +pageoutrun 1927451
> > +allocstall 8566
> >
> > I attached the full vmstat contents below. Also the test program,
> > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
>
> Excellent stuff. I didn't look at your vmstat output because it was for
> an old patch and you have already highlighted the problems related to
> the workload. Chances are, I'd just reach the same conclusions. What is
> interesting is your workload.
[...]
> Ok, well there was some significant feedback on why wholesale changing of
> congestion_wait() reached too far and I've incorporated that feedback. I
> also integrated your workload into my testsuite (btw, because there is no
> license the script has to download it from a google archive. I might get
> back to you on licensing this so it can be made a permanent part of the suite).
Oh, certainly, feel free to add the following file header:
/*
* Copyright (c) 2010 Johannes Weiner
* Code released under the GNU GPLv2.
*/
> These are the results just for your workload on the only machine I had
> available with a lot of disk. There are a bunch of kernels because I'm testing
> a superset of different series posted recently. The nocongest column is an
> unreleased patch that has congestion_wait() and wait_iff_congested() that
> only goes to sleep if there is real congestion or a lot of writeback going
> on. Rather than worrying about the patch contents for now, lets consider
> the results for just your workload.
>
> The report is in 4 parts. The first is the vmstat counter differences as
> a result of running your test. The exact interpretation of good and bad
> here is open to interpretation. The second part is based on the vmscan
> tracepoints. The third part is based on the congestion tracepoints and
> the final part reports CPU usage and elapsed time.
>
> MICRO
> traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
> pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
> pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%)
> pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%)
> pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%)
> pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%)
> pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%)
> pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%)
> pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%)
> allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%)
>
>
> So, the allocstall counts go up of course because it is incremented
> every time direct reclaim is entered and nocongest is only going to
> sleep when there is congestion or significant writeback. I don't see
> this as being nceessarily bad.
Agreed. Also the dma zone is less allocated from, which I suppose is
only the second zone in the zonelist, after dma32. So allocations
succeed more often from the first-choice zone with your patches.
> Direct scanning rates go up a bit as you'd expect - again because we
> are sleeping less. It's interesting that the pages reclaimed is
> reduced implying that despite higher scanning rates, there is less
> reclaim activity.
>
> It's debatable if this is good or not because higher scanning rates in
> themselves are not bad but fewer pages reclaimed seems positive so lets
> see what the rest of the reports look like.
>
> FTrace Reclaim Statistics: vmscan
> micro-traceonly-v1r4-micromicro-nocongest-v1r4-micromicro-lowlumpy-v1r4-micromicro-nodirect-v1r4-micro
> traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> Direct reclaims 2722 70156 67554 87691
> Direct reclaim pages scanned 23955333 40161426 40034132 45080524
> Direct reclaim write file async I/O 0 0 0 0
> Direct reclaim write anon async I/O 0 0 0 0
> Direct reclaim write file sync I/O 0 0 0 0
> Direct reclaim write anon sync I/O 0 0 0 0
> Wake kswapd requests 2718040 17801688 17622777 18379572
> Kswapd wakeups 24 1 1 1
> Kswapd pages scanned 177738381 161597313 161729224 156704078
> Kswapd reclaim write file async I/O 0 0 0 0
> Kswapd reclaim write anon async I/O 0 0 0 0
> Kswapd reclaim write file sync I/O 0 0 0 0
> Kswapd reclaim write anon sync I/O 0 0 0 0
> Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63
> Time kswapd awake (seconds) 489.17 400.20 403.19 390.08
>
> Total pages scanned 201693714 201758739 201763356 201784602
> %age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
> %age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
> Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32%
> Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87%
>
> Interesting, kswapd is now staying awake (woke up only once) even though
> the total time awake was reduced and it looks like because it was requested
> to wake up a lot more that was keeping it awake. Despite the higher scan
> rates from direct reclaim, the time actually spent direct reclaiming is
> significantly reduced.
>
> Scanning rates and times we direct reclaim go up but as we finish work a
> lot faster, it would seem that we are doing less work overall.
I do not reach the same conclusion here. More pages are scanned
overall on the same workload, so we _are_ doing more work.
The result for this single-threaded workload improves because CPU-time
is not the issue when the only runnable process needs memory.
But we are in fact becoming less efficient at reclaim, so it would
make sense to also test how this interacts with other processes that
do need the CPU concurrently.
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited 3664 0 0 0
> Direct time congest waited 247636ms 0ms 0ms 0ms
> Direct full congest waited 3081 0 0 0
> Direct number conditional waited 0 47587 45659 58779
> Direct time conditional waited 0ms 0ms 0ms 0ms
> Direct full conditional waited 3081 0 0 0
> KSwapd number congest waited 1448 949 909 981
> KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms
> KSwapd full congest waited 1056 90 115 147
> KSwapd number conditional waited 0 0 0 0
> KSwapd time conditional waited 0ms 0ms 0ms 0ms
> KSwapd full conditional waited 1056 90 115 147
>
> congest waited is congestion_wait() and conditional waited is
> wait_iff_congested(). Look at what happens to the congest waited times
> for direct reclaim - it disappeared and despite the number of times
> wait_iff_congested() was called, it never actually decided it needed to
> sleep. kswapd is still congestion waiting but the time it spent is
> reduced.
>
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02
> Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53
>
> This is plain old time. The same test completes 91 seconds faster.
> Ordinarily at this point I would be preparing to do a full series report
> including the other benchmarks but I'm interested in seeing if there is
> a significantly different reading of the above report as to whether it
> is a "good" or "bad" result?
I think one interesting piece that is missing is whether the
scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
value still available to calculate that ratio?
A "good" result would be, IMO, if that ratio did not get worse, while
at the same time having reclaim perform better due to reduced sleeps.
Another aspect to look out for is increased overreclaim: the total
number of allocations went up (I suppose the sum of reclaimed pages as
well), which means reclaim became more eager and created more
throwout-refault churn. Those were refaults from a sparse-file, but a
slow backing dev will have more impact on wall clock time.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH 2/3] writeback: Record if the congestion was unnecessary
2010-09-02 15:49 ` Johannes Weiner
@ 2010-09-02 18:28 ` Mel Gorman
-1 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2010-09-02 18:28 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-mm, linux-fsdevel, Andrew Morton, Christian Ehrhardt,
Wu Fengguang, Jan Kara, linux-kernel
On Thu, Sep 02, 2010 at 05:49:54PM +0200, Johannes Weiner wrote:
> On Tue, Aug 31, 2010 at 04:02:07PM +0100, Mel Gorman wrote:
> > On Mon, Aug 30, 2010 at 03:19:29PM +0200, Johannes Weiner wrote:
> > > Removing congestion_wait() in do_try_to_free_pages() definitely
> > > worsens reclaim behaviour for this workload:
> > >
> > > 1. wallclock time of the testrun increases by 11%
> > >
> > > 2. the scanners do a worse job and go for the wrong zone:
> > >
> > > -pgalloc_dma 79597
> > > -pgalloc_dma32 134465902
> > > +pgalloc_dma 297089
> > > +pgalloc_dma32 134247237
> > >
> > > -pgsteal_dma 77501
> > > -pgsteal_dma32 133939446
> > > +pgsteal_dma 294998
> > > +pgsteal_dma32 133722312
> > >
> > > -pgscan_kswapd_dma 145897
> > > -pgscan_kswapd_dma32 266141381
> > > +pgscan_kswapd_dma 287981
> > > +pgscan_kswapd_dma32 186647637
> > >
> > > -pgscan_direct_dma 9666
> > > -pgscan_direct_dma32 1758655
> > > +pgscan_direct_dma 302495
> > > +pgscan_direct_dma32 80947179
> > >
> > > -pageoutrun 1768531
> > > -allocstall 614
> > > +pageoutrun 1927451
> > > +allocstall 8566
> > >
> > > I attached the full vmstat contents below. Also the test program,
> > > which I ran in this case as: ./mapped-file-stream 1 $((512 << 30))
> >
> > Excellent stuff. I didn't look at your vmstat output because it was for
> > an old patch and you have already highlighted the problems related to
> > the workload. Chances are, I'd just reach the same conclusions. What is
> > interesting is your workload.
>
> [...]
>
> > Ok, well there was some significant feedback on why wholesale changing of
> > congestion_wait() reached too far and I've incorporated that feedback. I
> > also integrated your workload into my testsuite (btw, because there is no
> > license the script has to download it from a google archive. I might get
> > back to you on licensing this so it can be made a permanent part of the suite).
>
> Oh, certainly, feel free to add the following file header:
>
> /*
> * Copyright (c) 2010 Johannes Weiner
> * Code released under the GNU GPLv2.
> */
>
Super, thanks. Downloading off a mail archive is a bit contorted :)
> > These are the results just for your workload on the only machine I had
> > available with a lot of disk. There are a bunch of kernels because I'm testing
> > a superset of different series posted recently. The nocongest column is an
> > unreleased patch that has congestion_wait() and wait_iff_congested() that
> > only goes to sleep if there is real congestion or a lot of writeback going
> > on. Rather than worrying about the patch contents for now, let's consider
> > the results for just your workload.
> >
> > The report is in 4 parts. The first is the vmstat counter differences as
> > a result of running your test. The exact interpretation of good and bad
> > here is open to interpretation. The second part is based on the vmscan
> > tracepoints. The third part is based on the congestion tracepoints and
> > the final part reports CPU usage and elapsed time.
> >
> > MICRO
> > traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> > pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
> > pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
> > pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) 42691.00 ( -74.58%)
> > pgsteal_dma32 100666955.00 ( 0.00%) 100712596.00 ( 0.05%) 100712537.00 ( 0.05%) 100713305.00 ( 0.05%)
> > pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pgscan_kswapd_dma 118198.00 ( 0.00%) 47370.00 (-149.52%) 49515.00 (-138.71%) 46134.00 (-156.21%)
> > pgscan_kswapd_dma32 177619794.00 ( 0.00%) 161549938.00 ( -9.95%) 161679701.00 ( -9.86%) 156657926.00 ( -13.38%)
> > pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pgscan_direct_dma 27128.00 ( 0.00%) 39215.00 ( 30.82%) 36561.00 ( 25.80%) 38806.00 ( 30.09%)
> > pgscan_direct_dma32 23927492.00 ( 0.00%) 40122173.00 ( 40.36%) 39997463.00 ( 40.18%) 45041626.00 ( 46.88%)
> > pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> > pageoutrun 756020.00 ( 0.00%) 903192.00 ( 16.29%) 899965.00 ( 15.99%) 868055.00 ( 12.91%)
> > allocstall 2722.00 ( 0.00%) 70156.00 ( 96.12%) 67554.00 ( 95.97%) 87691.00 ( 96.90%)
> >
> >
> > So, the allocstall counts go up of course because it is incremented
> > every time direct reclaim is entered and nocongest is only going to
> > sleep when there is congestion or significant writeback. I don't see
> > this as being necessarily bad.
>
> Agreed. Also the dma zone is less allocated from, which I suppose is
> only the second zone in the zonelist, after dma32. So allocations
> succeed more often from the first-choice zone with your patches.
>
Which is sort of good.
> > Direct scanning rates go up a bit as you'd expect - again because we
> > are sleeping less. It's interesting that the pages reclaimed is
> > reduced implying that despite higher scanning rates, there is less
> > reclaim activity.
> >
> > It's debatable if this is good or not because higher scanning rates in
> > themselves are not bad but fewer pages reclaimed seems positive so let's
> > see what the rest of the reports look like.
> >
> > FTrace Reclaim Statistics: vmscan
> > micro-traceonly-v1r4-micro micro-nocongest-v1r4-micro micro-lowlumpy-v1r4-micro micro-nodirect-v1r4-micro
> > traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
> > Direct reclaims 2722 70156 67554 87691
> > Direct reclaim pages scanned 23955333 40161426 40034132 45080524
> > Direct reclaim write file async I/O 0 0 0 0
> > Direct reclaim write anon async I/O 0 0 0 0
> > Direct reclaim write file sync I/O 0 0 0 0
> > Direct reclaim write anon sync I/O 0 0 0 0
> > Wake kswapd requests 2718040 17801688 17622777 18379572
> > Kswapd wakeups 24 1 1 1
> > Kswapd pages scanned 177738381 161597313 161729224 156704078
> > Kswapd reclaim write file async I/O 0 0 0 0
> > Kswapd reclaim write anon async I/O 0 0 0 0
> > Kswapd reclaim write file sync I/O 0 0 0 0
> > Kswapd reclaim write anon sync I/O 0 0 0 0
> > Time stalled direct reclaim (seconds) 247.97 76.97 77.15 76.63
> > Time kswapd awake (seconds) 489.17 400.20 403.19 390.08
> >
> > Total pages scanned 201693714 201758739 201763356 201784602
> > %age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
> > %age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
> > Percentage Time Spent Direct Reclaim 41.41% 16.03% 15.96% 16.32%
> > Percentage Time kswapd Awake 98.76% 98.94% 98.96% 98.87%
> >
> > Interesting, kswapd is now staying awake (woke up only once) even though
> > the total time awake was reduced and it looks like because it was requested
> > to wake up a lot more that was keeping it awake. Despite the higher scan
> > rates from direct reclaim, the time actually spent direct reclaiming is
> > significantly reduced.
> >
> > Scanning rates and the number of times we enter direct reclaim go up, but
> > as we finish work a lot faster, it would seem that we are doing less work overall.
>
> I do not reach the same conclusion here. More pages are scanned
> overall on the same workload, so we _are_ doing more work.
>
We did more scanning but we finished sooner so the processor would sleep
sooner, consume less power etc. How good or bad this is depends on what
we mean by work but I am of the opinion that finishing sooner is better
overall. That said...
> The result for this single-threaded workload improves because CPU-time
> is not the issue when the only runnable process needs memory.
>
> But we are in fact becoming less efficient at reclaim, so it would
> make sense to also test how this interacts with other processes that
> do need the CPU concurrently.
>
This is indeed a problem. Measuring multiple workloads and drawing
sensible conclusions is going to be a real pain. I'd at least try
running multiple instances of your workload first and seeing how it
interacts.
> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest waited 3664 0 0 0
> > Direct time congest waited 247636ms 0ms 0ms 0ms
> > Direct full congest waited 3081 0 0 0
> > Direct number conditional waited 0 47587 45659 58779
> > Direct time conditional waited 0ms 0ms 0ms 0ms
> > Direct full conditional waited 3081 0 0 0
> > KSwapd number congest waited 1448 949 909 981
> > KSwapd time congest waited 118552ms 31652ms 32780ms 38732ms
> > KSwapd full congest waited 1056 90 115 147
> > KSwapd number conditional waited 0 0 0 0
> > KSwapd time conditional waited 0ms 0ms 0ms 0ms
> > KSwapd full conditional waited 1056 90 115 147
> >
> > congest waited is congestion_wait() and conditional waited is
> > wait_iff_congested(). Look at what happens to the congest waited times
> > for direct reclaim - it disappeared and despite the number of times
> > wait_iff_congested() was called, it never actually decided it needed to
> > sleep. kswapd is still congestion waiting but the time it spent is
> > reduced.
> >
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds) 350.9 403.27 406.12 393.02
> > Total Elapsed Time (seconds) 495.29 404.47 407.44 394.53
> >
> > This is plain old time. The same test completes 91 seconds faster.
> > Ordinarily at this point I would be preparing to do a full series report
> > including the other benchmarks but I'm interested in seeing if there is
> > a significantly different reading of the above report as to whether it
> > is a "good" or "bad" result?
>
> I think one interesting piece that is missing is whether the
> scanned/reclaimed ratio went up. Do you have the kswapd_steal counter
> value still available to calculate that ratio?
>
Knowing the scanning/reclaimed ratio would be nice. The tracepoints as-is
are oriented around the scanning/write ratio and assume that discarding clean
pages is not that big a deal. I did capture vmstat before and after the
test, which is not as fine-grained as the tracepoints, but it will do:
Vanilla scan: 203050870
Vanilla steal: 102099263
Nocongest scan: 203117496
Nocongest steal: 102114301
Vanilla ratio: 0.5028
Nocongest ratio: 0.5027
So here is an interesting point. While direct scanning rates went up,
overall scanning rates were approximately the same. All that really changed
was who was doing the scanning, which is a timing issue. The
reclaim-to-scanning ratios were *very* close together for the vanilla and
nocongest kernels, but nocongest completed far faster.
> A "good" result would be, IMO, if that ratio did not get worse, while
> at the same time having reclaim perform better due to reduced sleeps.
>
So, the ratio did not get worse. The scanning rates were the same, although
who was scanning changed, and I'm undecided as to whether that is significant
or not. Overall, the test completed faster, which is good.
> Another aspect to look out for is increased overreclaim: the total
> number of allocations went up (I suppose the sum of reclaimed pages as
> well), which means reclaim became more eager and created more
> throwout-refault churn. Those were refaults from a sparse-file, but a
> slow backing dev will have more impact on wall clock time.
>
I can capture the reclaim rates easily enough - it just needs another trace
point to keep everything in step. Refaults would be trickier, so I would
tend to rely on wall time to tell me whether the wrong reclaim
decisions were being made.
--
Mel Gorman
Part-time PhD Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
end of thread, other threads:[~2010-09-02 18:29 UTC | newest]
Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-26 15:14 [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Mel Gorman
2010-08-26 15:14 ` Mel Gorman
2010-08-26 15:14 ` [PATCH 1/3] writeback: Account for time spent congestion_waited Mel Gorman
2010-08-26 15:14 ` Mel Gorman
2010-08-26 17:23 ` Minchan Kim
2010-08-26 17:23 ` Minchan Kim
2010-08-26 18:10 ` Johannes Weiner
2010-08-26 18:10 ` Johannes Weiner
2010-08-26 15:14 ` [PATCH 2/3] writeback: Record if the congestion was unnecessary Mel Gorman
2010-08-26 15:14 ` Mel Gorman
2010-08-26 17:35 ` Minchan Kim
2010-08-26 17:35 ` Minchan Kim
2010-08-26 17:41 ` Mel Gorman
2010-08-26 17:41 ` Mel Gorman
2010-08-26 18:29 ` Johannes Weiner
2010-08-26 18:29 ` Johannes Weiner
2010-08-26 20:31 ` Mel Gorman
2010-08-26 20:31 ` Mel Gorman
2010-08-27 2:12 ` Shaohua Li
2010-08-27 2:12 ` Shaohua Li
2010-08-27 2:12 ` Shaohua Li
2010-08-27 9:20 ` Mel Gorman
2010-08-27 9:20 ` Mel Gorman
2010-08-27 8:16 ` Johannes Weiner
2010-08-27 8:16 ` Johannes Weiner
2010-08-27 9:24 ` Mel Gorman
2010-08-27 9:24 ` Mel Gorman
2010-08-30 13:19 ` Johannes Weiner
2010-08-31 15:02 ` Mel Gorman
2010-08-31 15:02 ` Mel Gorman
2010-09-02 15:49 ` Johannes Weiner
2010-09-02 15:49 ` Johannes Weiner
2010-09-02 18:28 ` Mel Gorman
2010-09-02 18:28 ` Mel Gorman
2010-08-29 16:03 ` Minchan Kim
2010-08-29 16:03 ` Minchan Kim
2010-08-26 15:14 ` [PATCH 3/3] writeback: Do not congestion sleep when there are no congested BDIs Mel Gorman
2010-08-26 15:14 ` Mel Gorman
2010-08-26 17:38 ` Minchan Kim
2010-08-26 17:38 ` Minchan Kim
2010-08-26 17:42 ` Mel Gorman
2010-08-26 17:42 ` Mel Gorman
2010-08-26 18:17 ` Johannes Weiner
2010-08-26 18:17 ` Johannes Weiner
2010-08-26 20:23 ` Mel Gorman
2010-08-26 20:23 ` Mel Gorman
2010-08-27 1:11 ` Wu Fengguang
2010-08-27 1:11 ` Wu Fengguang
2010-08-27 9:34 ` Mel Gorman
2010-08-27 9:34 ` Mel Gorman
2010-08-27 1:42 ` Wu Fengguang
2010-08-27 1:42 ` Wu Fengguang
2010-08-27 9:37 ` Mel Gorman
2010-08-27 9:37 ` Mel Gorman
2010-08-27 5:13 ` Dave Chinner
2010-08-27 5:13 ` Dave Chinner
2010-08-27 9:33 ` Mel Gorman
2010-08-27 9:33 ` Mel Gorman
2010-08-26 17:20 ` [RFC PATCH 0/3] Do not wait the full timeout on congestion_wait when there is no congestion Minchan Kim
2010-08-26 17:20 ` Minchan Kim
2010-08-26 17:31 ` Mel Gorman
2010-08-26 17:31 ` Mel Gorman
2010-08-26 17:50 ` Minchan Kim
2010-08-26 17:50 ` Minchan Kim
2010-08-27 1:21 ` Wu Fengguang
2010-08-27 1:21 ` Wu Fengguang
2010-08-27 1:41 ` Minchan Kim
2010-08-27 1:41 ` Minchan Kim
2010-08-27 1:50 ` Wu Fengguang
2010-08-27 1:50 ` Wu Fengguang
2010-08-27 2:02 ` Minchan Kim
2010-08-27 2:02 ` Minchan Kim
2010-08-27 4:34 ` Wu Fengguang
2010-08-27 4:34 ` Wu Fengguang
2010-08-27 9:38 ` Mel Gorman
2010-08-27 9:38 ` Mel Gorman