* [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, Wu Fengguang, LKML
Andrew,
This is to lift the default readahead size to 512KB, which I believe yields
more I/O throughput without noticeably increasing I/O latency for today's HDDs.
For example, for a 100MB/s and 8ms access time HDD:
io_size KB access_time transfer_time io_latency util% throughput KB/s IOPS
4 8 0.04 8.04 0.49% 497.57 124.39
8 8 0.08 8.08 0.97% 990.33 123.79
16 8 0.16 8.16 1.92% 1961.69 122.61
32 8 0.31 8.31 3.76% 3849.62 120.30
64 8 0.62 8.62 7.25% 7420.29 115.94
128 8 1.25 9.25 13.51% 13837.84 108.11
256 8 2.50 10.50 23.81% 24380.95 95.24
512 8 5.00 13.00 38.46% 39384.62 76.92
1024 8 10.00 18.00 55.56% 56888.89 55.56
2048 8 20.00 28.00 71.43% 73142.86 35.71
4096 8 40.00 48.00 83.33% 85333.33 20.83
Going from 128KB to 512KB readahead boosts I/O throughput from ~13MB/s to
~39MB/s, while only increasing I/O latency from 9.25ms to 13.00ms.
As for SSDs, I find that the Intel X25-M SSD benefits from a large readahead
size even for sequential reads (the first patch has benchmark details):
rasize first run time/throughput second run time/throughput
------------------------------------------------------------------
4k 3.40038 s, 123 MB/s 3.42842 s, 122 MB/s
8k 2.7362 s, 153 MB/s 2.74528 s, 153 MB/s
16k 2.59808 s, 161 MB/s 2.58728 s, 162 MB/s
32k 2.50488 s, 167 MB/s 2.49138 s, 168 MB/s
64k 2.12861 s, 197 MB/s 2.13055 s, 197 MB/s
128k 1.92905 s, 217 MB/s 1.93176 s, 217 MB/s
256k 1.75896 s, 238 MB/s 1.78963 s, 234 MB/s
512k 1.67357 s, 251 MB/s 1.69112 s, 248 MB/s
1M 1.62115 s, 259 MB/s 1.63206 s, 257 MB/s
2M 1.56204 s, 269 MB/s 1.58854 s, 264 MB/s
4M 1.57949 s, 266 MB/s 1.57426 s, 266 MB/s
As suggested by Linus, the default readahead size is decreased for small devices at the same time.
[PATCH 01/11] readahead: limit readahead size for small devices
[PATCH 02/11] readahead: bump up the default readahead size
[PATCH 03/11] readahead: introduce {MAX|MIN}_READAHEAD_PAGES macros for ease of use
The two other impacts of an enlarged readahead size are
- memory footprint (caused by readahead miss)
Sequential readahead hit ratio is pretty high regardless of max
readahead size; the extra memory footprint is mainly caused by
enlarged mmap read-around.
I measured my desktop:
- under Xwindow:
128KB readahead cache hit ratio = 143MB/230MB = 62%
512KB readahead cache hit ratio = 138MB/248MB = 55%
- under console: (seems more stable than the Xwindow data)
128KB readahead cache hit ratio = 30MB/56MB = 53%
1MB readahead cache hit ratio = 30MB/59MB = 51%
So the impact on memory footprint looks acceptable.
- readahead thrashing
Readahead will now cost up to 1MB of buffer per stream. Memory-tight systems
typically do not run multiple streams; but if they do, larger readahead
should still help I/O performance as long as thrashing can be avoided,
which the following patches achieve.
[PATCH 04/11] readahead: replace ra->mmap_miss with ra->ra_flags
[PATCH 05/11] readahead: retain inactive lru pages to be accessed soon
[PATCH 06/11] readahead: thrashing safe context readahead
This is a major rewrite of the readahead algorithm, so I did careful tests with
the following tracing/stats patches:
[PATCH 07/11] readahead: record readahead patterns
[PATCH 08/11] readahead: add tracing event
[PATCH 09/11] readahead: add /debug/readahead/stats
I verified the new readahead behavior on various access patterns,
and stress-tested the thrashing safety by running 300 streams
with mem=128M.
Only 2031/61325 = 3.3% of the readahead windows are thrashed (due to
workload variation):
# cat /debug/readahead/stats
pattern readahead eof_hit cache_hit io sync_io mmap_io size async_size io_size
initial 20 9 4 20 20 12 73 37 35
subsequent 3 3 0 1 0 1 8 8 1
context 61325 1 5479 61325 6788 5 14 2 13
thrash 2031 0 1222 2031 2031 0 9 0 6
around 235 90 142 235 235 235 60 0 19
fadvise 0 0 0 0 0 0 0 0 0
random 223 133 0 91 91 1 1 0 1
all 63837 236 6847 63703 9165 0 14 2 13
And the readahead inside a single stream is working as expected:
# grep streams-3162 /debug/tracing/trace
streams-3162 [000] 8602.455953: readahead: readahead-context(dev=0:2, ino=0, req=287352+1, ra=287354+10-2, async=1) = 10
streams-3162 [000] 8602.907873: readahead: readahead-context(dev=0:2, ino=0, req=287362+1, ra=287364+20-3, async=1) = 20
streams-3162 [000] 8604.027879: readahead: readahead-context(dev=0:2, ino=0, req=287381+1, ra=287384+14-2, async=1) = 14
streams-3162 [000] 8604.754722: readahead: readahead-context(dev=0:2, ino=0, req=287396+1, ra=287398+10-2, async=1) = 10
streams-3162 [000] 8605.191228: readahead: readahead-context(dev=0:2, ino=0, req=287406+1, ra=287408+18-3, async=1) = 18
streams-3162 [000] 8606.831895: readahead: readahead-context(dev=0:2, ino=0, req=287423+1, ra=287426+12-2, async=1) = 12
streams-3162 [000] 8606.919614: readahead: readahead-thrash(dev=0:2, ino=0, req=287425+1, ra=287425+8-0, async=0) = 1
streams-3162 [000] 8607.545016: readahead: readahead-context(dev=0:2, ino=0, req=287436+1, ra=287438+9-2, async=1) = 9
streams-3162 [000] 8607.960039: readahead: readahead-context(dev=0:2, ino=0, req=287445+1, ra=287447+18-3, async=1) = 18
streams-3162 [000] 8608.790973: readahead: readahead-context(dev=0:2, ino=0, req=287462+1, ra=287465+21-3, async=1) = 21
streams-3162 [000] 8609.763138: readahead: readahead-context(dev=0:2, ino=0, req=287483+1, ra=287486+15-2, async=1) = 15
streams-3162 [000] 8611.467401: readahead: readahead-context(dev=0:2, ino=0, req=287499+1, ra=287501+11-2, async=1) = 11
streams-3162 [000] 8642.512413: readahead: readahead-context(dev=0:2, ino=0, req=288053+1, ra=288056+10-2, async=1) = 10
streams-3162 [000] 8643.246618: readahead: readahead-context(dev=0:2, ino=0, req=288064+1, ra=288066+22-3, async=1) = 22
streams-3162 [000] 8644.278613: readahead: readahead-context(dev=0:2, ino=0, req=288085+1, ra=288088+16-3, async=1) = 16
streams-3162 [000] 8644.395782: readahead: readahead-context(dev=0:2, ino=0, req=288087+1, ra=288087+21-3, async=0) = 5
streams-3162 [000] 8645.109918: readahead: readahead-context(dev=0:2, ino=0, req=288101+1, ra=288108+8-1, async=1) = 8
streams-3162 [000] 8645.285078: readahead: readahead-context(dev=0:2, ino=0, req=288105+1, ra=288116+8-1, async=1) = 8
streams-3162 [000] 8645.731794: readahead: readahead-context(dev=0:2, ino=0, req=288115+1, ra=288122+14-2, async=1) = 13
streams-3162 [000] 8646.114250: readahead: readahead-context(dev=0:2, ino=0, req=288123+1, ra=288136+8-1, async=1) = 8
streams-3162 [000] 8646.626320: readahead: readahead-context(dev=0:2, ino=0, req=288134+1, ra=288144+16-3, async=1) = 16
streams-3162 [000] 8647.035721: readahead: readahead-context(dev=0:2, ino=0, req=288143+1, ra=288160+10-2, async=1) = 10
streams-3162 [000] 8647.693082: readahead: readahead-context(dev=0:2, ino=0, req=288157+1, ra=288165+12-2, async=1) = 8
streams-3162 [000] 8648.221368: readahead: readahead-context(dev=0:2, ino=0, req=288168+1, ra=288177+15-2, async=1) = 15
streams-3162 [000] 8649.280800: readahead: readahead-context(dev=0:2, ino=0, req=288190+1, ra=288192+23-3, async=1) = 23
[...]
Btw, Linus suggested disabling start-of-file readahead if lseek() has been called:
[PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
Finally, the updated context readahead will do more radix tree scans, so
radix_tree_prev_hole() needs to be optimized:
[PATCH 11/11] radixtree: speed up next/prev hole search
On average, it reduces 8*64 level-0 slot searches to 32 level-0 slot
searches plus 8 level-1 node searches.
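The saving comes from skipping whole intermediate nodes instead of testing every leaf slot. A hypothetical two-level sketch of the idea (the node layout, NODE_SLOTS, and per-node count are illustrative, not the kernel's radix tree code):

```c
#include <stdbool.h>

/* Toy two-level index: 8 nodes of 64 slots each. A full node can be
 * skipped with a single count test instead of 64 per-slot tests. */
#define NODE_SLOTS 64
#define NR_NODES   8

struct node {
	bool slot[NODE_SLOTS];	/* true = page present */
	int  count;		/* number of present slots */
};

/* Return the largest index <= start whose slot is empty, or -1 if none. */
static long prev_hole(struct node nodes[NR_NODES], long start)
{
	long i = start;

	while (i >= 0) {
		struct node *n = &nodes[i / NODE_SLOTS];

		if (n->count == NODE_SLOTS) {
			/* node is full: one level-1 test skips 64 slots */
			i = (i / NODE_SLOTS) * NODE_SLOTS - 1;
			continue;
		}
		if (!n->slot[i % NODE_SLOTS])
			return i;
		i--;
	}
	return -1;
}
```

With mostly-populated nodes, walking back across 8 full nodes costs 8 node-count tests rather than 8*64 per-slot tests, which is the flavor of saving claimed above.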
Thanks,
Fengguang
* [PATCH 01/11] readahead: limit readahead size for small devices
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 6985 bytes --]
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slowly. He managed to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns that into 48kB:
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. We use a formula that
generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
2M 4k
8M 8k
32M 16k
128M 32k
512M 64k
2G 128k
8G 256k
32G 512k
128G 1024k
The formula is determined on the following data, collected by script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula shall not limit readahead size to a
degree that would hurt any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to raise its
default readahead size to 2MB, so that data point is not weighted heavily
in the formula.
SSD 80G Intel x25-M SSDSA2M080
rasize first run time/throughput second run time/throughput
------------------------------------------------------------------
4k 3.40038 s, 123 MB/s 3.42842 s, 122 MB/s
8k 2.7362 s, 153 MB/s 2.74528 s, 153 MB/s
16k 2.59808 s, 161 MB/s 2.58728 s, 162 MB/s
32k 2.50488 s, 167 MB/s 2.49138 s, 168 MB/s
64k 2.12861 s, 197 MB/s 2.13055 s, 197 MB/s
128k 1.92905 s, 217 MB/s 1.93176 s, 217 MB/s
256k 1.75896 s, 238 MB/s 1.78963 s, 234 MB/s
512k 1.67357 s, 251 MB/s 1.69112 s, 248 MB/s
1M 1.62115 s, 259 MB/s 1.63206 s, 257 MB/s
==> 2M 1.56204 s, 269 MB/s 1.58854 s, 264 MB/s
4M 1.57949 s, 266 MB/s 1.57426 s, 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 30G SanDisk SATA 5000
4k 14.1593 s, 29.6 MB/s 14.1699 s, 29.6 MB/s 14.1782 s, 29.6 MB/s
8k 8.05231 s, 52.1 MB/s 8.04463 s, 52.1 MB/s 8.04758 s, 52.1 MB/s
16k 6.81751 s, 61.5 MB/s 6.81564 s, 61.5 MB/s 6.8146 s, 61.5 MB/s
32k 6.24176 s, 67.2 MB/s 6.2438 s, 67.2 MB/s 6.24645 s, 67.1 MB/s
64k 5.87828 s, 71.4 MB/s 5.87858 s, 71.3 MB/s 5.87481 s, 71.4 MB/s
128k 5.71649 s, 73.4 MB/s 5.71804 s, 73.4 MB/s 5.72055 s, 73.3 MB/s
==> 256k 5.62466 s, 74.6 MB/s 5.62304 s, 74.6 MB/s 5.62114 s, 74.6 MB/s
512k 5.61532 s, 74.7 MB/s 5.62098 s, 74.6 MB/s 5.61818 s, 74.7 MB/s
1M 5.50888 s, 76.1 MB/s 5.6204 s, 74.6 MB/s 5.62281 s, 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 53.1635 s, 7.9 MB/s 53.155 s, 7.9 MB/s 53.107 s, 7.9 MB/s
8k 23.4061 s, 17.9 MB/s 23.3955 s, 17.9 MB/s 23.4222 s, 17.9 MB/s
16k 17.1077 s, 24.5 MB/s 17.0909 s, 24.5 MB/s 17.0875 s, 24.5 MB/s
32k 14.6029 s, 28.7 MB/s 14.5913 s, 28.7 MB/s 14.5951 s, 28.7 MB/s
64k 14.5483 s, 28.8 MB/s 14.5344 s, 28.9 MB/s 14.5333 s, 28.9 MB/s
==> 128k 13.7497 s, 30.5 MB/s 13.7364 s, 30.5 MB/s 13.731 s, 30.5 MB/s
256k 13.5521 s, 30.9 MB/s 13.5415 s, 31.0 MB/s 13.5554 s, 30.9 MB/s
512k 13.5414 s, 31.0 MB/s 13.5631 s, 30.9 MB/s 13.5654 s, 30.9 MB/s
1M 13.574 s, 30.9 MB/s 13.5686 s, 30.9 MB/s 13.5667 s, 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 65.3449 s, 6.4 MB/s 65.3759 s, 6.4 MB/s 65.3405 s, 6.4 MB/s
8k 31.2002 s, 13.4 MB/s 31.1914 s, 13.4 MB/s 31.6836 s, 13.2 MB/s
16k 23.5281 s, 17.8 MB/s 23.4705 s, 17.9 MB/s 23.5859 s, 17.8 MB/s
32k 19.6786 s, 21.3 MB/s 19.719 s, 21.3 MB/s 19.7548 s, 21.2 MB/s
64k 19.6219 s, 21.4 MB/s 19.6125 s, 21.4 MB/s 19.594 s, 21.4 MB/s
==> 128k 18.021 s, 23.3 MB/s 18.0527 s, 23.2 MB/s 18.0694 s, 23.2 MB/s
256k 17.978 s, 23.3 MB/s 17.6483 s, 23.8 MB/s 17.9324 s, 23.4 MB/s
512k 17.659 s, 23.8 MB/s 17.9403 s, 23.4 MB/s 17.986 s, 23.3 MB/s
1M 17.9437 s, 23.4 MB/s 18.0634 s, 23.2 MB/s 17.9469 s, 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 62.6246 s, 6.7 MB/s 60.5872 s, 6.9 MB/s 62.2581 s, 6.7 MB/s
8k 35.7505 s, 11.7 MB/s 35.764 s, 11.7 MB/s 35.7396 s, 11.7 MB/s
16k 33.7949 s, 12.4 MB/s 33.8041 s, 12.4 MB/s 33.8015 s, 12.4 MB/s
--> 32k 31.3851 s, 13.4 MB/s 31.381 s, 13.4 MB/s 31.3784 s, 13.4 MB/s
64k 31.3478 s, 13.4 MB/s 31.3494 s, 13.4 MB/s 31.3486 s, 13.4 MB/s
==> 128k 30.7384 s, 13.6 MB/s 30.7337 s, 13.6 MB/s 30.728 s, 13.6 MB/s
256k 30.5439 s, 13.7 MB/s 30.544 s, 13.7 MB/s 30.5433 s, 13.7 MB/s
512k 30.5408 s, 13.7 MB/s 30.543 s, 13.7 MB/s 30.5447 s, 13.7 MB/s
1M 30.5919 s, 13.7 MB/s 30.5893 s, 13.7 MB/s 30.5939 s, 13.7 MB/s
Does anyone have a 512MB or 128MB USB stick? In any case, performance is
satisfactory with a readahead size >= 32k.
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
--- linux.orig/block/genhd.c 2010-01-21 21:17:16.000000000 +0800
+++ linux/block/genhd.c 2010-01-22 17:09:34.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,23 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * limit readahead size for small devices
+ * disk size readahead size
+ * 2M 4k
+ * 8M 8k
+ * 32M 16k
+ * 128M 32k
+ * 512M 64k
+ * 2G 128k
+ * 8G 256k
+ * 32G 512k
+ * 128G 1024k
+ */
+ size = get_capacity(disk) >> 12;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
}
EXPORT_SYMBOL(add_disk);
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 01/11] readahead: limit readahead size for small devices
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 7210 bytes --]
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slow. He manages to optimize the blkid
reads down to 1kB+16kB, but still kernel read-ahead turns it into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller device tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. We use a formula that
generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
2M 4k
8M 8k
32M 16k
128M 32k
512M 64k
2G 128k
8G 256k
32G 512k
128G 1024k
The formula is determined on the following data, collected by script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula should not limit readahead size to a
degree that impacts any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead size. However it may take years for Linux to increase
its default readahead size to 2MB, so we don't take it seriously in the
formula.
SSD 80G Intel x25-M SSDSA2M080
rasize first run time/throughput second run time/throughput
------------------------------------------------------------------
4k 3.40038 s, 123 MB/s 3.42842 s, 122 MB/s
8k 2.7362 s, 153 MB/s 2.74528 s, 153 MB/s
16k 2.59808 s, 161 MB/s 2.58728 s, 162 MB/s
32k 2.50488 s, 167 MB/s 2.49138 s, 168 MB/s
64k 2.12861 s, 197 MB/s 2.13055 s, 197 MB/s
128k 1.92905 s, 217 MB/s 1.93176 s, 217 MB/s
256k 1.75896 s, 238 MB/s 1.78963 s, 234 MB/s
512k 1.67357 s, 251 MB/s 1.69112 s, 248 MB/s
1M 1.62115 s, 259 MB/s 1.63206 s, 257 MB/s
==> 2M 1.56204 s, 269 MB/s 1.58854 s, 264 MB/s
4M 1.57949 s, 266 MB/s 1.57426 s, 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 30G SanDisk SATA 5000
4k 14.1593 s, 29.6 MB/s 14.1699 s, 29.6 MB/s 14.1782 s, 29.6 MB/s
8k 8.05231 s, 52.1 MB/s 8.04463 s, 52.1 MB/s 8.04758 s, 52.1 MB/s
16k 6.81751 s, 61.5 MB/s 6.81564 s, 61.5 MB/s 6.8146 s, 61.5 MB/s
32k 6.24176 s, 67.2 MB/s 6.2438 s, 67.2 MB/s 6.24645 s, 67.1 MB/s
64k 5.87828 s, 71.4 MB/s 5.87858 s, 71.3 MB/s 5.87481 s, 71.4 MB/s
128k 5.71649 s, 73.4 MB/s 5.71804 s, 73.4 MB/s 5.72055 s, 73.3 MB/s
==> 256k 5.62466 s, 74.6 MB/s 5.62304 s, 74.6 MB/s 5.62114 s, 74.6 MB/s
512k 5.61532 s, 74.7 MB/s 5.62098 s, 74.6 MB/s 5.61818 s, 74.7 MB/s
1M 5.50888 s, 76.1 MB/s 5.6204 s, 74.6 MB/s 5.62281 s, 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 53.1635 s, 7.9 MB/s 53.155 s, 7.9 MB/s 53.107 s, 7.9 MB/s
8k 23.4061 s, 17.9 MB/s 23.3955 s, 17.9 MB/s 23.4222 s, 17.9 MB/s
16k 17.1077 s, 24.5 MB/s 17.0909 s, 24.5 MB/s 17.0875 s, 24.5 MB/s
32k 14.6029 s, 28.7 MB/s 14.5913 s, 28.7 MB/s 14.5951 s, 28.7 MB/s
64k 14.5483 s, 28.8 MB/s 14.5344 s, 28.9 MB/s 14.5333 s, 28.9 MB/s
==> 128k 13.7497 s, 30.5 MB/s 13.7364 s, 30.5 MB/s 13.731 s, 30.5 MB/s
256k 13.5521 s, 30.9 MB/s 13.5415 s, 31.0 MB/s 13.5554 s, 30.9 MB/s
512k 13.5414 s, 31.0 MB/s 13.5631 s, 30.9 MB/s 13.5654 s, 30.9 MB/s
1M 13.574 s, 30.9 MB/s 13.5686 s, 30.9 MB/s 13.5667 s, 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 65.3449 s, 6.4 MB/s 65.3759 s, 6.4 MB/s 65.3405 s, 6.4 MB/s
8k 31.2002 s, 13.4 MB/s 31.1914 s, 13.4 MB/s 31.6836 s, 13.2 MB/s
16k 23.5281 s, 17.8 MB/s 23.4705 s, 17.9 MB/s 23.5859 s, 17.8 MB/s
32k 19.6786 s, 21.3 MB/s 19.719 s, 21.3 MB/s 19.7548 s, 21.2 MB/s
64k 19.6219 s, 21.4 MB/s 19.6125 s, 21.4 MB/s 19.594 s, 21.4 MB/s
==> 128k 18.021 s, 23.3 MB/s 18.0527 s, 23.2 MB/s 18.0694 s, 23.2 MB/s
256k 17.978 s, 23.3 MB/s 17.6483 s, 23.8 MB/s 17.9324 s, 23.4 MB/s
512k 17.659 s, 23.8 MB/s 17.9403 s, 23.4 MB/s 17.986 s, 23.3 MB/s
1M 17.9437 s, 23.4 MB/s 18.0634 s, 23.2 MB/s 17.9469 s, 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 62.6246 s, 6.7 MB/s 60.5872 s, 6.9 MB/s 62.2581 s, 6.7 MB/s
8k 35.7505 s, 11.7 MB/s 35.764 s, 11.7 MB/s 35.7396 s, 11.7 MB/s
16k 33.7949 s, 12.4 MB/s 33.8041 s, 12.4 MB/s 33.8015 s, 12.4 MB/s
--> 32k 31.3851 s, 13.4 MB/s 31.381 s, 13.4 MB/s 31.3784 s, 13.4 MB/s
64k 31.3478 s, 13.4 MB/s 31.3494 s, 13.4 MB/s 31.3486 s, 13.4 MB/s
==> 128k 30.7384 s, 13.6 MB/s 30.7337 s, 13.6 MB/s 30.728 s, 13.6 MB/s
256k 30.5439 s, 13.7 MB/s 30.544 s, 13.7 MB/s 30.5433 s, 13.7 MB/s
512k 30.5408 s, 13.7 MB/s 30.543 s, 13.7 MB/s 30.5447 s, 13.7 MB/s
1M 30.5919 s, 13.7 MB/s 30.5893 s, 13.7 MB/s 30.5939 s, 13.7 MB/s
Does anyone have a 512MB or 128MB USB stick? In any case, you get
satisfactory performance with a >= 32k readahead size.
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
--- linux.orig/block/genhd.c 2010-01-21 21:17:16.000000000 +0800
+++ linux/block/genhd.c 2010-01-22 17:09:34.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,23 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * limit readahead size for small devices
+ * disk size readahead size
+ * 2M 4k
+ * 8M 8k
+ * 32M 16k
+ * 128M 32k
+ * 512M 64k
+ * 2G 128k
+ * 8G 256k
+ * 32G 512k
+ * 128G 1024k
+ */
+ size = get_capacity(disk) >> 12;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
}
EXPORT_SYMBOL(add_disk);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [PATCH 02/11] readahead: bump up the default readahead size
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Peter Zijlstra, Martin Schwidefsky,
Christian Ehrhardt, Wu Fengguang, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-enlarge-default-size.patch --]
[-- Type: text/plain, Size: 1120 bytes --]
Use a 512KB max readahead size and a 32KB min readahead size.
The former improves I/O performance for common workloads.
The latter will be used by the thrashing-safe context readahead.
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/mm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- linux.orig/include/linux/mm.h 2010-01-30 17:38:49.000000000 +0800
+++ linux/include/linux/mm.h 2010-01-30 18:09:58.000000000 +0800
@@ -1184,8 +1184,8 @@ int write_one_page(struct page *page, in
void task_dirty_inc(struct task_struct *tsk);
/* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
-#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
+#define VM_MAX_READAHEAD 512 /* kbytes */
+#define VM_MIN_READAHEAD 32 /* kbytes (includes current page) */
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
* [PATCH 03/11] readahead: introduce {MAX|MIN}_READAHEAD_PAGES macros for ease of use
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-min-max-pages.patch --]
[-- Type: text/plain, Size: 2309 bytes --]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/blk-core.c | 3 +--
fs/fuse/inode.c | 2 +-
include/linux/mm.h | 3 +++
mm/backing-dev.c | 2 +-
4 files changed, 6 insertions(+), 4 deletions(-)
--- linux.orig/block/blk-core.c 2010-01-30 17:38:48.000000000 +0800
+++ linux/block/blk-core.c 2010-01-30 18:10:01.000000000 +0800
@@ -498,8 +498,7 @@ struct request_queue *blk_alloc_queue_no
q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
q->backing_dev_info.unplug_io_data = q;
- q->backing_dev_info.ra_pages =
- (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ q->backing_dev_info.ra_pages = MAX_READAHEAD_PAGES;
q->backing_dev_info.state = 0;
q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
q->backing_dev_info.name = "block";
--- linux.orig/fs/fuse/inode.c 2010-01-30 17:38:48.000000000 +0800
+++ linux/fs/fuse/inode.c 2010-01-30 18:10:01.000000000 +0800
@@ -870,7 +870,7 @@ static int fuse_bdi_init(struct fuse_con
int err;
fc->bdi.name = "fuse";
- fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ fc->bdi.ra_pages = MAX_READAHEAD_PAGES;
fc->bdi.unplug_io_fn = default_unplug_io_fn;
/* fuse does it's own writeback accounting */
fc->bdi.capabilities = BDI_CAP_NO_ACCT_WB;
--- linux.orig/include/linux/mm.h 2010-01-30 18:09:58.000000000 +0800
+++ linux/include/linux/mm.h 2010-01-30 18:10:01.000000000 +0800
@@ -1187,6 +1187,9 @@ void task_dirty_inc(struct task_struct *
#define VM_MAX_READAHEAD 512 /* kbytes */
#define VM_MIN_READAHEAD 32 /* kbytes (includes current page) */
+#define MAX_READAHEAD_PAGES (VM_MAX_READAHEAD*1024 / PAGE_CACHE_SIZE)
+#define MIN_READAHEAD_PAGES DIV_ROUND_UP(VM_MIN_READAHEAD*1024, PAGE_CACHE_SIZE)
+
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
--- linux.orig/mm/backing-dev.c 2010-01-30 17:38:48.000000000 +0800
+++ linux/mm/backing-dev.c 2010-01-30 18:10:01.000000000 +0800
@@ -18,7 +18,7 @@ EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
.name = "default",
- .ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
+ .ra_pages = MAX_READAHEAD_PAGES,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
* [PATCH 04/11] readahead: replace ra->mmap_miss with ra->ra_flags
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Nick Piggin, Andi Kleen, Steven Whitehouse,
Wu Fengguang, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-flags.patch --]
[-- Type: text/plain, Size: 2548 bytes --]
Introduce a readahead flags field and embed the existing mmap_miss in it
(to save space).
Flag updates may be lost under race conditions; however, the impact
should be limited.
CC: Nick Piggin <npiggin@suse.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 30 +++++++++++++++++++++++++++++-
mm/filemap.c | 7 ++-----
2 files changed, 31 insertions(+), 6 deletions(-)
--- linux.orig/include/linux/fs.h 2010-01-30 18:09:33.000000000 +0800
+++ linux/include/linux/fs.h 2010-01-30 18:10:04.000000000 +0800
@@ -889,10 +889,38 @@ struct file_ra_state {
there are only # of pages ahead */
unsigned int ra_pages; /* Maximum readahead window */
- unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
+ unsigned int ra_flags;
loff_t prev_pos; /* Cache last read() position */
};
+/* ra_flags bits */
+#define READAHEAD_MMAP_MISS 0x0000ffff /* cache misses for mmap access */
+
+/*
+ * Don't do ra_flags++ directly to avoid possible overflow:
+ * the ra fields can be accessed concurrently in a racy way.
+ */
+static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
+{
+ unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+ if (miss < READAHEAD_MMAP_MISS) {
+ miss++;
+ ra->ra_flags = miss | (ra->ra_flags &~ READAHEAD_MMAP_MISS);
+ }
+ return miss;
+}
+
+static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
+{
+ unsigned int miss = ra->ra_flags & READAHEAD_MMAP_MISS;
+
+ if (miss) {
+ miss--;
+ ra->ra_flags = miss | (ra->ra_flags &~ READAHEAD_MMAP_MISS);
+ }
+}
+
/*
* Check if @index falls in the readahead windows.
*/
--- linux.orig/mm/filemap.c 2010-01-30 18:09:33.000000000 +0800
+++ linux/mm/filemap.c 2010-01-30 18:10:04.000000000 +0800
@@ -1418,14 +1418,12 @@ static void do_sync_mmap_readahead(struc
return;
}
- if (ra->mmap_miss < INT_MAX)
- ra->mmap_miss++;
/*
* Do we miss much more than hit in this file? If so,
* stop bothering with read-ahead. It will only hurt.
*/
- if (ra->mmap_miss > MMAP_LOTSAMISS)
+ if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
return;
/*
@@ -1455,8 +1453,7 @@ static void do_async_mmap_readahead(stru
/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
return;
- if (ra->mmap_miss > 0)
- ra->mmap_miss--;
+ ra_mmap_miss_dec(ra);
if (PageReadahead(page))
page_cache_async_readahead(mapping, ra, file,
page, offset, ra->ra_pages);
* [PATCH 05/11] readahead: retain inactive lru pages to be accessed soon
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Chris Frost, Steve VanDeBogart, KAMEZAWA Hiroyuki,
Wu Fengguang, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, LKML
[-- Attachment #1: readahead-retain-pages-find_get_page.patch --]
[-- Type: text/plain, Size: 3346 bytes --]
From: Chris Frost <frost@cs.ucla.edu>
Ensure that cached pages in the inactive list are not prematurely evicted;
move such pages to lru head when they are covered by
- in-kernel heuristic readahead
- a posix_fadvise(POSIX_FADV_WILLNEED) hint from an application
Before this patch, pages already in core could be evicted before other
pages covered by the same prefetch scan that were not yet in core,
forcing many small read requests to the disk.
In particular, posix_fadvise(... POSIX_FADV_WILLNEED) on an in-core page
has no effect on the page's location in the LRU list, even if it is the
next victim on the inactive list.
This change helps address the performance problems we encountered
while modifying SQLite and the GIMP to use large file prefetching.
Overall these prefetching techniques improved the runtime of large
benchmarks by 10-17x for these applications. More in the publication
_Reducing Seek Overhead with Application-Directed Prefetching_ in
USENIX ATC 2009 and at http://libprefetch.cs.ucla.edu/.
Signed-off-by: Chris Frost <frost@cs.ucla.edu>
Signed-off-by: Steve VanDeBogart <vandebo@cs.ucla.edu>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/readahead.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
--- linux.orig/mm/readahead.c 2010-02-01 10:18:57.000000000 +0800
+++ linux/mm/readahead.c 2010-02-01 10:20:51.000000000 +0800
@@ -9,7 +9,9 @@
#include <linux/kernel.h>
#include <linux/fs.h>
+#include <linux/memcontrol.h>
#include <linux/mm.h>
+#include <linux/mm_inline.h>
#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
@@ -133,6 +135,40 @@ out:
}
/*
+ * The file range is expected to be accessed in near future. Move pages
+ * (possibly in inactive lru tail) to lru head, so that they are retained
+ * in memory for some reasonable time.
+ */
+static void retain_inactive_pages(struct address_space *mapping,
+ pgoff_t index, int len)
+{
+ int i;
+ struct page *page;
+ struct zone *zone;
+
+ for (i = 0; i < len; i++) {
+ page = find_get_page(mapping, index + i);
+ if (!page)
+ continue;
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+
+ if (PageLRU(page) &&
+ !PageActive(page) &&
+ !PageUnevictable(page)) {
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(zone, page, lru);
+ add_page_to_lru_list(zone, page, lru);
+ }
+
+ spin_unlock_irq(&zone->lru_lock);
+ put_page(page);
+ }
+}
+
+/*
* __do_page_cache_readahead() actually reads a chunk of disk. It allocates all
* the pages first, then submits them all for I/O. This avoids the very bad
* behaviour which would occur if page allocations are causing VM writeback.
@@ -184,6 +220,14 @@ __do_page_cache_readahead(struct address
}
/*
+ * Normally readahead will auto stop on cached segments, so we won't
+ * hit many cached pages. If it does happen, bring the inactive pages
+ * adjacent to the newly prefetched ones (if any) to the lru head.
+ */
+ if (ret < nr_to_read)
+ retain_inactive_pages(mapping, offset, page_idx);
+
+ /*
* Now start the IO. We ignore I/O errors - if the page is not
* uptodate then the caller will launch readpage again, and
* will then handle the error.
* [PATCH 06/11] readahead: thrashing safe context readahead
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-thrashing-safe-mode.patch --]
[-- Type: text/plain, Size: 9147 bytes --]
Introduce a more complete version of context readahead, which is a
full-fledged readahead algorithm by itself. It replaces some of the
existing cases.
- oversize read
no behavior change; except in thrashed mode, async_size will be 0
- random read
no behavior change; implies some different internal handling
Random reads will now be recorded in file_ra_state, which means that in
an intermixed sequential+random pattern, the sequential part's state
will be flushed by the random reads, and hence will be serviced by the
context readahead instead of the stateful one. It also means that the
first readahead for a sequential read in the middle of a file will be
started by the stateful readahead, instead of by the sequential cache
miss path.
- sequential cache miss
better
When walking out of a cached page segment, the readahead size will
be fully restored immediately instead of ramping up from initial size.
- hit readahead marker without valid state
better in rare cases; costs more radix tree lookups, but won't be a
problem with optimized radix_tree_prev_hole(). The added radix tree
scan for history pages is to calculate the thrashing safe readahead
size and adaptive async size.
The algorithm first looks ahead to find the start point of the next
read-ahead, then looks backward in the page cache to get an estimate of
the thrashing threshold.
It is able to automatically adapt to the thrashing threshold in a
smooth workload. The estimation theory can be illustrated with the
following figure:
chunk A chunk B chunk C head
l01 l11 l12 l21 l22
| |-->|-->| |------>|-->| |------>|
| +-------+ +-----------+ +-------------+ |
| | # | | # | | # | |
| +-------+ +-----------+ +-------------+ |
| |<==============|<===========================|<============================|
L0 L1 L2
Let f(l) = L be a map from
l: the number of pages read by the stream
to
L: the number of pages pushed into inactive_list in the mean time
then
f(l01) <= L0
f(l11 + l12) = L1
f(l21 + l22) = L2
...
f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
<= Length(inactive_list) = f(thrashing-threshold)
So the count of contiguous history pages left in inactive_list is
always a lower bound on the true thrashing threshold. Given a stable
workload, the readahead size will keep ramping up and then stabilize in
the range
(thrashing_threshold/2, thrashing_threshold)
This is good because it is in fact bad to always reach the thrashing
threshold: that would not only be more susceptible to fluctuations, but
would also impose eviction pressure on the cached pages.
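The lower-bound property above can be demonstrated with a toy
user-space model (an illustrative sketch, not the kernel code): model
the inactive list as a FIFO of fixed capacity, let a competing stream
evict some pages, and count the contiguous history pages one stream
still finds cached. The count can never exceed the list length, i.e.
the thrashing threshold:

```c
#include <assert.h>
#include <string.h>

/* Toy model: the inactive list is a FIFO of INACTIVE_LEN page slots,
 * so INACTIVE_LEN plays the role of the thrashing threshold. */
#define INACTIVE_LEN 64

static int lru[INACTIVE_LEN];   /* page ids; lru[lru_len-1] is the head */
static int lru_len;

static void push_inactive(int page)
{
    if (lru_len == INACTIVE_LEN)        /* full: evict the oldest page */
        memmove(lru, lru + 1, (INACTIVE_LEN - 1) * sizeof(lru[0]));
    else
        lru_len++;
    lru[lru_len - 1] = page;
}

static int in_core(int page)
{
    for (int i = 0; i < lru_len; i++)
        if (lru[i] == page)
            return 1;
    return 0;
}

/* count_history_pages() analogue: contiguous cached pages just
 * before @index, i.e. the stream's surviving trace f(l). */
static int count_history(int index)
{
    int n = 0;
    while (n < index && in_core(index - 1 - n))
        n++;
    return n;
}
```

Whatever the interleaving of streams, count_history() returns at most
INACTIVE_LEN, so sizing the next readahead from it stays on the safe
side of the threshold.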
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 1
mm/readahead.c | 155 ++++++++++++++++++++++++-------------------
2 files changed, 88 insertions(+), 68 deletions(-)
--- linux.orig/mm/readahead.c 2010-02-01 10:20:51.000000000 +0800
+++ linux/mm/readahead.c 2010-02-02 21:51:53.000000000 +0800
@@ -20,6 +20,11 @@
#include <linux/pagemap.h>
/*
+ * Set async size to 1/# of the thrashing threshold.
+ */
+#define READAHEAD_ASYNC_RATIO 8
+
+/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
*/
@@ -393,39 +398,16 @@ static pgoff_t count_history_pages(struc
}
/*
- * page cache context based read-ahead
+ * Was @index recently read ahead but not yet read by the application?
+ * The low boundary is permissively estimated.
*/
-static int try_context_readahead(struct address_space *mapping,
- struct file_ra_state *ra,
- pgoff_t offset,
- unsigned long req_size,
- unsigned long max)
+static bool ra_thrashed(struct file_ra_state *ra, pgoff_t index)
{
- pgoff_t size;
-
- size = count_history_pages(mapping, ra, offset, max);
-
- /*
- * no history pages:
- * it could be a random read
- */
- if (!size)
- return 0;
-
- /*
- * starts from beginning of file:
- * it is a strong indication of long-run stream (or whole-file-read)
- */
- if (size >= offset)
- size *= 2;
-
- ra->start = offset;
- ra->size = get_init_ra_size(size + req_size, max);
- ra->async_size = ra->size;
-
- return 1;
+ return (index >= ra->start - ra->size &&
+ index < ra->start + ra->size);
}
+
/*
* A minimal readahead algorithm for trivial sequential/random reads.
*/
@@ -436,12 +418,26 @@ ondemand_readahead(struct address_space
unsigned long req_size)
{
unsigned long max = max_sane_readahead(ra->ra_pages);
+ unsigned int size;
+ pgoff_t start;
/*
* start of file
*/
- if (!offset)
- goto initial_readahead;
+ if (!offset) {
+ ra->start = offset;
+ ra->size = get_init_ra_size(req_size, max);
+ ra->async_size = ra->size > req_size ?
+ ra->size - req_size : ra->size;
+ goto readit;
+ }
+
+ /*
+ * Context readahead is thrashing safe, and can adapt to near the
+ * thrashing threshold given a stable workload.
+ */
+ if (ra->ra_flags & READAHEAD_THRASHED)
+ goto context_readahead;
/*
* It's the expected callback offset, assume sequential access.
@@ -456,58 +452,81 @@ ondemand_readahead(struct address_space
}
/*
- * Hit a marked page without valid readahead state.
- * E.g. interleaved reads.
- * Query the pagecache for async_size, which normally equals to
- * readahead size. Ramp it up and use it as the new readahead size.
+ * oversize read, no need to query page cache
*/
- if (hit_readahead_marker) {
- pgoff_t start;
+ if (req_size > max && !hit_readahead_marker) {
+ ra->start = offset;
+ ra->size = max;
+ ra->async_size = max;
+ goto readit;
+ }
+ /*
+ * page cache context based read-ahead
+ *
+ * ==========================_____________..............
+ * [ current window ]
+ * ^offset
+ * 1) |---- A ---->[start
+ * 2) |<----------- H -----------|
+ * 3) |----------- H ----------->]end
+ * [ new window ]
+ * [=] cached,visited [_] cached,to-be-visited [.] not cached
+ *
+ * 1) A = pages ahead = previous async_size
+ * 2) H = history pages = thrashing safe size
+ * 3) H - A = new readahead size
+ */
+context_readahead:
+ if (hit_readahead_marker) {
rcu_read_lock();
- start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+ start = radix_tree_next_hole(&mapping->page_tree,
+ offset + 1, max);
rcu_read_unlock();
-
+ /*
+ * there are enough pages ahead: no readahead
+ */
if (!start || start - offset > max)
return 0;
+ } else
+ start = offset;
+ size = count_history_pages(mapping, ra, offset,
+ READAHEAD_ASYNC_RATIO * max);
+ /*
+ * no history pages cached, could be
+ * - a random read
+ * - a thrashed sequential read
+ */
+ if (!size && !hit_readahead_marker) {
+ if (!ra_thrashed(ra, offset)) {
+ ra->size = min(req_size, max);
+ } else {
+ retain_inactive_pages(mapping, offset, min(2 * max,
+ ra->start + ra->size - offset));
+ ra->size = max_t(int, ra->size/2, MIN_READAHEAD_PAGES);
+ ra->ra_flags |= READAHEAD_THRASHED;
+ }
+ ra->async_size = 0;
ra->start = start;
- ra->size = start - offset; /* old async_size */
- ra->size += req_size;
- ra->size = get_next_ra_size(ra, max);
- ra->async_size = ra->size;
goto readit;
}
-
/*
- * oversize read
+ * history pages start from beginning of file:
+ * it is a strong indication of long-run stream (or whole-file reads)
*/
- if (req_size > max)
- goto initial_readahead;
-
- /*
- * sequential cache miss
- */
- if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
- goto initial_readahead;
-
- /*
- * Query the page cache and look for the traces(cached history pages)
- * that a sequential stream would leave behind.
- */
- if (try_context_readahead(mapping, ra, offset, req_size, max))
- goto readit;
-
+ if (size >= offset)
+ size *= 2;
/*
- * standalone, small random read
- * Read as is, and do not pollute the readahead state.
+ * pages to readahead are already cached
*/
- return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+ if (size <= start - offset)
+ return 0;
-initial_readahead:
- ra->start = offset;
- ra->size = get_init_ra_size(req_size, max);
- ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+ size -= start - offset;
+ ra->start = start;
+ ra->size = clamp_t(unsigned int, size, MIN_READAHEAD_PAGES, max);
+ ra->async_size = min(ra->size, 1 + size / READAHEAD_ASYNC_RATIO);
readit:
/*
--- linux.orig/include/linux/fs.h 2010-02-01 10:21:09.000000000 +0800
+++ linux/include/linux/fs.h 2010-02-02 21:50:52.000000000 +0800
@@ -895,6 +895,7 @@ struct file_ra_state {
/* ra_flags bits */
#define READAHEAD_MMAP_MISS 0x0000ffff /* cache misses for mmap access */
+#define READAHEAD_THRASHED 0x10000000
/*
* Don't do ra_flags++ directly to avoid possible overflow:
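The last hunk's ra_flags word packs a 16-bit mmap-miss counter in the
low bits (READAHEAD_MMAP_MISS) alongside flag bits such as
READAHEAD_THRASHED, which is why the comment warns against a plain
ra_flags++. A minimal sketch of the saturating-increment pattern it
alludes to (helper and macro names are hypothetical; the bit values
match the patch):

```c
#include <assert.h>

#define RA_MMAP_MISS  0x0000ffffu  /* low 16 bits: mmap miss counter */
#define RA_THRASHED   0x10000000u  /* flag bit above the counter */

/* Saturating counter increment: once the low 16 bits are all set,
 * stop incrementing so the carry never corrupts the flag bits. */
static unsigned int ra_miss_inc(unsigned int flags)
{
    if ((flags & RA_MMAP_MISS) != RA_MMAP_MISS)
        flags++;                   /* counter lives in the low bits */
    return flags;
}
```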
+ ra->ra_flags |= READAHEAD_THRASHED;
+ }
+ ra->async_size = 0;
ra->start = start;
- ra->size = start - offset; /* old async_size */
- ra->size += req_size;
- ra->size = get_next_ra_size(ra, max);
- ra->async_size = ra->size;
goto readit;
}
-
/*
- * oversize read
+ * history pages start from beginning of file:
+ * it is a strong indication of long-run stream (or whole-file reads)
*/
- if (req_size > max)
- goto initial_readahead;
-
- /*
- * sequential cache miss
- */
- if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
- goto initial_readahead;
-
- /*
- * Query the page cache and look for the traces(cached history pages)
- * that a sequential stream would leave behind.
- */
- if (try_context_readahead(mapping, ra, offset, req_size, max))
- goto readit;
-
+ if (size >= offset)
+ size *= 2;
/*
- * standalone, small random read
- * Read as is, and do not pollute the readahead state.
+ * pages to readahead are already cached
*/
- return __do_page_cache_readahead(mapping, filp, offset, req_size, 0);
+ if (size <= start - offset)
+ return 0;
-initial_readahead:
- ra->start = offset;
- ra->size = get_init_ra_size(req_size, max);
- ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
+ size -= start - offset;
+ ra->start = start;
+ ra->size = clamp_t(unsigned int, size, MIN_READAHEAD_PAGES, max);
+ ra->async_size = min(ra->size, 1 + size / READAHEAD_ASYNC_RATIO);
readit:
/*
--- linux.orig/include/linux/fs.h 2010-02-01 10:21:09.000000000 +0800
+++ linux/include/linux/fs.h 2010-02-02 21:50:52.000000000 +0800
@@ -895,6 +895,7 @@ struct file_ra_state {
/* ra_flags bits */
#define READAHEAD_MMAP_MISS 0x0000ffff /* cache misses for mmap access */
+#define READAHEAD_THRASHED 0x10000000
/*
* Don't do ra_flags++ directly to avoid possible overflow:
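The context readahead sizing in the mm/readahead.c hunk above can be modeled in plain userspace C. This is a sketch of the arithmetic only, not the kernel code: the function name and the min/max parameters are illustrative stand-ins for MIN_READAHEAD_PAGES and the ra_pages limit.

```c
#include <assert.h>

/*
 * Userspace model of the sizing rules in the patch above:
 *   H = history pages cached behind @offset (thrashing safe size)
 *   A = start - offset (pages already cached ahead of @offset)
 * The new readahead size is H - A, clamped to [min, max].
 * Names and parameters are illustrative, not the kernel API.
 */
static unsigned long context_ra_size(unsigned long history,
				     unsigned long offset,
				     unsigned long start,
				     unsigned long min,
				     unsigned long max)
{
	unsigned long size = history;

	/* history reaches start of file: strong hint of a long-run stream */
	if (size >= offset)
		size *= 2;

	/* pages to read ahead are already cached */
	if (size <= start - offset)
		return 0;

	size -= start - offset;
	if (size < min)
		size = min;
	if (size > max)
		size = max;
	return size;
}
```

For a stable workload this converges: the history window H tracks how many pages survive in cache, so the submitted size stays just under the thrashing threshold.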
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* [PATCH 07/11] readahead: record readahead patterns
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-tracepoints.patch --]
[-- Type: text/plain, Size: 6162 bytes --]
Record the readahead pattern in ra_flags. This info can be examined by
users via the readahead tracing/stats interfaces.
Currently 7 patterns are defined:
pattern readahead for
-----------------------------------------------------------
RA_PATTERN_INITIAL start-of-file/oversize read
RA_PATTERN_SUBSEQUENT trivial sequential read
RA_PATTERN_CONTEXT interleaved sequential read
RA_PATTERN_THRASH thrashed sequential read
RA_PATTERN_MMAP_AROUND mmap fault
RA_PATTERN_FADVISE posix_fadvise()
RA_PATTERN_RANDOM random read
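The pattern is stored in a 4-bit field of ra_flags (bits 20-23, per the READAHEAD_PATTERN mask in the diff below). A minimal userspace sketch of the packing logic; the helper names are adapted (the patch's ra_set_pattern() operates on struct file_ra_state, while these work on a bare flags word):

```c
#include <assert.h>

#define READAHEAD_PATTERN_SHIFT	20
#define READAHEAD_PATTERN	0x00f00000

enum readahead_pattern {
	RA_PATTERN_INITIAL,
	RA_PATTERN_SUBSEQUENT,
	RA_PATTERN_CONTEXT,
	RA_PATTERN_THRASH,
	RA_PATTERN_MMAP_AROUND,
	RA_PATTERN_FADVISE,
	RA_PATTERN_RANDOM,
	RA_PATTERN_ALL,		/* for summary stats */
};

/* Store @pattern in ra_flags without touching the other flag bits. */
static unsigned int ra_set_pattern_bits(unsigned int ra_flags, int pattern)
{
	return (ra_flags & ~READAHEAD_PATTERN) |
	       ((unsigned int)pattern << READAHEAD_PATTERN_SHIFT);
}

/* Extract the pattern back out of ra_flags. */
static int ra_pattern_bits(unsigned int ra_flags)
{
	return (ra_flags & READAHEAD_PATTERN) >> READAHEAD_PATTERN_SHIFT;
}
```

Because the field is masked on every set, the pattern can be overwritten on each readahead decision while bits like READAHEAD_THRASHED survive.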
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/fs.h | 32 ++++++++++++++++++++++++++++++++
include/linux/mm.h | 4 +++-
mm/filemap.c | 9 +++++++--
mm/readahead.c | 17 +++++++++++++----
4 files changed, 55 insertions(+), 7 deletions(-)
--- linux.orig/include/linux/fs.h 2010-02-02 21:50:52.000000000 +0800
+++ linux/include/linux/fs.h 2010-02-02 21:51:59.000000000 +0800
@@ -894,8 +894,40 @@ struct file_ra_state {
};
/* ra_flags bits */
+#define READAHEAD_PATTERN_SHIFT 20
+#define READAHEAD_PATTERN 0x00f00000
#define READAHEAD_MMAP_MISS 0x0000ffff /* cache misses for mmap access */
#define READAHEAD_THRASHED 0x10000000
+#define READAHEAD_MMAP 0x20000000
+
+/*
+ * Which policy decided to issue the current read-ahead IO?
+ */
+enum readahead_pattern {
+ RA_PATTERN_INITIAL,
+ RA_PATTERN_SUBSEQUENT,
+ RA_PATTERN_CONTEXT,
+ RA_PATTERN_THRASH,
+ RA_PATTERN_MMAP_AROUND,
+ RA_PATTERN_FADVISE,
+ RA_PATTERN_RANDOM,
+ RA_PATTERN_ALL, /* for summary stats */
+ RA_PATTERN_MAX
+};
+
+static inline int ra_pattern(int ra_flags)
+{
+ int pattern = (ra_flags & READAHEAD_PATTERN)
+ >> READAHEAD_PATTERN_SHIFT;
+
+ return min(pattern, RA_PATTERN_ALL);
+}
+
+static inline void ra_set_pattern(struct file_ra_state *ra, int pattern)
+{
+ ra->ra_flags = (ra->ra_flags & ~READAHEAD_PATTERN) |
+ (pattern << READAHEAD_PATTERN_SHIFT);
+}
/*
* Don't do ra_flags++ directly to avoid possible overflow:
--- linux.orig/mm/readahead.c 2010-02-02 21:51:53.000000000 +0800
+++ linux/mm/readahead.c 2010-02-02 21:52:01.000000000 +0800
@@ -291,7 +291,10 @@ unsigned long max_sane_readahead(unsigne
* Submit IO for the read-ahead request in file_ra_state.
*/
unsigned long ra_submit(struct file_ra_state *ra,
- struct address_space *mapping, struct file *filp)
+ struct address_space *mapping,
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long req_size)
{
int actual;
@@ -425,6 +428,7 @@ ondemand_readahead(struct address_space
* start of file
*/
if (!offset) {
+ ra_set_pattern(ra, RA_PATTERN_INITIAL);
ra->start = offset;
ra->size = get_init_ra_size(req_size, max);
ra->async_size = ra->size > req_size ?
@@ -445,6 +449,7 @@ ondemand_readahead(struct address_space
*/
if ((offset == (ra->start + ra->size - ra->async_size) ||
offset == (ra->start + ra->size))) {
+ ra_set_pattern(ra, RA_PATTERN_SUBSEQUENT);
ra->start += ra->size;
ra->size = get_next_ra_size(ra, max);
ra->async_size = ra->size;
@@ -455,6 +460,7 @@ ondemand_readahead(struct address_space
* oversize read, no need to query page cache
*/
if (req_size > max && !hit_readahead_marker) {
+ ra_set_pattern(ra, RA_PATTERN_INITIAL);
ra->start = offset;
ra->size = max;
ra->async_size = max;
@@ -500,8 +506,10 @@ context_readahead:
*/
if (!size && !hit_readahead_marker) {
if (!ra_thrashed(ra, offset)) {
+ ra_set_pattern(ra, RA_PATTERN_RANDOM);
ra->size = min(req_size, max);
} else {
+ ra_set_pattern(ra, RA_PATTERN_THRASH);
retain_inactive_pages(mapping, offset, min(2 * max,
ra->start + ra->size - offset));
ra->size = max_t(int, ra->size/2, MIN_READAHEAD_PAGES);
@@ -518,12 +526,13 @@ context_readahead:
if (size >= offset)
size *= 2;
/*
- * pages to readahead are already cached
+ * Pages to readahead are already cached?
*/
if (size <= start - offset)
return 0;
-
size -= start - offset;
+
+ ra_set_pattern(ra, RA_PATTERN_CONTEXT);
ra->start = start;
ra->size = clamp_t(unsigned int, size, MIN_READAHEAD_PAGES, max);
ra->async_size = min(ra->size, 1 + size / READAHEAD_ASYNC_RATIO);
@@ -539,7 +548,7 @@ readit:
ra->size += ra->async_size;
}
- return ra_submit(ra, mapping, filp);
+ return ra_submit(ra, mapping, filp, offset, req_size);
}
/**
--- linux.orig/include/linux/mm.h 2010-02-02 21:50:52.000000000 +0800
+++ linux/include/linux/mm.h 2010-02-02 21:51:59.000000000 +0800
@@ -1209,7 +1209,9 @@ void page_cache_async_readahead(struct a
unsigned long max_sane_readahead(unsigned long nr);
unsigned long ra_submit(struct file_ra_state *ra,
struct address_space *mapping,
- struct file *filp);
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long req_size);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux.orig/mm/filemap.c 2010-02-02 21:50:52.000000000 +0800
+++ linux/mm/filemap.c 2010-02-02 21:51:59.000000000 +0800
@@ -1413,6 +1413,7 @@ static void do_sync_mmap_readahead(struc
if (VM_SequentialReadHint(vma) ||
offset - 1 == (ra->prev_pos >> PAGE_CACHE_SHIFT)) {
+ ra->ra_flags |= READAHEAD_MMAP;
page_cache_sync_readahead(mapping, ra, file, offset,
ra->ra_pages);
return;
@@ -1431,10 +1432,12 @@ static void do_sync_mmap_readahead(struc
*/
ra_pages = max_sane_readahead(ra->ra_pages);
if (ra_pages) {
+ ra->ra_flags |= READAHEAD_MMAP;
+ ra_set_pattern(ra, RA_PATTERN_MMAP_AROUND);
ra->start = max_t(long, 0, offset - ra_pages/2);
ra->size = ra_pages;
ra->async_size = 0;
- ra_submit(ra, mapping, file);
+ ra_submit(ra, mapping, file, offset, 1);
}
}
@@ -1454,9 +1457,11 @@ static void do_async_mmap_readahead(stru
if (VM_RandomReadHint(vma))
return;
ra_mmap_miss_dec(ra);
- if (PageReadahead(page))
+ if (PageReadahead(page)) {
+ ra->ra_flags |= READAHEAD_MMAP;
page_cache_async_readahead(mapping, ra, file,
page, offset, ra->ra_pages);
+ }
}
/**
* [PATCH 08/11] readahead: add tracing event
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 4235 bytes --]
Example output:
# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
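In the lines above, `req=offset+req_size`, `ra=start+size-async_size`, and `async` indicates whether the window starts ahead of the requested offset (start > offset). A sketch reproducing the tracepoint's format string in ordinary userspace C, with field values taken from the example output; this is illustrative code, not the tracepoint itself:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format one trace line the way the patch's TP_printk() does.
 * "async" is start > offset, i.e. whether the readahead was triggered
 * ahead of the page the application actually asked for.
 */
static void format_trace_line(char *buf, size_t len, const char *pattern,
			      int major, int minor, unsigned long ino,
			      unsigned long offset, unsigned long req_size,
			      unsigned long start, unsigned int size,
			      unsigned int async_size, unsigned int actual)
{
	snprintf(buf, len,
		 "readahead-%s(dev=%d:%d, ino=%lu, req=%lu+%lu, "
		 "ra=%lu+%u-%u, async=%d) = %u",
		 pattern, major, minor, ino, offset, req_size,
		 start, size, async_size, start > offset, actual);
}
```

The return value at the end of each line is the number of pages actually submitted, so `= 0` in the last example line means the whole window was already cached.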
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/readahead.h | 69 +++++++++++++++++++++++++++++
mm/readahead.c | 22 +++++++++
2 files changed, 91 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/trace/events/readahead.h 2010-02-01 21:58:48.000000000 +0800
@@ -0,0 +1,69 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM readahead
+
+#if !defined(_TRACE_READAHEAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_READAHEAD_H
+
+#include <linux/tracepoint.h>
+
+extern const char * const ra_pattern_names[];
+
+/*
+ * Tracepoint for one readahead window submission.
+ */
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size,
+ ra_flags, start, size, async_size, actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->pattern = ra_pattern(ra_flags);
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ ra_pattern_names[__entry->pattern],
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_READAHEAD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux.orig/mm/readahead.c 2010-02-01 21:55:43.000000000 +0800
+++ linux/mm/readahead.c 2010-02-01 21:57:25.000000000 +0800
@@ -19,11 +19,25 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/readahead.h>
+
/*
* Set async size to 1/READAHEAD_ASYNC_RATIO of the thrashing threshold.
*/
#define READAHEAD_ASYNC_RATIO 8
+const char * const ra_pattern_names[] = {
+ [RA_PATTERN_INITIAL] = "initial",
+ [RA_PATTERN_SUBSEQUENT] = "subsequent",
+ [RA_PATTERN_CONTEXT] = "context",
+ [RA_PATTERN_THRASH] = "thrash",
+ [RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_FADVISE] = "fadvise",
+ [RA_PATTERN_RANDOM] = "random",
+ [RA_PATTERN_ALL] = "all",
+};
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -274,6 +288,11 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ trace_readahead(mapping, offset, nr_to_read,
+ RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+ offset, nr_to_read, 0, ret);
+
return ret;
}
@@ -301,6 +320,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ trace_readahead(mapping, offset, req_size, ra->ra_flags,
+ ra->start, ra->size, ra->async_size, actual);
+
return actual;
}
* [PATCH 08/11] readahead: add tracing event
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 4460 bytes --]
Example output:
# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/readahead.h | 69 +++++++++++++++++++++++++++++
mm/readahead.c | 22 +++++++++
2 files changed, 91 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/trace/events/readahead.h 2010-02-01 21:58:48.000000000 +0800
@@ -0,0 +1,69 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM readahead
+
+#if !defined(_TRACE_READAHEAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_READAHEAD_H
+
+#include <linux/tracepoint.h>
+
+extern const char * const ra_pattern_names[];
+
+/*
+ * Tracepoint for guest mode entry.
+ */
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size,
+ ra_flags, start, size, async_size, actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->pattern = ra_pattern(ra_flags);
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ ra_pattern_names[__entry->pattern],
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_READAHEAD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux.orig/mm/readahead.c 2010-02-01 21:55:43.000000000 +0800
+++ linux/mm/readahead.c 2010-02-01 21:57:25.000000000 +0800
@@ -19,11 +19,25 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/readahead.h>
+
/*
* Set async size to 1/# of the thrashing threshold.
*/
#define READAHEAD_ASYNC_RATIO 8
+const char * const ra_pattern_names[] = {
+ [RA_PATTERN_INITIAL] = "initial",
+ [RA_PATTERN_SUBSEQUENT] = "subsequent",
+ [RA_PATTERN_CONTEXT] = "context",
+ [RA_PATTERN_THRASH] = "thrash",
+ [RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_FADVISE] = "fadvise",
+ [RA_PATTERN_RANDOM] = "random",
+ [RA_PATTERN_ALL] = "all",
+};
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -274,6 +288,11 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ trace_readahead(mapping, offset, nr_to_read,
+ RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+ offset, nr_to_read, 0, ret);
+
return ret;
}
@@ -301,6 +320,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ trace_readahead(mapping, offset, req_size, ra->ra_flags,
+ ra->start, ra->size, ra->async_size, actual);
+
return actual;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 08/11] readahead: add tracing event
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-tracer.patch --]
[-- Type: text/plain, Size: 4460 bytes --]
Example output:
# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/readahead.h | 69 +++++++++++++++++++++++++++++
mm/readahead.c | 22 +++++++++
2 files changed, 91 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/trace/events/readahead.h 2010-02-01 21:58:48.000000000 +0800
@@ -0,0 +1,69 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM readahead
+
+#if !defined(_TRACE_READAHEAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_READAHEAD_H
+
+#include <linux/tracepoint.h>
+
+extern const char * const ra_pattern_names[];
+
+/*
+ * Tracepoint for page cache readahead.
+ */
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size,
+ ra_flags, start, size, async_size, actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->pattern = ra_pattern(ra_flags);
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ ra_pattern_names[__entry->pattern],
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_READAHEAD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux.orig/mm/readahead.c 2010-02-01 21:55:43.000000000 +0800
+++ linux/mm/readahead.c 2010-02-01 21:57:25.000000000 +0800
@@ -19,11 +19,25 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/readahead.h>
+
/*
* Set async size to 1/# of the thrashing threshold.
*/
#define READAHEAD_ASYNC_RATIO 8
+const char * const ra_pattern_names[] = {
+ [RA_PATTERN_INITIAL] = "initial",
+ [RA_PATTERN_SUBSEQUENT] = "subsequent",
+ [RA_PATTERN_CONTEXT] = "context",
+ [RA_PATTERN_THRASH] = "thrash",
+ [RA_PATTERN_MMAP_AROUND] = "around",
+ [RA_PATTERN_FADVISE] = "fadvise",
+ [RA_PATTERN_RANDOM] = "random",
+ [RA_PATTERN_ALL] = "all",
+};
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -274,6 +288,11 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ trace_readahead(mapping, offset, nr_to_read,
+ RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+ offset, nr_to_read, 0, ret);
+
return ret;
}
@@ -301,6 +320,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ trace_readahead(mapping, offset, req_size, ra->ra_flags,
+ ra->start, ra->size, ra->async_size, actual);
+
return actual;
}
* [PATCH 09/11] readahead: add /debug/readahead/stats
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Ingo Molnar, Peter Zijlstra, Wu Fengguang,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-stats.patch --]
[-- Type: text/plain, Size: 7891 bytes --]
Collect readahead stats when CONFIG_READAHEAD_STATS=y.
It is enabled by default because the added overhead is trivial:
two readahead_stats() calls per readahead.
Example output:
(taken from a freshly booted NFS-ROOT box with rsize=16k)
$ cat /debug/readahead/stats
pattern readahead eof_hit cache_hit io sync_io mmap_io size async_size io_size
initial 524 216 26 498 498 18 7 4 4
subsequent 181 80 1 130 13 60 25 25 24
context 94 28 3 85 64 8 7 2 5
thrash 0 0 0 0 0 0 0 0 0
around 162 121 33 162 162 162 60 0 21
fadvise 0 0 0 0 0 0 0 0 0
random 137 0 0 137 137 0 1 0 1
all 1098 445 63 1012 874 0 17 6 9
The two most important columns are:
- io         number of readahead IOs
- io_size    average readahead IO size
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/Kconfig | 13 +++
mm/readahead.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 188 insertions(+), 2 deletions(-)
--- linux.orig/mm/readahead.c 2010-02-01 21:55:46.000000000 +0800
+++ linux/mm/readahead.c 2010-02-01 21:57:07.000000000 +0800
@@ -38,6 +38,179 @@ const char * const ra_pattern_names[] =
[RA_PATTERN_ALL] = "all",
};
+#ifdef CONFIG_READAHEAD_STATS
+#include <linux/seq_file.h>
+#include <linux/debugfs.h>
+enum ra_account {
+ /* number of readaheads */
+ RA_ACCOUNT_COUNT, /* readahead request */
+ RA_ACCOUNT_EOF, /* readahead request contains/beyond EOF page */
+ RA_ACCOUNT_CHIT, /* readahead request covers some cached pages */
+ RA_ACCOUNT_IOCOUNT, /* readahead IO */
+ RA_ACCOUNT_SYNC, /* readahead IO that is synchronous */
+ RA_ACCOUNT_MMAP, /* readahead IO by mmap accesses */
+ /* number of readahead pages */
+ RA_ACCOUNT_SIZE, /* readahead size */
+ RA_ACCOUNT_ASIZE, /* readahead async size */
+ RA_ACCOUNT_ACTUAL, /* readahead actual IO size */
+ /* end mark */
+ RA_ACCOUNT_MAX,
+};
+
+static unsigned long ra_stats[RA_PATTERN_MAX][RA_ACCOUNT_MAX];
+
+static void readahead_stats(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ int actual)
+{
+ unsigned int pattern = ra_pattern(ra_flags);
+
+ ra_stats[pattern][RA_ACCOUNT_COUNT]++;
+ ra_stats[pattern][RA_ACCOUNT_SIZE] += size;
+ ra_stats[pattern][RA_ACCOUNT_ASIZE] += async_size;
+ ra_stats[pattern][RA_ACCOUNT_ACTUAL] += actual;
+
+ if (actual < size) {
+ if (start + size >
+ (i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT)
+ ra_stats[pattern][RA_ACCOUNT_EOF]++;
+ else
+ ra_stats[pattern][RA_ACCOUNT_CHIT]++;
+ }
+
+ if (!actual)
+ return;
+
+ ra_stats[pattern][RA_ACCOUNT_IOCOUNT]++;
+
+ if (start <= offset && start + size > offset)
+ ra_stats[pattern][RA_ACCOUNT_SYNC]++;
+
+ if (ra_flags & READAHEAD_MMAP)
+ ra_stats[pattern][RA_ACCOUNT_MMAP]++;
+}
+
+static int readahead_stats_show(struct seq_file *s, void *_)
+{
+ unsigned long i;
+ unsigned long count, iocount;
+
+ seq_printf(s, "%-10s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n",
+ "pattern",
+ "readahead", "eof_hit", "cache_hit",
+ "io", "sync_io", "mmap_io",
+ "size", "async_size", "io_size");
+
+ for (i = 0; i < RA_PATTERN_MAX; i++) {
+ count = ra_stats[i][RA_ACCOUNT_COUNT];
+ iocount = ra_stats[i][RA_ACCOUNT_IOCOUNT];
+ /*
+ * avoid division-by-zero
+ */
+ if (count == 0)
+ count = 1;
+ if (iocount == 0)
+ iocount = 1;
+
+ seq_printf(s, "%-10s %10lu %10lu %10lu %10lu %10lu %10lu "
+ "%10lu %10lu %10lu\n",
+ ra_pattern_names[i],
+ ra_stats[i][RA_ACCOUNT_COUNT],
+ ra_stats[i][RA_ACCOUNT_EOF],
+ ra_stats[i][RA_ACCOUNT_CHIT],
+ ra_stats[i][RA_ACCOUNT_IOCOUNT],
+ ra_stats[i][RA_ACCOUNT_SYNC],
+ ra_stats[i][RA_ACCOUNT_MMAP],
+ ra_stats[i][RA_ACCOUNT_SIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ASIZE] / count,
+ ra_stats[i][RA_ACCOUNT_ACTUAL] / iocount);
+ }
+
+ return 0;
+}
+
+static int readahead_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, readahead_stats_show, NULL);
+}
+
+static ssize_t readahead_stats_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ memset(ra_stats, 0, sizeof(ra_stats));
+ return size;
+}
+
+static struct file_operations readahead_stats_fops = {
+ .owner = THIS_MODULE,
+ .open = readahead_stats_open,
+ .write = readahead_stats_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static struct dentry *ra_debug_root;
+
+static int debugfs_create_readahead(void)
+{
+ struct dentry *debugfs_stats;
+
+ ra_debug_root = debugfs_create_dir("readahead", NULL);
+ if (!ra_debug_root)
+ goto out;
+
+ debugfs_stats = debugfs_create_file("stats", 0644, ra_debug_root,
+ NULL, &readahead_stats_fops);
+ if (!debugfs_stats)
+ goto out;
+
+ return 0;
+out:
+ printk(KERN_ERR "readahead: failed to create debugfs entries\n");
+ return -ENOMEM;
+}
+
+static int __init readahead_init(void)
+{
+ debugfs_create_readahead();
+ return 0;
+}
+
+static void __exit readahead_exit(void)
+{
+ debugfs_remove_recursive(ra_debug_root);
+}
+
+module_init(readahead_init);
+module_exit(readahead_exit);
+#endif
+
+static void readahead_event(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual)
+{
+#ifdef CONFIG_READAHEAD_STATS
+ readahead_stats(mapping, offset, req_size, ra_flags,
+ start, size, async_size, actual);
+ readahead_stats(mapping, offset, req_size,
+ RA_PATTERN_ALL << READAHEAD_PATTERN_SHIFT,
+ start, size, async_size, actual);
+#endif
+ trace_readahead(mapping, offset, req_size, ra_flags,
+ start, size, async_size, actual);
+}
+
/*
* Initialise a struct file's readahead state. Assumes that the caller has
* memset *ra to zero.
@@ -289,7 +462,7 @@ int force_page_cache_readahead(struct ad
nr_to_read -= this_chunk;
}
- trace_readahead(mapping, offset, nr_to_read,
+ readahead_event(mapping, offset, nr_to_read,
RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
offset, nr_to_read, 0, ret);
@@ -320,7 +493,7 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
- trace_readahead(mapping, offset, req_size, ra->ra_flags,
+ readahead_event(mapping, offset, req_size, ra->ra_flags,
ra->start, ra->size, ra->async_size, actual);
return actual;
--- linux.orig/mm/Kconfig 2010-02-01 21:55:28.000000000 +0800
+++ linux/mm/Kconfig 2010-02-01 21:55:49.000000000 +0800
@@ -283,3 +283,16 @@ config NOMMU_INITIAL_TRIM_EXCESS
of 1 says that all excess pages should be trimmed.
See Documentation/nommu-mmap.txt for more information.
+
+config READAHEAD_STATS
+ bool "Collect page-cache readahead stats"
+ depends on DEBUG_FS
+ default y
+ help
+ Enable readahead events accounting. Usage:
+
+ # mount -t debugfs none /debug
+
+ # echo > /debug/readahead/stats # reset counters
+ # do benchmarks
+ # cat /debug/readahead/stats # check counters
* [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Linus Torvalds, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2035 bytes --]
Some applications (e.g. blkid, id3tool) seek around a file to pick
out bits of information. For example, blkid does
seek to 0
read 1024
seek to 1536
read 16384
The start-of-file readahead heuristic is wrong for such workloads,
whose access pattern is betrayed by the lseek() calls.
So test-and-set a READAHEAD_LSEEK flag on lseek() and skip the
start-of-file readahead once it is seen. Proposed by Linus.
CC: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/read_write.c | 3 +++
include/linux/fs.h | 1 +
mm/readahead.c | 5 +++++
3 files changed, 9 insertions(+)
--- linux.orig/mm/readahead.c 2010-02-02 21:52:19.000000000 +0800
+++ linux/mm/readahead.c 2010-02-02 21:52:32.000000000 +0800
@@ -625,6 +625,11 @@ ondemand_readahead(struct address_space
if (!offset) {
ra_set_pattern(ra, RA_PATTERN_INITIAL);
ra->start = offset;
+ if ((ra->ra_flags & READAHEAD_LSEEK) && req_size <= max) {
+ ra->size = req_size;
+ ra->async_size = 0;
+ goto readit;
+ }
ra->size = get_init_ra_size(req_size, max);
ra->async_size = ra->size > req_size ?
ra->size - req_size : ra->size;
--- linux.orig/fs/read_write.c 2010-02-02 21:50:51.000000000 +0800
+++ linux/fs/read_write.c 2010-02-02 21:53:04.000000000 +0800
@@ -71,6 +71,9 @@ generic_file_llseek_unlocked(struct file
file->f_version = 0;
}
+ if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
+ file->f_ra.ra_flags |= READAHEAD_LSEEK;
+
return offset;
}
EXPORT_SYMBOL(generic_file_llseek_unlocked);
--- linux.orig/include/linux/fs.h 2010-02-02 21:52:19.000000000 +0800
+++ linux/include/linux/fs.h 2010-02-02 21:52:19.000000000 +0800
@@ -899,6 +899,7 @@ struct file_ra_state {
#define READAHEAD_MMAP_MISS 0x0000ffff /* cache misses for mmap access */
#define READAHEAD_THRASHED 0x10000000
#define READAHEAD_MMAP 0x20000000
+#define READAHEAD_LSEEK 0x40000000 /* be conservative after lseek() */
/*
* Which policy makes decision to do the current read-ahead IO?
* [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
@ 2010-02-02 15:28 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Linus Torvalds, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-lseek.patch --]
[-- Type: text/plain, Size: 2260 bytes --]
Some applications (eg. blkid, id3tool etc.) seek around the file
to get information. For example, blkid does
seek to 0
read 1024
seek to 1536
read 16384
The start-of-file readahead heuristic is wrong for them, whose
access pattern can be identified by lseek() calls.
So test-and-set a READAHEAD_LSEEK flag on lseek() and don't
do start-of-file readahead on seeing it. Proposed by Linus.
CC: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/read_write.c | 3 +++
include/linux/fs.h | 1 +
mm/readahead.c | 5 +++++
3 files changed, 9 insertions(+)
--- linux.orig/mm/readahead.c 2010-02-02 21:52:19.000000000 +0800
+++ linux/mm/readahead.c 2010-02-02 21:52:32.000000000 +0800
@@ -625,6 +625,11 @@ ondemand_readahead(struct address_space
if (!offset) {
ra_set_pattern(ra, RA_PATTERN_INITIAL);
ra->start = offset;
+ if ((ra->ra_flags & READAHEAD_LSEEK) && req_size <= max) {
+ ra->size = req_size;
+ ra->async_size = 0;
+ goto readit;
+ }
ra->size = get_init_ra_size(req_size, max);
ra->async_size = ra->size > req_size ?
ra->size - req_size : ra->size;
--- linux.orig/fs/read_write.c 2010-02-02 21:50:51.000000000 +0800
+++ linux/fs/read_write.c 2010-02-02 21:53:04.000000000 +0800
@@ -71,6 +71,9 @@ generic_file_llseek_unlocked(struct file
file->f_version = 0;
}
+ if (!(file->f_ra.ra_flags & READAHEAD_LSEEK))
+ file->f_ra.ra_flags |= READAHEAD_LSEEK;
+
return offset;
}
EXPORT_SYMBOL(generic_file_llseek_unlocked);
--- linux.orig/include/linux/fs.h 2010-02-02 21:52:19.000000000 +0800
+++ linux/include/linux/fs.h 2010-02-02 21:52:19.000000000 +0800
@@ -899,6 +899,7 @@ struct file_ra_state {
#define READAHEAD_MMAP_MISS 0x0000ffff /* cache misses for mmap access */
#define READAHEAD_THRASHED 0x10000000
#define READAHEAD_MMAP 0x20000000
+#define READAHEAD_LSEEK 0x40000000 /* be conservative after lseek() */
/*
* Which policy makes decision to do the current read-ahead IO?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
^ permalink raw reply [flat|nested] 83+ messages in thread
* [PATCH 11/11] radixtree: speed up next/prev hole search
2010-02-02 15:28 ` Wu Fengguang
@ 2010-02-02 15:28 ` Wu Fengguang
-1 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-02 15:28 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Nick Piggin, Wu Fengguang, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: radixtree-scan-hole-fast.patch --]
[-- Type: text/plain, Size: 3404 bytes --]
Replace the hole scan functions with faster versions:
- radix_tree_next_hole(root, index, max_scan)
- radix_tree_prev_hole(root, index, max_scan)
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
lib/radix-tree.c | 85 +++++++++++++++++++++++++++++++++++++++------
1 file changed, 74 insertions(+), 11 deletions(-)
--- linux.orig/lib/radix-tree.c 2010-01-09 21:45:16.000000000 +0800
+++ linux/lib/radix-tree.c 2010-01-21 22:04:22.000000000 +0800
@@ -609,6 +609,24 @@ int radix_tree_tag_get(struct radix_tree
}
EXPORT_SYMBOL(radix_tree_tag_get);
+/*
+ * Find the bottom radix tree node that contains @index.
+ * Return NULL if @index is a hole, or lies in the special root node.
+ */
+static struct radix_tree_node *
+radix_tree_lookup_node(struct radix_tree_root *root, unsigned long index)
+{
+ void *slot;
+
+ slot = radix_tree_lookup_element(root, index, 1);
+ if (!slot || slot == &root->rnode)
+ return NULL;
+
+ slot -= (index & RADIX_TREE_MAP_MASK) * sizeof(void *);
+
+ return container_of(slot, struct radix_tree_node, slots);
+}
+
/**
* radix_tree_next_hole - find the next hole (not-present entry)
* @root: tree root
@@ -630,18 +648,41 @@ EXPORT_SYMBOL(radix_tree_tag_get);
* under rcu_read_lock.
*/
unsigned long radix_tree_next_hole(struct radix_tree_root *root,
- unsigned long index, unsigned long max_scan)
+ unsigned long index, unsigned long max_scan)
{
- unsigned long i;
+ struct radix_tree_node *node;
+ unsigned long origin = index;
+ int i;
+
+ node = rcu_dereference(root->rnode);
+ if (node == NULL)
+ return index;
+
+ if (!radix_tree_is_indirect_ptr(node))
+ return index ? index : 1;
- for (i = 0; i < max_scan; i++) {
- if (!radix_tree_lookup(root, index))
+ while (index - origin < max_scan) {
+ node = radix_tree_lookup_node(root, index);
+ if (!node)
break;
- index++;
- if (index == 0)
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index | RADIX_TREE_MAP_MASK) + 1;
+ goto check_overflow;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK;
+ i < RADIX_TREE_MAP_SIZE;
+ i++, index++)
+ if (rcu_dereference(node->slots[i]) == NULL)
+ goto out;
+
+check_overflow:
+ if (unlikely(index == 0))
break;
}
+out:
return index;
}
EXPORT_SYMBOL(radix_tree_next_hole);
@@ -669,16 +710,38 @@ EXPORT_SYMBOL(radix_tree_next_hole);
unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan)
{
- unsigned long i;
+ struct radix_tree_node *node;
+ unsigned long origin = index;
+ int i;
+
+ node = rcu_dereference(root->rnode);
+ if (node == NULL)
+ return index;
+
+ if (!radix_tree_is_indirect_ptr(node))
+ return index ? index : ULONG_MAX;
- for (i = 0; i < max_scan; i++) {
- if (!radix_tree_lookup(root, index))
+ while (origin - index < max_scan) {
+ node = radix_tree_lookup_node(root, index);
+ if (!node)
break;
- index--;
- if (index == LONG_MAX)
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index - RADIX_TREE_MAP_SIZE) |
+ RADIX_TREE_MAP_MASK;
+ goto check_underflow;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--)
+ if (rcu_dereference(node->slots[i]) == NULL)
+ goto out;
+
+check_underflow:
+ if (unlikely(index == ULONG_MAX))
break;
}
+out:
return index;
}
EXPORT_SYMBOL(radix_tree_prev_hole);
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 15:28 ` Wu Fengguang
@ 2010-02-02 17:39 ` Linus Torvalds
-1 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2010-02-02 17:39 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, 2 Feb 2010, Wu Fengguang wrote:
>
> Some applications (eg. blkid, id3tool etc.) seek around the file
> to get information. For example, blkid does
> seek to 0
> read 1024
> seek to 1536
> read 16384
>
> The start-of-file readahead heuristic is wrong for them, whose
> access pattern can be identified by lseek() calls.
>
> So test-and-set a READAHEAD_LSEEK flag on lseek() and don't
> do start-of-file readahead on seeing it. Proposed by Linus.
>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 15:28 ` Wu Fengguang
@ 2010-02-02 18:13 ` Olivier Galibert
-1 siblings, 0 replies; 83+ messages in thread
From: Olivier Galibert @ 2010-02-02 18:13 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Linus Torvalds, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, Feb 02, 2010 at 11:28:45PM +0800, Wu Fengguang wrote:
> Some applications (eg. blkid, id3tool etc.) seek around the file
> to get information. For example, blkid does
> seek to 0
> read 1024
> seek to 1536
> read 16384
>
> The start-of-file readahead heuristic is wrong for them, whose
> access pattern can be identified by lseek() calls.
>
> So test-and-set a READAHEAD_LSEEK flag on lseek() and don't
> do start-of-file readahead on seeing it. Proposed by Linus.
Wouldn't that trigger on lseeks to end of file to get the size?
OG.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 18:13 ` Olivier Galibert
@ 2010-02-02 18:40 ` Linus Torvalds
-1 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2010-02-02 18:40 UTC (permalink / raw)
To: Olivier Galibert
Cc: Wu Fengguang, Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, 2 Feb 2010, Olivier Galibert wrote:
>
> Wouldn't that trigger on lseeks to end of file to get the size?
Well, you'd only ever do that with a raw block device, no (if even that:
more "raw block device" tools just use the BLKGETSIZE64 ioctl etc.)? Any sane
regular file accessor will do 'fstat()' instead.
And do we care about startup speed of ramping up read-ahead from the
beginning? In fact, the problem case that caused this was literally
'blkid' on a block device - and the fact that the kernel tried to
read-ahead TOO MUCH rather than too little.
If somebody is really doing lots of serial reading, the read-ahead code
will figure it out very quickly. The case this worries about is just the
_first_ read, where the question is one of "do we think it might be
seeking around, or does it look like the user is going to just read the
whole thing"?
IOW, if you start off with a SEEK_END, I think it's reasonable to expect
it to _not_ read the whole thing.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 18:40 ` Linus Torvalds
@ 2010-02-02 18:48 ` Olivier Galibert
-1 siblings, 0 replies; 83+ messages in thread
From: Olivier Galibert @ 2010-02-02 18:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Wu Fengguang, Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, Feb 02, 2010 at 10:40:41AM -0800, Linus Torvalds wrote:
> IOW, if you start off with a SEEK_END, I think it's reasonable to expect
> it to _not_ read the whole thing.
I've seen a lot of:

    int fd = open(...);
    size = lseek(fd, 0, SEEK_END);
    lseek(fd, 0, SEEK_SET);

    data = malloc(size);
    read(fd, data, size);
    close(fd);
Why not fstat? I don't know. Perhaps a case of cargo culting,
perhaps a case of "other unixes suck for portability"[1]. But it's
probably still there a lot in real code.
OG.
[1] In the hpux, dgux, sunos, etc sense. Not to be taken as a comment
on modern BSDs.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 18:48 ` Olivier Galibert
@ 2010-02-02 19:14 ` Linus Torvalds
-1 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2010-02-02 19:14 UTC (permalink / raw)
To: Olivier Galibert
Cc: Wu Fengguang, Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, 2 Feb 2010, Olivier Galibert wrote:
>
> On Tue, Feb 02, 2010 at 10:40:41AM -0800, Linus Torvalds wrote:
> > IOW, if you start off with a SEEK_END, I think it's reasonable to expect
> > it to _not_ read the whole thing.
>
> I've seen a lot of:
> int fd = open(...);
> size = lseek(fd, 0, SEEK_END);
> lseek(fd, 0, SEEK_SET);
>
> data = malloc(size);
> read(fd, data, size);
> close(fd);
>
> Why not fstat? I don't know.
Well, the above will work perfectly with or without the patch, since it
does the read of the full size. There is no read-ahead hint necessary for
that kind of single read behavior.
Remember: read-ahead is about filling the empty IO spaces _between_ reads,
and turning many smaller reads into one bigger one. If you only have a
single big read, read-ahead cannot help.
Also, keep in mind that read-ahead is not always a win. It can be a huge
loss too. Which is why we have _heuristics_. They fundamentally cannot
catch every case, but what they aim for is to do a good job on average.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 01/11] readahead: limit readahead size for small devices
2010-02-02 15:28 ` Wu Fengguang
@ 2010-02-02 19:38 ` Jens Axboe
-1 siblings, 0 replies; 83+ messages in thread
From: Jens Axboe @ 2010-02-02 19:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, LKML
On Tue, Feb 02 2010, Wu Fengguang wrote:
> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> on which blkid runs unpleasantly slow. He manages to optimize the blkid
> reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
>
> lseek 0, read 1024 => readahead 4 pages (start of file)
> lseek 1536, read 16384 => readahead 8 pages (page contiguous)
>
> The readahead heuristics involved here are reasonable ones in general.
> So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
>
> For the kernel part, Linus suggests:
> So maybe we could be less aggressive about read-ahead when the size of
> the device is small? Turning a 16kB read into a 64kB one is a big deal,
> when it's about 15% of the whole device!
>
> This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
>
> Given that the non-rotational attribute is not always reported, we can
> take disk size as a max readahead size hint. We use a formula that
> generates the following concrete limits:
>
> disk size readahead size
> (scale by 4) (scale by 2)
> 2M 4k
> 8M 8k
> 32M 16k
> 128M 32k
> 512M 64k
> 2G 128k
> 8G 256k
> 32G 512k
> 128G 1024k
I'm not sure the size part makes a ton of sense. You can have really
fast small devices, and large slow devices. One real-world example is
the Sun FMod SSD, which is only 22GB in size but faster than the
Intel X25-E SLC disks.
What makes it even worse for these devices is that they are often
attached to fatter controllers than ahci, where command overhead is
larger.
Running your script on such a device yields (I enlarged the read-count
by 2, makes it more reproducible):
MARVELL SD88SA02 MP1F
rasize 1st 2nd
------------------------------------------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
So for that device, 1M-2M looks like the sweet spot, and it even needs
4-8M to reach full throughput.
I don't think this is atypical of bigger systems. Only very recently
have controllers started to slim down the command overhead for real,
because of SSD devices. What probably is atypical is a device that is
this small yet this fast.
--
Jens Axboe
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 19:14 ` Linus Torvalds
@ 2010-02-02 19:59 ` david
-1 siblings, 0 replies; 83+ messages in thread
From: david @ 2010-02-02 19:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Olivier Galibert, Wu Fengguang, Andrew Morton, Jens Axboe,
Peter Zijlstra, Linux Memory Management List, linux-fsdevel,
LKML
On Tue, 2 Feb 2010, Linus Torvalds wrote:
> Rememebr: read-ahead is about filling the empty IO spaces _between_ reads,
> and turning many smaller reads into one bigger one. If you only have a
> single big read, read-ahead cannot help.
>
> Also, keep in mind that read-ahead is not always a win. It can be a huge
> loss too. Which is why we have _heuristics_. They fundamentally cannot
> catch every case, but what they aim for is to do a good job on average.
As a note from the field, I just had an application that needed to be
changed because it did excessive read-ahead: it turned a 2 min reporting
run into a 20 min one, because for this report the access was really
random and the app forced large read-ahead.
David Lang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
2010-02-02 19:59 ` david
@ 2010-02-02 20:22 ` Linus Torvalds
-1 siblings, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2010-02-02 20:22 UTC (permalink / raw)
To: david
Cc: Olivier Galibert, Wu Fengguang, Andrew Morton, Jens Axboe,
Peter Zijlstra, Linux Memory Management List, linux-fsdevel,
LKML
On Tue, 2 Feb 2010, david@lang.hm wrote:
> On Tue, 2 Feb 2010, Linus Torvalds wrote:
> >
> > Also, keep in mind that read-ahead is not always a win. It can be a huge
> > loss too. Which is why we have _heuristics_. They fundamentally cannot
> > catch every case, but what they aim for is to do a good job on average.
>
> as a note from the field, I just had an application that needed to be changed
> because it did excessive read-ahead. it turned a 2 min reporting run into a 20
> min reporting run because for this report the access was really random and the
> app forced large read-ahead.
Yeah. And the reason Wu did this patch is similar: something that _should_
have taken just a quarter of a second took about 7 seconds, because
read-ahead triggered on this really slow device that only feeds about
15kB/s (yes, _kilo_byte, not megabyte).
You can always use POSIX_FADVISE_RANDOM to disable it, but it's seldom
something that people do. And there are real loads that have random
components to them without being _entirely_ random, so in an optimal world
we should just have heuristics that work well.
The problem is, it's often easier to test/debug the "good" cases, i.e. the
cases where we _want_ read-ahead to trigger. So that probably means that
we have a tendency to read-ahead too aggressively, because those cases are
the ones where people can most easily look at it and say "yeah, this
improves throughput of a 'dd bs=8192'".
So then when we find loads where read-ahead hurts, I think we need to take
_that_ case very seriously. Because otherwise our selection bias for
testing read-ahead will fail.
Linus
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-02 15:28 ` Wu Fengguang
(?)
@ 2010-02-02 22:38 ` Vivek Goyal
-1 siblings, 0 replies; 83+ messages in thread
From: Vivek Goyal @ 2010-02-02 22:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, Wu Fengguang, LKML
On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> Andrew,
>
> This is to lift default readahead size to 512KB, which I believe yields
> more I/O throughput without noticeably increasing I/O latency for today's HDD.
>
Hi Fengguang,
I was doing a quick test with the patches, using fio to run some
sequential reader threads. I have got access to one LUN from an HP
EVA. In my case it looks like throughput has come down with the patches.
Following are the results.
Kernel=2.6.33-rc5 Workload=bsr iosched=cfq Filesz=1G bs=32K
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 141768 130965 0 0
bsr 3 2 131979 135402 0 0
bsr 3 4 132351 420733 0 0
bsr 3 8 133152 455434 0 0
bsr 3 16 130316 674499 0 0
Kernel=2.6.33-rc5-readahead Workload=bsr iosched=cfq Filesz=1G bs=32K
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 84749.3 53213 0 0
bsr 3 2 83189.7 157473 0 0
bsr 3 4 77583.3 330030 0 0
bsr 3 8 88545.7 378201 0 0
bsr 3 16 95331.7 482657 0 0
I ran an increasing number of sequential readers. The file system is ext3
and the file size is 1G.
I have run the tests 3 times (3 sets) and taken the average.
Thanks
Vivek
> For example, for a 100MB/s and 8ms access time HDD:
>
> io_size KB access_time transfer_time io_latency util% throughput KB/s IOPS
> 4 8 0.04 8.04 0.49% 497.57 124.39
> 8 8 0.08 8.08 0.97% 990.33 123.79
> 16 8 0.16 8.16 1.92% 1961.69 122.61
> 32 8 0.31 8.31 3.76% 3849.62 120.30
> 64 8 0.62 8.62 7.25% 7420.29 115.94
> 128 8 1.25 9.25 13.51% 13837.84 108.11
> 256 8 2.50 10.50 23.81% 24380.95 95.24
> 512 8 5.00 13.00 38.46% 39384.62 76.92
> 1024 8 10.00 18.00 55.56% 56888.89 55.56
> 2048 8 20.00 28.00 71.43% 73142.86 35.71
> 4096 8 40.00 48.00 83.33% 85333.33 20.83
>
> The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to ~39MB/s, while
> only increasing IO latency from 9.25ms to 13.00ms.
>
> As for SSDs, I find that the Intel X25-M SSD benefits from a large readahead size
> even for sequential reads (the first patch has benchmark details):
>
> rasize first run time/throughput second run time/throughput
> ------------------------------------------------------------------
> 4k 3.40038 s, 123 MB/s 3.42842 s, 122 MB/s
> 8k 2.7362 s, 153 MB/s 2.74528 s, 153 MB/s
> 16k 2.59808 s, 161 MB/s 2.58728 s, 162 MB/s
> 32k 2.50488 s, 167 MB/s 2.49138 s, 168 MB/s
> 64k 2.12861 s, 197 MB/s 2.13055 s, 197 MB/s
> 128k 1.92905 s, 217 MB/s 1.93176 s, 217 MB/s
> 256k 1.75896 s, 238 MB/s 1.78963 s, 234 MB/s
> 512k 1.67357 s, 251 MB/s 1.69112 s, 248 MB/s
> 1M 1.62115 s, 259 MB/s 1.63206 s, 257 MB/s
> 2M 1.56204 s, 269 MB/s 1.58854 s, 264 MB/s
> 4M 1.57949 s, 266 MB/s 1.57426 s, 266 MB/s
>
> As suggested by Linus, the default readahead size is decreased for small devices at the same time.
>
> [PATCH 01/11] readahead: limit readahead size for small devices
> [PATCH 02/11] readahead: bump up the default readahead size
> [PATCH 03/11] readahead: introduce {MAX|MIN}_READAHEAD_PAGES macros for ease of use
>
> The two other impacts of an enlarged readahead size are
>
> - memory footprint (caused by readahead miss)
> Sequential readahead hit ratio is pretty high regardless of max
> readahead size; the extra memory footprint is mainly caused by
> enlarged mmap read-around.
> I measured my desktop:
> - under Xwindow:
> 128KB readahead cache hit ratio = 143MB/230MB = 62%
> 512KB readahead cache hit ratio = 138MB/248MB = 55%
> - under console: (seems more stable than the Xwindow data)
> 128KB readahead cache hit ratio = 30MB/56MB = 53%
> 1MB readahead cache hit ratio = 30MB/59MB = 51%
> So the impact on memory footprint looks acceptable.
>
> - readahead thrashing
> It will now cost a 1MB readahead buffer per stream. Memory-tight systems
> typically do not run multiple streams; but if they do, it should
> help I/O performance as long as we can avoid thrashing, which can be
> achieved with the following patches.
>
> [PATCH 04/11] readahead: replace ra->mmap_miss with ra->ra_flags
> [PATCH 05/11] readahead: retain inactive lru pages to be accessed soon
> [PATCH 06/11] readahead: thrashing safe context readahead
>
> This is a major rewrite of the readahead algorithm, so I did careful tests with
> the following tracing/stats patches:
>
> [PATCH 07/11] readahead: record readahead patterns
> [PATCH 08/11] readahead: add tracing event
> [PATCH 09/11] readahead: add /debug/readahead/stats
>
> I verified the new readahead behavior on various access patterns,
> as well as stress tested the thrashing safety, by running 300 streams
> with mem=128M.
>
> Only 2031/61325 = 3.3% of the readahead windows are thrashed (due to workload
> variation):
>
> # cat /debug/readahead/stats
> pattern readahead eof_hit cache_hit io sync_io mmap_io size async_size io_size
> initial 20 9 4 20 20 12 73 37 35
> subsequent 3 3 0 1 0 1 8 8 1
> context 61325 1 5479 61325 6788 5 14 2 13
> thrash 2031 0 1222 2031 2031 0 9 0 6
> around 235 90 142 235 235 235 60 0 19
> fadvise 0 0 0 0 0 0 0 0 0
> random 223 133 0 91 91 1 1 0 1
> all 63837 236 6847 63703 9165 0 14 2 13
>
> And the readahead inside a single stream is working as expected:
>
> # grep streams-3162 /debug/tracing/trace
> streams-3162 [000] 8602.455953: readahead: readahead-context(dev=0:2, ino=0, req=287352+1, ra=287354+10-2, async=1) = 10
> streams-3162 [000] 8602.907873: readahead: readahead-context(dev=0:2, ino=0, req=287362+1, ra=287364+20-3, async=1) = 20
> streams-3162 [000] 8604.027879: readahead: readahead-context(dev=0:2, ino=0, req=287381+1, ra=287384+14-2, async=1) = 14
> streams-3162 [000] 8604.754722: readahead: readahead-context(dev=0:2, ino=0, req=287396+1, ra=287398+10-2, async=1) = 10
> streams-3162 [000] 8605.191228: readahead: readahead-context(dev=0:2, ino=0, req=287406+1, ra=287408+18-3, async=1) = 18
> streams-3162 [000] 8606.831895: readahead: readahead-context(dev=0:2, ino=0, req=287423+1, ra=287426+12-2, async=1) = 12
> streams-3162 [000] 8606.919614: readahead: readahead-thrash(dev=0:2, ino=0, req=287425+1, ra=287425+8-0, async=0) = 1
> streams-3162 [000] 8607.545016: readahead: readahead-context(dev=0:2, ino=0, req=287436+1, ra=287438+9-2, async=1) = 9
> streams-3162 [000] 8607.960039: readahead: readahead-context(dev=0:2, ino=0, req=287445+1, ra=287447+18-3, async=1) = 18
> streams-3162 [000] 8608.790973: readahead: readahead-context(dev=0:2, ino=0, req=287462+1, ra=287465+21-3, async=1) = 21
> streams-3162 [000] 8609.763138: readahead: readahead-context(dev=0:2, ino=0, req=287483+1, ra=287486+15-2, async=1) = 15
> streams-3162 [000] 8611.467401: readahead: readahead-context(dev=0:2, ino=0, req=287499+1, ra=287501+11-2, async=1) = 11
> streams-3162 [000] 8642.512413: readahead: readahead-context(dev=0:2, ino=0, req=288053+1, ra=288056+10-2, async=1) = 10
> streams-3162 [000] 8643.246618: readahead: readahead-context(dev=0:2, ino=0, req=288064+1, ra=288066+22-3, async=1) = 22
> streams-3162 [000] 8644.278613: readahead: readahead-context(dev=0:2, ino=0, req=288085+1, ra=288088+16-3, async=1) = 16
> streams-3162 [000] 8644.395782: readahead: readahead-context(dev=0:2, ino=0, req=288087+1, ra=288087+21-3, async=0) = 5
> streams-3162 [000] 8645.109918: readahead: readahead-context(dev=0:2, ino=0, req=288101+1, ra=288108+8-1, async=1) = 8
> streams-3162 [000] 8645.285078: readahead: readahead-context(dev=0:2, ino=0, req=288105+1, ra=288116+8-1, async=1) = 8
> streams-3162 [000] 8645.731794: readahead: readahead-context(dev=0:2, ino=0, req=288115+1, ra=288122+14-2, async=1) = 13
> streams-3162 [000] 8646.114250: readahead: readahead-context(dev=0:2, ino=0, req=288123+1, ra=288136+8-1, async=1) = 8
> streams-3162 [000] 8646.626320: readahead: readahead-context(dev=0:2, ino=0, req=288134+1, ra=288144+16-3, async=1) = 16
> streams-3162 [000] 8647.035721: readahead: readahead-context(dev=0:2, ino=0, req=288143+1, ra=288160+10-2, async=1) = 10
> streams-3162 [000] 8647.693082: readahead: readahead-context(dev=0:2, ino=0, req=288157+1, ra=288165+12-2, async=1) = 8
> streams-3162 [000] 8648.221368: readahead: readahead-context(dev=0:2, ino=0, req=288168+1, ra=288177+15-2, async=1) = 15
> streams-3162 [000] 8649.280800: readahead: readahead-context(dev=0:2, ino=0, req=288190+1, ra=288192+23-3, async=1) = 23
> [...]
>
> Btw, Linus suggested disabling start-of-file readahead if lseek() has been called:
>
> [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
>
> Finally, the updated context readahead will do more radix tree scans, so we need
> to optimize radix_tree_prev_hole():
>
> [PATCH 11/11] radixtree: speed up next/prev hole search
>
> It will on average reduce 8*64 level-0 slot searches to 32 level-0 slot
> searches plus 8 level-1 node searches.
>
> Thanks,
> Fengguang
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
@ 2010-02-02 22:38 ` Vivek Goyal
0 siblings, 0 replies; 83+ messages in thread
From: Vivek Goyal @ 2010-02-02 22:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> Andrew,
>
> This is to lift default readahead size to 512KB, which I believe yields
> more I/O throughput without noticeably increasing I/O latency for today's HDD.
>
Hi Fengguang,
I was doing a quick test with the patches. I was using fio to run some
sequential reader threads. I have got one access to one Lun from an HP
EVA. In my case it looks like with the patches throughput has come down.
Folllowing are the results.
Kernel=2.6.33-rc5 Workload=bsr iosched=cfq Filesz=1G bs=32K
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 141768 130965 0 0
bsr 3 2 131979 135402 0 0
bsr 3 4 132351 420733 0 0
bsr 3 8 133152 455434 0 0
bsr 3 16 130316 674499 0 0
Kernel=2.6.33-rc5-readahead Workload=bsr iosched=cfq Filesz=1G bs=32K
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 84749.3 53213 0 0
bsr 3 2 83189.7 157473 0 0
bsr 3 4 77583.3 330030 0 0
bsr 3 8 88545.7 378201 0 0
bsr 3 16 95331.7 482657 0 0
I run increasing number of sequential readers. File system is ext3 and
filesize is 1G.
I have run the tests 3 times (3sets) and taken the average of it.
Thanks
Vivek
> For example, for a 100MB/s and 8ms access time HDD:
>
> io_size KB access_time transfer_time io_latency util% throughput KB/s IOPS
> 4 8 0.04 8.04 0.49% 497.57 124.39
> 8 8 0.08 8.08 0.97% 990.33 123.79
> 16 8 0.16 8.16 1.92% 1961.69 122.61
> 32 8 0.31 8.31 3.76% 3849.62 120.30
> 64 8 0.62 8.62 7.25% 7420.29 115.94
> 128 8 1.25 9.25 13.51% 13837.84 108.11
> 256 8 2.50 10.50 23.81% 24380.95 95.24
> 512 8 5.00 13.00 38.46% 39384.62 76.92
> 1024 8 10.00 18.00 55.56% 56888.89 55.56
> 2048 8 20.00 28.00 71.43% 73142.86 35.71
> 4096 8 40.00 48.00 83.33% 85333.33 20.83
>
> The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to ~39MB/s, while
> merely increases IO latency from 9.25ms to 13.00ms.
>
> As for SSD, I find that Intel X25-M SSD desires large readahead size
> even for sequential reads (the first patch has benchmark details):
>
> rasize first run time/throughput second run time/throughput
> ------------------------------------------------------------------
> 4k 3.40038 s, 123 MB/s 3.42842 s, 122 MB/s
> 8k 2.7362 s, 153 MB/s 2.74528 s, 153 MB/s
> 16k 2.59808 s, 161 MB/s 2.58728 s, 162 MB/s
> 32k 2.50488 s, 167 MB/s 2.49138 s, 168 MB/s
> 64k 2.12861 s, 197 MB/s 2.13055 s, 197 MB/s
> 128k 1.92905 s, 217 MB/s 1.93176 s, 217 MB/s
> 256k 1.75896 s, 238 MB/s 1.78963 s, 234 MB/s
> 512k 1.67357 s, 251 MB/s 1.69112 s, 248 MB/s
> 1M 1.62115 s, 259 MB/s 1.63206 s, 257 MB/s
> 2M 1.56204 s, 269 MB/s 1.58854 s, 264 MB/s
> 4M 1.57949 s, 266 MB/s 1.57426 s, 266 MB/s
>
> As suggested by Linus, decrease default readahead size for small devices at the same time.
>
> [PATCH 01/11] readahead: limit readahead size for small devices
> [PATCH 02/11] readahead: bump up the default readahead size
> [PATCH 03/11] readahead: introduce {MAX|MIN}_READAHEAD_PAGES macros for ease of use
>
> The two other impacts of an enlarged readahead size are
>
> - memory footprint (caused by readahead miss)
> Sequential readahead hit ratio is pretty high regardless of max
> readahead size; the extra memory footprint is mainly caused by
> enlarged mmap read-around.
> I measured my desktop:
> - under Xwindow:
> 128KB readahead cache hit ratio = 143MB/230MB = 62%
> 512KB readahead cache hit ratio = 138MB/248MB = 55%
> - under console: (seems more stable than the Xwindow data)
> 128KB readahead cache hit ratio = 30MB/56MB = 53%
> 1MB readahead cache hit ratio = 30MB/59MB = 51%
> So the impact to memory footprint looks acceptable.
>
> - readahead thrashing
> It will now cost 1MB readahead buffer per stream. Memory tight systems
> typically do not run multiple streams; but if they do so, it should
> help I/O performance as long as we can avoid thrashing, which can be
> achieved with the following patches.
>
> [PATCH 04/11] readahead: replace ra->mmap_miss with ra->ra_flags
> [PATCH 05/11] readahead: retain inactive lru pages to be accessed soon
> [PATCH 06/11] readahead: thrashing safe context readahead
>
> This is a major rewrite of the readahead algorithm, so I did careful tests with
> the following tracing/stats patches:
>
> [PATCH 07/11] readahead: record readahead patterns
> [PATCH 08/11] readahead: add tracing event
> [PATCH 09/11] readahead: add /debug/readahead/stats
>
> I verified the new readahead behavior on various access patterns,
> as well as stress tested the thrashing safety, by running 300 streams
> with mem=128M.
>
> Only 2031/61325=3.3% readahead windows are thrashed (due to workload
> variation):
>
> # cat /debug/readahead/stats
> pattern readahead eof_hit cache_hit io sync_io mmap_io size async_size io_size
> initial 20 9 4 20 20 12 73 37 35
> subsequent 3 3 0 1 0 1 8 8 1
> context 61325 1 5479 61325 6788 5 14 2 13
> thrash 2031 0 1222 2031 2031 0 9 0 6
> around 235 90 142 235 235 235 60 0 19
> fadvise 0 0 0 0 0 0 0 0 0
> random 223 133 0 91 91 1 1 0 1
> all 63837 236 6847 63703 9165 0 14 2 13
>
> And the readahead inside a single stream is working as expected:
>
> # grep streams-3162 /debug/tracing/trace
> streams-3162 [000] 8602.455953: readahead: readahead-context(dev=0:2, ino=0, req=287352+1, ra=287354+10-2, async=1) = 10
> streams-3162 [000] 8602.907873: readahead: readahead-context(dev=0:2, ino=0, req=287362+1, ra=287364+20-3, async=1) = 20
> streams-3162 [000] 8604.027879: readahead: readahead-context(dev=0:2, ino=0, req=287381+1, ra=287384+14-2, async=1) = 14
> streams-3162 [000] 8604.754722: readahead: readahead-context(dev=0:2, ino=0, req=287396+1, ra=287398+10-2, async=1) = 10
> streams-3162 [000] 8605.191228: readahead: readahead-context(dev=0:2, ino=0, req=287406+1, ra=287408+18-3, async=1) = 18
> streams-3162 [000] 8606.831895: readahead: readahead-context(dev=0:2, ino=0, req=287423+1, ra=287426+12-2, async=1) = 12
> streams-3162 [000] 8606.919614: readahead: readahead-thrash(dev=0:2, ino=0, req=287425+1, ra=287425+8-0, async=0) = 1
> streams-3162 [000] 8607.545016: readahead: readahead-context(dev=0:2, ino=0, req=287436+1, ra=287438+9-2, async=1) = 9
> streams-3162 [000] 8607.960039: readahead: readahead-context(dev=0:2, ino=0, req=287445+1, ra=287447+18-3, async=1) = 18
> streams-3162 [000] 8608.790973: readahead: readahead-context(dev=0:2, ino=0, req=287462+1, ra=287465+21-3, async=1) = 21
> streams-3162 [000] 8609.763138: readahead: readahead-context(dev=0:2, ino=0, req=287483+1, ra=287486+15-2, async=1) = 15
> streams-3162 [000] 8611.467401: readahead: readahead-context(dev=0:2, ino=0, req=287499+1, ra=287501+11-2, async=1) = 11
> streams-3162 [000] 8642.512413: readahead: readahead-context(dev=0:2, ino=0, req=288053+1, ra=288056+10-2, async=1) = 10
> streams-3162 [000] 8643.246618: readahead: readahead-context(dev=0:2, ino=0, req=288064+1, ra=288066+22-3, async=1) = 22
> streams-3162 [000] 8644.278613: readahead: readahead-context(dev=0:2, ino=0, req=288085+1, ra=288088+16-3, async=1) = 16
> streams-3162 [000] 8644.395782: readahead: readahead-context(dev=0:2, ino=0, req=288087+1, ra=288087+21-3, async=0) = 5
> streams-3162 [000] 8645.109918: readahead: readahead-context(dev=0:2, ino=0, req=288101+1, ra=288108+8-1, async=1) = 8
> streams-3162 [000] 8645.285078: readahead: readahead-context(dev=0:2, ino=0, req=288105+1, ra=288116+8-1, async=1) = 8
> streams-3162 [000] 8645.731794: readahead: readahead-context(dev=0:2, ino=0, req=288115+1, ra=288122+14-2, async=1) = 13
> streams-3162 [000] 8646.114250: readahead: readahead-context(dev=0:2, ino=0, req=288123+1, ra=288136+8-1, async=1) = 8
> streams-3162 [000] 8646.626320: readahead: readahead-context(dev=0:2, ino=0, req=288134+1, ra=288144+16-3, async=1) = 16
> streams-3162 [000] 8647.035721: readahead: readahead-context(dev=0:2, ino=0, req=288143+1, ra=288160+10-2, async=1) = 10
> streams-3162 [000] 8647.693082: readahead: readahead-context(dev=0:2, ino=0, req=288157+1, ra=288165+12-2, async=1) = 8
> streams-3162 [000] 8648.221368: readahead: readahead-context(dev=0:2, ino=0, req=288168+1, ra=288177+15-2, async=1) = 15
> streams-3162 [000] 8649.280800: readahead: readahead-context(dev=0:2, ino=0, req=288190+1, ra=288192+23-3, async=1) = 23
> [...]
>
> btw, Linus suggested disabling start-of-file readahead if lseek() has been called:
>
> [PATCH 10/11] readahead: dont do start-of-file readahead after lseek()
>
> Finally, the updated context readahead will do more radix tree scans, so
> radix_tree_prev_hole() needs optimizing:
>
> [PATCH 11/11] radixtree: speed up next/prev hole search
>
> On average, it reduces 8*64 level-0 slot searches to 32 level-0 slot
> searches plus 8 level-1 node searches.
>
> Thanks,
> Fengguang
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-02 22:38 ` Vivek Goyal
@ 2010-02-02 23:17 ` Vivek Goyal
-1 siblings, 0 replies; 83+ messages in thread
From: Vivek Goyal @ 2010-02-02 23:17 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, Feb 02, 2010 at 05:38:03PM -0500, Vivek Goyal wrote:
> On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > This is to lift default readahead size to 512KB, which I believe yields
> > more I/O throughput without noticeably increasing I/O latency for today's HDD.
> >
>
> Hi Fengguang,
>
> I was doing a quick test with the patches. I was using fio to run some
> sequential reader threads. I have access to one LUN from an HP
> EVA. In my case, it looks like throughput has come down with the patches.
>
> Following are the results.
>
> Kernel=2.6.33-rc5 Workload=bsr iosched=cfq Filesz=1G bs=32K
> AVERAGE
> -------
> job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
> --- --- -- ------------ ----------- ------------- -----------
> bsr 3 1 141768 130965 0 0
> bsr 3 2 131979 135402 0 0
> bsr 3 4 132351 420733 0 0
> bsr 3 8 133152 455434 0 0
> bsr 3 16 130316 674499 0 0
>
> Kernel=2.6.33-rc5-readahead Workload=bsr iosched=cfq Filesz=1G bs=32K
> AVERAGE
> -------
> job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
> --- --- -- ------------ ----------- ------------- -----------
> bsr 3 1 84749.3 53213 0 0
> bsr 3 2 83189.7 157473 0 0
> bsr 3 4 77583.3 330030 0 0
> bsr 3 8 88545.7 378201 0 0
> bsr 3 16 95331.7 482657 0 0
>
> I ran an increasing number of sequential readers. The file system is ext3
> and the file size is 1G.
>
> I have run the tests 3 times (3 sets) and taken the average.
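(Vivek's exact fio job file is not shown; a buffered sequential-reader
("bsr") job matching the quoted parameters might look like the sketch
below. Everything besides bs/size is an assumption — adjust directory to a
real mount point.)

```ini
; sketch of a buffered sequential reader job (assumed layout)
[global]
rw=read             ; buffered sequential reads
bs=32k
size=1g
ioengine=sync
direct=0
directory=/mnt/test ; assumed mount point
group_reporting

[bsr]
numjobs=4           ; corresponds to the NR column in the tables
```

Running `fio bsr.fio` while sweeping numjobs would reproduce the NR=1..16
series.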
I ran the same test on a different piece of hardware. There are a few SATA
disks (5-6) in a striped configuration behind a hardware RAID controller.
Here I do see an improvement in sequential reader performance with the patches.
Kernel=2.6.33-rc5 Workload=bsr iosched=cfq Filesz=1G bs=32K
=========================================================================
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 147569 14369.7 0 0
bsr 3 2 124716 243932 0 0
bsr 3 4 123451 327665 0 0
bsr 3 8 122486 455102 0 0
bsr 3 16 117645 1.03957e+06 0 0
Kernel=2.6.33-rc5-readahead Workload=bsr iosched=cfq Filesz=1G bs=32K
=========================================================================
AVERAGE
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 160191 22752 0 0
bsr 3 2 149343 184698 0 0
bsr 3 4 147183 430875 0 0
bsr 3 8 144568 484045 0 0
bsr 3 16 137485 1.06257e+06 0 0
Vivek
>
> > [...]
* Re: [PATCH 01/11] readahead: limit readahead size for small devices
2010-02-02 19:38 ` Jens Axboe
@ 2010-02-03 6:13 ` Wu Fengguang
-1 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-03 6:13 UTC (permalink / raw)
To: Jens Axboe
Cc: Andrew Morton, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, LKML
On Wed, Feb 03, 2010 at 03:38:26AM +0800, Jens Axboe wrote:
> On Tue, Feb 02 2010, Wu Fengguang wrote:
> > Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> > on which blkid runs unpleasantly slow. He manages to optimize the blkid
> > reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
> >
> > lseek 0, read 1024 => readahead 4 pages (start of file)
> > lseek 1536, read 16384 => readahead 8 pages (page contiguous)
> >
> > The readahead heuristics involved here are reasonable ones in general.
> > So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
> >
> > For the kernel part, Linus suggests:
> > So maybe we could be less aggressive about read-ahead when the size of
> > the device is small? Turning a 16kB read into a 64kB one is a big deal,
> > when it's about 15% of the whole device!
> >
> > This looks reasonable: smaller devices tend to be slower (USB sticks as
> > well as micro/mobile/old hard disks).
> >
> > Given that the non-rotational attribute is not always reported, we can
> > take disk size as a max readahead size hint. We use a formula that
> > generates the following concrete limits:
> >
> > disk size readahead size
> > (scale by 4) (scale by 2)
> > 2M 4k
> > 8M 8k
> > 32M 16k
> > 128M 32k
> > 512M 64k
> > 2G 128k
> > 8G 256k
> > 32G 512k
> > 128G 1024k
>
> I'm not sure the size part makes a ton of sense. You can have really
> fast small devices, and large slow devices. One real world example are
> the Sun FMod SSD devices, which are only 22GB in size but are faster
> than the Intel X25-E SLC disks.
>
> What makes it even worse for these devices is that they are often
> attached to fatter controllers than ahci, where command overhead is
> larger.
Ah, good to know about this fast 22GB SSD.
> Running your script on such a device yields (I enlarged the read-count
> by 2, makes it more reproducible):
>
> MARVELL SD88SA02 MP1F
>
> rasize 1st 2nd
> ------------------------------------------------------------------
> 4k 41 MB/s 41 MB/s
> 16k 85 MB/s 81 MB/s
> 32k 102 MB/s 109 MB/s
> 64k 125 MB/s 144 MB/s
> 128k 183 MB/s 185 MB/s
> 256k 216 MB/s 216 MB/s
> 512k 216 MB/s 236 MB/s
> 1024k 251 MB/s 252 MB/s
> 2M 258 MB/s 258 MB/s
> 4M 266 MB/s 266 MB/s
> 8M 266 MB/s 266 MB/s
>
> So for that device, 1M-2M looks like the sweet spot, though it needs
> 4-8M to reach peak throughput.
Thanks for the data! I updated the formula to (16GB device => 1MB
readahead). However, the limit in this patch only takes effect for <4GB
devices, since the default readahead size is merely 512KB.
IOW, this patch only limits the default readahead size (which is now
512KB in general and 4MB for btrfs). The user can always set any
readahead size.
> I don't think this is atypical of bigger systems. Only very recently
> have controllers started to slim down the command overhead for real,
> because of SSD devices. What probably is atypical is a device that
> is this small yet pretty fast.
Right. I didn't expect such a small yet fast SSD.
Thanks,
Fengguang
---
readahead: limit readahead size for small devices
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slow. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
1G 256k
--------------------------- (*)
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 512k, this limit only takes
effect for devices whose size is less than 4G.
The formula is determined from the following data, collected by the script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula shall not limit the readahead size to a
degree that would impact any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to increase
its default readahead size to 2MB, so we don't take it seriously in the
formula.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
Does anyone have a 128MB USB stick? Anyway, you get satisfactory performance
with >= 64k readahead size :)
CC: Jens Axboe <jens.axboe@oracle.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
--- linux.orig/block/genhd.c 2010-02-02 21:58:09.000000000 +0800
+++ linux/block/genhd.c 2010-02-03 13:57:54.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,28 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * Limit default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 256M 128k
+ * 1G 256k
+ * ---------------------------
+ * 4G 512k
+ * 16G 1024k
+ * 64G 2048k
+ * 256G 4096k
+ * Since the default readahead size is 512k, this limit
+ * only takes effect for devices whose size is less than 4G.
+ */
+
+ size = get_capacity(disk) >> 9;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
}
EXPORT_SYMBOL(add_disk);
* Re: [PATCH 01/11] readahead: limit readahead size for small devices
@ 2010-02-03 6:13 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-03 6:13 UTC (permalink / raw)
To: Jens Axboe
Cc: Andrew Morton, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, LKML
On Wed, Feb 03, 2010 at 03:38:26AM +0800, Jens Axboe wrote:
> On Tue, Feb 02 2010, Wu Fengguang wrote:
> > Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> > on which blkid runs unpleasantly slow. He manages to optimize the blkid
> > reads down to 1kB+16kB, but still kernel read-ahead turns it into 48kB.
> >
> > lseek 0, read 1024 => readahead 4 pages (start of file)
> > lseek 1536, read 16384 => readahead 8 pages (page contiguous)
> >
> > The readahead heuristics involved here are reasonable ones in general.
> > So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
> >
> > For the kernel part, Linus suggests:
> > So maybe we could be less aggressive about read-ahead when the size of
> > the device is small? Turning a 16kB read into a 64kB one is a big deal,
> > when it's about 15% of the whole device!
> >
> > This looks reasonable: smaller device tend to be slower (USB sticks as
> > well as micro/mobile/old hard disks).
> >
> > Given that the non-rotational attribute is not always reported, we can
> > take disk size as a max readahead size hint. We use a formula that
> > generates the following concrete limits:
> >
> > disk size readahead size
> > (scale by 4) (scale by 2)
> > 2M 4k
> > 8M 8k
> > 32M 16k
> > 128M 32k
> > 512M 64k
> > 2G 128k
> > 8G 256k
> > 32G 512k
> > 128G 1024k
>
> I'm not sure the size part makes a ton of sense. You can have really
> fast small devices, and large slow devices. One real world example are
> the Sun FMod SSD devices, which are only 22GB in size but are faster
> than the Intel X25-E SLC disks.
>
> What makes it even worse for these devices is that they are often
> attached to fatter controllers than ahci, where command overhead is
> larger.
Ah, good to know about this fast 22GB SSD.
> Running your script on such a device yields (I enlarged the read-count
> by 2, makes it more reproducible):
>
> MARVELL SD88SA02 MP1F
>
> rasize 1st 2nd
> ------------------------------------------------------------------
> 4k 41 MB/s 41 MB/s
> 16k 85 MB/s 81 MB/s
> 32k 102 MB/s 109 MB/s
> 64k 125 MB/s 144 MB/s
> 128k 183 MB/s 185 MB/s
> 256k 216 MB/s 216 MB/s
> 512k 216 MB/s 236 MB/s
> 1024k 251 MB/s 252 MB/s
> 2M 258 MB/s 258 MB/s
> 4M 266 MB/s 266 MB/s
> 8M 266 MB/s 266 MB/s
>
> So for that device, 1M-2M looks like the sweet spot, with even needing
> 4-8M to fully reach full throughput.
Thanks for the data! I updated the formula to (16GB device => 1MB
readahead). However, the limit in this patch only takes effect for <4GB
devices, since the default readahead size is merely 512KB.
IOW, this patch only limits the default readahead size (which is now
512KB in general and 4MB for btrfs). The user can always set a different
readahead size.
> I don't think this is atypical of bigger systems. Only very recently
> have controller started to slim down the command overhead for real,
> because of the SSD devices. What probably is atypical is a device that
> is this small yet pretty fast.
Right, I didn't expect such a small yet fast SSD.
Thanks,
Fengguang
---
readahead: limit readahead size for small devices
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slowly. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
1G 256k
--------------------------- (*)
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 512k, this limit only takes
effect for devices whose size is less than 4G.
The formula is determined on the following data, collected by script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula should not limit the readahead size to
a degree that would hurt any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to raise its
default readahead size to 2MB, so the formula does not try to
accommodate it.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
Does anyone have a 128MB USB stick? Anyway, you get satisfactory performance
with a >= 64k readahead size :)
CC: Jens Axboe <jens.axboe@oracle.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
--- linux.orig/block/genhd.c 2010-02-02 21:58:09.000000000 +0800
+++ linux/block/genhd.c 2010-02-03 13:57:54.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,28 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * Limit default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 256M 128k
+ * 1G 256k
+ * ---------------------------
+ * 4G 512k
+ * 16G 1024k
+ * 64G 2048k
+ * 256G 4096k
+ * Since the default readahead size is 512k, this limit
+ * only takes effect for devices whose size is less than 4G.
+ */
+
+ size = get_capacity(disk) >> 9;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
}
EXPORT_SYMBOL(add_disk);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-02 22:38 ` Vivek Goyal
@ 2010-02-03 6:27 ` Wu Fengguang
-1 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-03 6:27 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
Vivek,
On Wed, Feb 03, 2010 at 06:38:03AM +0800, Vivek Goyal wrote:
> On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> > Andrew,
> >
> > This is to lift default readahead size to 512KB, which I believe yields
> > more I/O throughput without noticeably increasing I/O latency for today's HDD.
> >
>
> Hi Fengguang,
>
> I was doing a quick test with the patches. I was using fio to run some
> sequential reader threads. I have got one access to one Lun from an HP
> EVA. In my case it looks like with the patches throughput has come down.
Thank you for the quick testing!
This patchset does 3 things:
1) 512K readahead size
2) new readahead algorithms
3) new readahead tracing/stats interfaces
(1) will impact performance, while (2) _might_ impact performance in
case of bugs.
Would you kindly retest the patchset with readahead size manually set
to 128KB? That would help identify the root cause of the performance
drop:
DEV=sda
echo 128 > /sys/block/$DEV/queue/read_ahead_kb
The readahead stats provided by the patchset are very useful for
analyzing the problem:
mount -t debugfs none /debug
# for each benchmark:
echo > /debug/readahead/stats # reset counters
# do benchmark
cat /debug/readahead/stats # check counters
Thanks,
Fengguang
* Re: [PATCH 01/11] readahead: limit readahead size for small devices
2010-02-03 6:13 ` Wu Fengguang
@ 2010-02-03 8:23 ` Jens Axboe
-1 siblings, 0 replies; 83+ messages in thread
From: Jens Axboe @ 2010-02-03 8:23 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Peter Zijlstra, Linux Memory Management List,
linux-fsdevel, LKML
On Wed, Feb 03 2010, Wu Fengguang wrote:
> On Wed, Feb 03, 2010 at 03:38:26AM +0800, Jens Axboe wrote:
> > On Tue, Feb 02 2010, Wu Fengguang wrote:
> > > Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
> > > on which blkid runs unpleasantly slow. He manages to optimize the blkid
> > > reads down to 1kB+16kB, but still kernel read-ahead turns it into 48kB.
> > >
> > > lseek 0, read 1024 => readahead 4 pages (start of file)
> > > lseek 1536, read 16384 => readahead 8 pages (page contiguous)
> > >
> > > The readahead heuristics involved here are reasonable ones in general.
> > > So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
> > >
> > > For the kernel part, Linus suggests:
> > > So maybe we could be less aggressive about read-ahead when the size of
> > > the device is small? Turning a 16kB read into a 64kB one is a big deal,
> > > when it's about 15% of the whole device!
> > >
> > > This looks reasonable: smaller device tend to be slower (USB sticks as
> > > well as micro/mobile/old hard disks).
> > >
> > > Given that the non-rotational attribute is not always reported, we can
> > > take disk size as a max readahead size hint. We use a formula that
> > > generates the following concrete limits:
> > >
> > > disk size readahead size
> > > (scale by 4) (scale by 2)
> > > 2M 4k
> > > 8M 8k
> > > 32M 16k
> > > 128M 32k
> > > 512M 64k
> > > 2G 128k
> > > 8G 256k
> > > 32G 512k
> > > 128G 1024k
> >
> > I'm not sure the size part makes a ton of sense. You can have really
> > fast small devices, and large slow devices. One real world example are
> > the Sun FMod SSD devices, which are only 22GB in size but are faster
> > than the Intel X25-E SLC disks.
> >
> > What makes it even worse for these devices is that they are often
> > attached to fatter controllers than ahci, where command overhead is
> > larger.
>
> Ah, good to know about this fast 22GB SSD.
>
> > Running your script on such a device yields (I enlarged the read-count
> > by 2, makes it more reproducible):
> >
> > MARVELL SD88SA02 MP1F
> >
> > rasize 1st 2nd
> > ------------------------------------------------------------------
> > 4k 41 MB/s 41 MB/s
> > 16k 85 MB/s 81 MB/s
> > 32k 102 MB/s 109 MB/s
> > 64k 125 MB/s 144 MB/s
> > 128k 183 MB/s 185 MB/s
> > 256k 216 MB/s 216 MB/s
> > 512k 216 MB/s 236 MB/s
> > 1024k 251 MB/s 252 MB/s
> > 2M 258 MB/s 258 MB/s
> > 4M 266 MB/s 266 MB/s
> > 8M 266 MB/s 266 MB/s
> >
> > So for that device, 1M-2M looks like the sweet spot, with even needing
> > 4-8M to fully reach full throughput.
>
> Thanks for the data! I updated the formula to (16GB device => 1MB
> readahead). However the limit in this patch is only true for <4GB
> devices, since the default readahead size is merely 512KB.
Thanks Wu, you can add my acked-by.
--
Jens Axboe
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-03 6:27 ` Wu Fengguang
@ 2010-02-03 15:24 ` Vivek Goyal
-1 siblings, 0 replies; 83+ messages in thread
From: Vivek Goyal @ 2010-02-03 15:24 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Wed, Feb 03, 2010 at 02:27:56PM +0800, Wu Fengguang wrote:
> Vivek,
>
> On Wed, Feb 03, 2010 at 06:38:03AM +0800, Vivek Goyal wrote:
> > On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> > > Andrew,
> > >
> > > This is to lift default readahead size to 512KB, which I believe yields
> > > more I/O throughput without noticeably increasing I/O latency for today's HDD.
> > >
> >
> > Hi Fengguang,
> >
> > I was doing a quick test with the patches. I was using fio to run some
> > sequential reader threads. I have got one access to one Lun from an HP
> > EVA. In my case it looks like with the patches throughput has come down.
>
> Thank you for the quick testing!
>
> This patchset does 3 things:
>
> 1) 512K readahead size
> 2) new readahead algorithms
> 3) new readahead tracing/stats interfaces
>
> (1) will impact performance, while (2) _might_ impact performance in
> case of bugs.
>
> Would you kindly retest the patchset with readahead size manually set
> to 128KB? That would help identify the root cause of the performance
> drop:
>
> DEV=sda
> echo 128 > /sys/block/$DEV/queue/read_ahead_kb
>
I have got two paths to the HP EVA and a multipath device set up (dm-3). I
noticed that with the vanilla kernel read_ahead_kb=128 after boot, but with
your patches applied it is set to 4. So it looks like something went wrong
with device size/capacity detection, hence the wrong default. Manually setting
read_ahead_kb=512 got me better performance compared to the vanilla kernel.
AVERAGE[bsr]
-------
job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------- -----------
bsr 3 1 190302 97937.3 0 0
bsr 3 2 185636 223286 0 0
bsr 3 4 185986 363658 0 0
bsr 3 8 184352 428478 0 0
bsr 3 16 185646 594311 0 0
Thanks
Vivek
> The readahead stats provided by the patchset are very useful for
> analyzing the problem:
>
> mount -t debugfs none /debug
>
> # for each benchmark:
> echo > /debug/readahead/stats # reset counters
> # do benchmark
> cat /debug/readahead/stats # check counters
>
> Thanks,
> Fengguang
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-03 15:24 ` Vivek Goyal
@ 2010-02-03 15:58 ` Vivek Goyal
-1 siblings, 0 replies; 83+ messages in thread
From: Vivek Goyal @ 2010-02-03 15:58 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Wed, Feb 03, 2010 at 10:24:54AM -0500, Vivek Goyal wrote:
> On Wed, Feb 03, 2010 at 02:27:56PM +0800, Wu Fengguang wrote:
> > Vivek,
> >
> > On Wed, Feb 03, 2010 at 06:38:03AM +0800, Vivek Goyal wrote:
> > > On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> > > > Andrew,
> > > >
> > > > This is to lift default readahead size to 512KB, which I believe yields
> > > > more I/O throughput without noticeably increasing I/O latency for today's HDD.
> > > >
> > >
> > > Hi Fengguang,
> > >
> > > I was doing a quick test with the patches. I was using fio to run some
> > > sequential reader threads. I have got one access to one Lun from an HP
> > > EVA. In my case it looks like with the patches throughput has come down.
> >
> > Thank you for the quick testing!
> >
> > This patchset does 3 things:
> >
> > 1) 512K readahead size
> > 2) new readahead algorithms
> > 3) new readahead tracing/stats interfaces
> >
> > (1) will impact performance, while (2) _might_ impact performance in
> > case of bugs.
> >
> > Would you kindly retest the patchset with readahead size manually set
> > to 128KB? That would help identify the root cause of the performance
> > drop:
> >
> > DEV=sda
> > echo 128 > /sys/block/$DEV/queue/read_ahead_kb
> >
>
> I have got two paths to the HP EVA and got multipath device setup(dm-3). I
> noticed with vanilla kernel read_ahead_kb=128 after boot but with your patches
> applied it is set to 4. So looks like something went wrong with device
> size/capacity detection hence wrong defaults. Manually setting
> read_ahead_kb=512, got me better performance as compare to vanilla kernel.
>
I put a printk in add_disk and noticed that for the multipath device get_capacity() returns 0, which is why ra_pages ends up set to 1.
Thanks
Vivek
* Fwd: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-03 15:58 ` Vivek Goyal
@ 2010-02-03 16:55 ` Mike Snitzer
-1 siblings, 0 replies; 83+ messages in thread
From: Mike Snitzer @ 2010-02-03 16:55 UTC (permalink / raw)
To: dm-devel; +Cc: fengguang.wu, Vivek Goyal
FYI, I wanted to get this on our radar... it seems the latest DM isn't
allowing the RFC readahead code to set a sane readahead default for DM
devices? get_capacity() returns 0 for DM devices (not just multipath).
Vivek did share that fdisk -l shows the proper capacity for the DM
device.
I haven't had a chance to look at the relevant code yet.
I've asked Vivek to cc dm-devel on any further messages he sends
in response to this thread.
---------- Forwarded message ----------
From: Vivek Goyal <vgoyal@redhat.com>
Date: Wed, Feb 3, 2010 at 10:58 AM
Subject: Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing
safe readahead
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Jens Axboe
<jens.axboe@oracle.com>, Peter Zijlstra <a.p.zijlstra@chello.nl>,
Linux Memory Management List <linux-mm@kvack.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>, LKML
<linux-kernel@vger.kernel.org>
On Wed, Feb 03, 2010 at 10:24:54AM -0500, Vivek Goyal wrote:
> On Wed, Feb 03, 2010 at 02:27:56PM +0800, Wu Fengguang wrote:
> > Vivek,
> >
> > On Wed, Feb 03, 2010 at 06:38:03AM +0800, Vivek Goyal wrote:
> > > On Tue, Feb 02, 2010 at 11:28:35PM +0800, Wu Fengguang wrote:
> > > > Andrew,
> > > >
> > > > This is to lift default readahead size to 512KB, which I believe yields
> > > > more I/O throughput without noticeably increasing I/O latency for today's HDD.
> > > >
> > >
> > > Hi Fengguang,
> > >
> > > I was doing a quick test with the patches. I was using fio to run some
> > > sequential reader threads. I have got one access to one Lun from an HP
> > > EVA. In my case it looks like with the patches throughput has come down.
> >
> > Thank you for the quick testing!
> >
> > This patchset does 3 things:
> >
> > 1) 512K readahead size
> > 2) new readahead algorithms
> > 3) new readahead tracing/stats interfaces
> >
> > (1) will impact performance, while (2) _might_ impact performance in
> > case of bugs.
> >
> > Would you kindly retest the patchset with readahead size manually set
> > to 128KB? That would help identify the root cause of the performance
> > drop:
> >
> > DEV=sda
> > echo 128 > /sys/block/$DEV/queue/read_ahead_kb
> >
>
> I have got two paths to the HP EVA and got multipath device setup(dm-3). I
> noticed with vanilla kernel read_ahead_kb=128 after boot but with your patches
> applied it is set to 4. So it looks like something went wrong with device
> size/capacity detection, hence the wrong defaults. Manually setting
> read_ahead_kb=512 got me better performance as compared to the vanilla kernel.
>
I put a printk in add_disk and noticed that for multipath device
get_capacity() is returning 0 and that's why ra_pages is being set to
1.
Thanks
Vivek
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 01/11] readahead: limit readahead size for small devices
2010-02-02 15:28 ` Wu Fengguang
@ 2010-02-04 8:24 ` Clemens Ladisch
0 siblings, 0 replies; 83+ messages in thread
From: Clemens Ladisch @ 2010-02-04 8:24 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
Wu Fengguang wrote:
> Anyone has 512/128MB USB stick?
64 MB, USB full speed:
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 01/11] readahead: limit readahead size for small devices
2010-02-04 8:24 ` Clemens Ladisch
@ 2010-02-04 13:00 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-04 13:00 UTC (permalink / raw)
To: Clemens Ladisch
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
Clemens,
Thanks for the data!
On Thu, Feb 04, 2010 at 04:24:53PM +0800, Clemens Ladisch wrote:
> Wu Fengguang wrote:
> > Anyone has 512/128MB USB stick?
>
> 64 MB, USB full speed:
> Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
>
> 4KB: 139.339 s, 376 kB/s
> 16KB: 81.0427 s, 647 kB/s
> 32KB: 71.8513 s, 730 kB/s
> 64KB: 67.3872 s, 778 kB/s
> 128KB: 67.5434 s, 776 kB/s
> 256KB: 65.9019 s, 796 kB/s
> 512KB: 66.2282 s, 792 kB/s
> 1024KB: 67.4632 s, 777 kB/s
> 2048KB: 69.9759 s, 749 kB/s
It seems to reach good throughput at 64KB readahead size :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-03 15:58 ` Vivek Goyal
@ 2010-02-04 13:21 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-04 13:21 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML,
Clemens Ladisch
Vivek,
> > I have got two paths to the HP EVA and got multipath device setup(dm-3). I
> > noticed with vanilla kernel read_ahead_kb=128 after boot but with your patches
> > applied it is set to 4. So it looks like something went wrong with device
> > size/capacity detection, hence the wrong defaults. Manually setting
> > read_ahead_kb=512 got me better performance as compared to the vanilla kernel.
> >
>
> I put a printk in add_disk and noticed that for multipath device
> get_capacity() is returning 0 and that's why ra_pages is being set
> to 1.
Good catch, thanks!
It makes no sense to limit readahead size for multipath or other
compound devices. So we may just ignore the get_capacity() == 0 case,
as in the following updated patch.
Thanks,
Fengguang
---
readahead: limit readahead size for small devices
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slowly. He manages to optimize the blkid
reads down to 1kB+16kB, but kernel read-ahead still turns it into 48kB.
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable ones in general.
So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
1G 256k
--------------------------- (*)
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 512k, this limit only takes
effect for devices whose size is less than 4G.
The formula is determined on the following data, collected by script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula shall not limit readahead size to a
degree that would impact any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to increase
its default readahead size to 2MB, so we don't take it seriously in the
formula.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
==> 64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
--- linux.orig/block/genhd.c 2010-02-03 20:40:37.000000000 +0800
+++ linux/block/genhd.c 2010-02-04 21:19:07.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,29 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * Limit default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 256M 128k
+ * 1G 256k
+ * ---------------------------
+ * 4G 512k
+ * 16G 1024k
+ * 64G 2048k
+ * 256G 4096k
+ * Since the default readahead size is 512k, this limit
+ * only takes effect for devices whose size is less than 4G.
+ */
+ if (get_capacity(disk)) {
+ size = get_capacity(disk) >> 9;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
+ }
}
EXPORT_SYMBOL(add_disk);
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-03 15:24 ` Vivek Goyal
@ 2010-02-04 13:44 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-04 13:44 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
Vivek,
> I have got two paths to the HP EVA and got multipath device setup(dm-3). I
> noticed with vanilla kernel read_ahead_kb=128 after boot but with your patches
> applied it is set to 4. So it looks like something went wrong with device
> size/capacity detection, hence the wrong defaults. Manually setting
> read_ahead_kb=512 got me better performance as compared to the vanilla kernel.
>
> AVERAGE[bsr]
> -------
> job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us)
> --- --- -- ------------ ----------- ------------- -----------
> bsr 3 1 190302 97937.3 0 0
> bsr 3 2 185636 223286 0 0
> bsr 3 4 185986 363658 0 0
> bsr 3 8 184352 428478 0 0
> bsr 3 16 185646 594311 0 0
This looks good, thank you for the data! I added them to the changelog :)
Thanks,
Fengguang
---
readahead: bump up the default readahead size
Use 512kb max readahead size, and 32kb min readahead size.
The former helps io performance for common workloads.
The latter will be used in the thrashing safe context readahead.
-- Rationale for the 512kb size --
I believe it yields more I/O throughput without noticeably increasing
I/O latency for today's HDDs.
For example, for a 100MB/s and 8ms access time HDD:
io_size KB access_time transfer_time io_latency util% throughput KB/s
4 8 0.04 8.04 0.49% 497.57
8 8 0.08 8.08 0.97% 990.33
16 8 0.16 8.16 1.92% 1961.69
32 8 0.31 8.31 3.76% 3849.62
64 8 0.62 8.62 7.25% 7420.29
128 8 1.25 9.25 13.51% 13837.84
256 8 2.50 10.50 23.81% 24380.95
512 8 5.00 13.00 38.46% 39384.62
1024 8 10.00 18.00 55.56% 56888.89
2048 8 20.00 28.00 71.43% 73142.86
4096 8 40.00 48.00 83.33% 85333.33
The 128KB => 512KB readahead size boosts IO throughput from ~13MB/s to
~39MB/s, while only increasing the (minimal) IO latency from 9.25ms to 13ms.
As for SSD, I find that Intel X25-M SSD desires large readahead size
even for sequential reads:
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
The two other impacts of an enlarged readahead size are
- memory footprint (caused by readahead miss)
Sequential readahead hit ratio is pretty high regardless of max
readahead size; the extra memory footprint is mainly caused by
enlarged mmap read-around.
I measured my desktop:
- under Xwindow:
128KB readahead hit ratio = 143MB/230MB = 62%
512KB readahead hit ratio = 138MB/248MB = 55%
1MB readahead hit ratio = 130MB/253MB = 51%
- under console: (seems more stable than the Xwindow data)
128KB readahead hit ratio = 30MB/56MB = 53%
1MB readahead hit ratio = 30MB/59MB = 51%
So the impact to memory footprint looks acceptable.
- readahead thrashing
It will now cost up to 1MB of readahead buffer per stream. Memory-tight
systems typically do not run multiple streams; but if they do,
it should help I/O performance as long as we can avoid
thrashing, which can be achieved with the following patches.
-- Benchmarks by Vivek Goyal --
I have got two paths to the HP EVA and a multipath device setup (dm-3).
I run an increasing number of sequential readers. The file system is ext3
and the file size is 1G.
I have run the tests 3 times (3 sets) and taken the average.
Workload=bsr iosched=cfq Filesz=1G bs=32K
======================================================================
2.6.33-rc5 2.6.33-rc5-readahead
job Set NR ReadBW(KB/s) MaxClat(us) ReadBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------ -----------
bsr 3 1 141768 130965 190302 97937.3
bsr 3 2 131979 135402 185636 223286
bsr 3 4 132351 420733 185986 363658
bsr 3 8 133152 455434 184352 428478
bsr 3 16 130316 674499 185646 594311
I ran the same test on a different piece of hardware. There are a few SATA
disks (5-6) in a striped configuration behind a hardware RAID controller.
Workload=bsr iosched=cfq Filesz=1G bs=32K
======================================================================
2.6.33-rc5 2.6.33-rc5-readahead
job Set NR ReadBW(KB/s) MaxClat(us) ReadBW(KB/s) MaxClat(us)
--- --- -- ------------ ----------- ------------ -----------
bsr 3 1 147569 14369.7 160191 22752
bsr 3 2 124716 243932 149343 184698
bsr 3 4 123451 327665 147183 430875
bsr 3 8 122486 455102 144568 484045
bsr 3 16 117645 1.03957e+06 137485 1.06257e+06
Tested-by: Vivek Goyal <vgoyal@redhat.com>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/mm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- linux.orig/include/linux/mm.h 2010-01-30 17:38:49.000000000 +0800
+++ linux/include/linux/mm.h 2010-01-30 18:09:58.000000000 +0800
@@ -1184,8 +1184,8 @@ int write_one_page(struct page *page, in
void task_dirty_inc(struct task_struct *tsk);
/* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
-#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
+#define VM_MAX_READAHEAD 512 /* kbytes */
+#define VM_MIN_READAHEAD 32 /* kbytes (includes current page) */
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead
2010-02-04 13:21 ` Wu Fengguang
@ 2010-02-04 15:52 ` Vivek Goyal
0 siblings, 0 replies; 83+ messages in thread
From: Vivek Goyal @ 2010-02-04 15:52 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML,
Clemens Ladisch
On Thu, Feb 04, 2010 at 09:21:54PM +0800, Wu Fengguang wrote:
> Vivek,
>
> > > I have got two paths to the HP EVA and got multipath device setup(dm-3). I
> > > noticed with vanilla kernel read_ahead_kb=128 after boot but with your patches
> > > applied it is set to 4. So looks like something went wrong with device
> > > size/capacity detection hence wrong defaults. Manually setting
> > > read_ahead_kb=512, got me better performance as compared to the vanilla kernel.
> > >
> >
> > I put a printk in add_disk and noticed that for multipath device
> > get_capacity() is returning 0 and that's why ra_pages is being set
> > to 1.
>
> Good catch, Thanks!
>
> It makes no sense to limit readahead size for multipath or other
> compound devices. So we may just ignore the get_capacity() == 0 case,
> as in the following updated patch.
>
Thanks. This patch fixes the issue of read_ahead_kb being set to 4KB on
device-mapper targets.
Thanks
Vivek
> Thanks,
> Fengguang
> ---
> readahead: limit readahead size for small devices
>
> Linus reports a _really_ small & slow (505kB, 15kB/s) USB device on
> which blkid runs unpleasantly slowly. He managed to optimize the blkid
> reads down to 1kB+16kB, but kernel readahead still turns that into 48kB.
>
> lseek 0, read 1024 => readahead 4 pages (start of file)
> lseek 1536, read 16384 => readahead 8 pages (page contiguous)
>
> The readahead heuristics involved here are reasonable ones in general.
> So it's good to fix blkid with fadvise(RANDOM), as Linus already did.
>
> For the kernel part, Linus suggests:
> So maybe we could be less aggressive about read-ahead when the size of
> the device is small? Turning a 16kB read into a 64kB one is a big deal,
> when it's about 15% of the whole device!
>
> This looks reasonable: smaller devices tend to be slower (USB sticks as
> well as micro/mobile/old hard disks).
>
> Given that the non-rotational attribute is not always reported, we can
> take disk size as a max readahead size hint. This patch uses a formula
> that generates the following concrete limits:
>
> disk size readahead size
> (scale by 4) (scale by 2)
> 1M 8k
> 4M 16k
> 16M 32k
> 64M 64k
> 256M 128k
> 1G 256k
> --------------------------- (*)
> 4G 512k
> 16G 1024k
> 64G 2048k
> 256G 4096k
>
> (*) Since the default readahead size is 512k, this limit only takes
> effect for devices whose size is less than 4G.
>
> The formula is determined on the following data, collected by script:
>
> #!/bin/sh
>
> # please make sure BDEV is not mounted or opened by others
> BDEV=sdb
>
> for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
> do
> echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
> time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
> done
>
> The principle is that the formula shall not limit readahead size to a
> degree that would impact any device's sequential read performance.
>
> The Intel SSD is special in that its throughput increases steadily with
> larger readahead sizes. However, it may take years for Linux to raise
> its default readahead size to 2MB, so we don't weight it heavily in the
> formula.
>
> SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
>
> rasize 1st run 2nd run
> ----------------------------------
> 4k 123 MB/s 122 MB/s
> 16k 153 MB/s 153 MB/s
> 32k 161 MB/s 162 MB/s
> 64k 167 MB/s 168 MB/s
> 128k 197 MB/s 197 MB/s
> 256k 217 MB/s 217 MB/s
> 512k 238 MB/s 234 MB/s
> 1M 251 MB/s 248 MB/s
> 2M 259 MB/s 257 MB/s
> ==> 4M 269 MB/s 264 MB/s
> 8M 266 MB/s 266 MB/s
>
> Note that ==> points to the readahead size that yields plateau throughput.
>
> SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
>
> rasize 1st 2nd
> --------------------------------
> 4k 41 MB/s 41 MB/s
> 16k 85 MB/s 81 MB/s
> 32k 102 MB/s 109 MB/s
> 64k 125 MB/s 144 MB/s
> 128k 183 MB/s 185 MB/s
> 256k 216 MB/s 216 MB/s
> 512k 216 MB/s 236 MB/s
> 1024k 251 MB/s 252 MB/s
> 2M 258 MB/s 258 MB/s
> ==> 4M 266 MB/s 266 MB/s
> 8M 266 MB/s 266 MB/s
>
> SSD 30G SanDisk SATA 5000
>
> 4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
> 16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
> 32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
> 64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
> 128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
> 256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
> ==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
> 1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
> 2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
>
> USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
>
> 4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
> 16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
> 32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
> 64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
> 128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
> ==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
> 512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
> 1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
> 2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
>
> USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
>
> 4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
> 16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
> 32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
> 64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
> 128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
> ==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
> 512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
> 1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
> 2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
>
> USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
>
> 4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
> 16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
> 32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
> 64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
> 128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
> ==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
> 512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
> 1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
> 2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
>
> 64 MB, USB full speed (collected by Clemens Ladisch)
> Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
>
> 4KB: 139.339 s, 376 kB/s
> 16KB: 81.0427 s, 647 kB/s
> 32KB: 71.8513 s, 730 kB/s
> ==> 64KB: 67.3872 s, 778 kB/s
> 128KB: 67.5434 s, 776 kB/s
> 256KB: 65.9019 s, 796 kB/s
> 512KB: 66.2282 s, 792 kB/s
> 1024KB: 67.4632 s, 777 kB/s
> 2048KB: 69.9759 s, 749 kB/s
>
> CC: Li Shaohua <shaohua.li@intel.com>
> CC: Clemens Ladisch <clemens@ladisch.de>
> Acked-by: Jens Axboe <jens.axboe@oracle.com>
> Tested-by: Vivek Goyal <vgoyal@redhat.com>
> Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> block/genhd.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> --- linux.orig/block/genhd.c 2010-02-03 20:40:37.000000000 +0800
> +++ linux/block/genhd.c 2010-02-04 21:19:07.000000000 +0800
> @@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
> struct backing_dev_info *bdi;
> dev_t devt;
> int retval;
> + unsigned long size;
>
> /* minors == 0 indicates to use ext devt from part0 and should
> * be accompanied with EXT_DEVT flag. Make sure all
> @@ -551,6 +552,29 @@ void add_disk(struct gendisk *disk)
> retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
> "bdi");
> WARN_ON(retval);
> +
> + /*
> + * Limit default readahead size for small devices.
> + * disk size readahead size
> + * 1M 8k
> + * 4M 16k
> + * 16M 32k
> + * 64M 64k
> + * 256M 128k
> + * 1G 256k
> + * ---------------------------
> + * 4G 512k
> + * 16G 1024k
> + * 64G 2048k
> + * 256G 4096k
> + * Since the default readahead size is 512k, this limit
> + * only takes effect for devices whose size is less than 4G.
> + */
> + if (get_capacity(disk)) {
> + size = get_capacity(disk) >> 9;
> + size = 1UL << (ilog2(size) / 2);
> + bdi->ra_pages = min(bdi->ra_pages, size);
> + }
> }
>
> EXPORT_SYMBOL(add_disk);
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 08/11] readahead: add tracing event
2010-02-02 15:28 ` Wu Fengguang
(?)
@ 2010-02-12 16:19 ` Steven Rostedt
-1 siblings, 0 replies; 83+ messages in thread
From: Steven Rostedt @ 2010-02-12 16:19 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Ingo Molnar, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Tue, 2010-02-02 at 23:28 +0800, Wu Fengguang wrote:
> plain text document attachment (readahead-tracer.patch)
> Example output:
> + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> + ra_pattern_names[__entry->pattern],
The above totally breaks any parsing by tools. We already have a
way to map values to strings with __print_symbolic():
__print_symbolic(__entry->pattern,
{ RA_PATTERN_INITIAL, "initial" },
{ RA_PATTERN_SUBSEQUENT, "subsequent"},
{ RA_PATTERN_CONTEXT, "context"},
{ RA_PATTERN_THRASH, "thrash"},
{ RA_PATTERN_MMAP_AROUND, "around"},
{ RA_PATTERN_FADVISE, "fadvise" },
{ RA_PATTERN_RANDOM, "random"},
{ RA_PATTERN_ALL, "all" }),
see include/trace/events/irq.h for another example.
-- Steve
> + MAJOR(__entry->dev),
> + MINOR(__entry->dev),
> + __entry->ino,
> + __entry->offset,
> + __entry->req_size,
> + __entry->start,
> + __entry->size,
> + __entry->async_size,
> + __entry->start > __entry->offset,
> + __entry->actual)
> +);
> +
> +#endif /* _TRACE_READAHEAD_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> --- linux.orig/mm/readahead.c 2010-02-01 21:55:43.000000000 +0800
> +++ linux/mm/readahead.c 2010-02-01 21:57:25.000000000 +0800
> @@ -19,11 +19,25 @@
> #include <linux/pagevec.h>
> #include <linux/pagemap.h>
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/readahead.h>
> +
> /*
> * Set async size to 1/# of the thrashing threshold.
> */
> #define READAHEAD_ASYNC_RATIO 8
>
> +const char * const ra_pattern_names[] = {
> + [RA_PATTERN_INITIAL] = "initial",
> + [RA_PATTERN_SUBSEQUENT] = "subsequent",
> + [RA_PATTERN_CONTEXT] = "context",
> + [RA_PATTERN_THRASH] = "thrash",
> + [RA_PATTERN_MMAP_AROUND] = "around",
> + [RA_PATTERN_FADVISE] = "fadvise",
> + [RA_PATTERN_RANDOM] = "random",
> + [RA_PATTERN_ALL] = "all",
> +};
> +
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 08/11] readahead: add tracing event
2010-02-12 16:19 ` Steven Rostedt
@ 2010-02-14 3:56 ` Wu Fengguang
-1 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-14 3:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andrew Morton, Jens Axboe, Ingo Molnar, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Sat, Feb 13, 2010 at 12:19:05AM +0800, Steven Rostedt wrote:
> On Tue, 2010-02-02 at 23:28 +0800, Wu Fengguang wrote:
> > plain text document attachment (readahead-tracer.patch)
> > Example output:
>
> > + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > + ra_pattern_names[__entry->pattern],
>
> The above totally breaks any parsing by tools. We have already have a
> way to map values to strings with __print_symbolic():
>
> __print_symbolic(__entry->pattern,
> { RA_PATTERN_INITIAL, "initial" },
> { RA_PATTERN_SUBSEQUENT, "subsequent"},
> { RA_PATTERN_CONTEXT, "context"},
> { RA_PATTERN_THRASH, "thrash"},
> { RA_PATTERN_MMAP_AROUND, "around"},
> { RA_PATTERN_FADVISE, "fadvise" },
> { RA_PATTERN_RANDOM, "random"},
> { RA_PATTERN_ALL, "all" }),
>
> see include/trace/irq.h for another example.
Thank you! Updated patch as follows.
To avoid an unnecessary dependency, the EXTRACT_TRACE_SYMBOL() calls are
left out for now.
Thanks,
Fengguang
---
readahead: add tracing event
Example output:
# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/readahead.h | 78 +++++++++++++++++++++++++++++
mm/readahead.c | 11 ++++
2 files changed, 89 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/trace/events/readahead.h 2010-02-14 11:49:17.000000000 +0800
@@ -0,0 +1,78 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM readahead
+
+#if !defined(_TRACE_READAHEAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_READAHEAD_H
+
+#include <linux/tracepoint.h>
+
+#define show_pattern_name(val) \
+ __print_symbolic(val, \
+ { RA_PATTERN_INITIAL, "initial" }, \
+ { RA_PATTERN_SUBSEQUENT, "subsequent" }, \
+ { RA_PATTERN_CONTEXT, "context" }, \
+ { RA_PATTERN_THRASH, "thrash" }, \
+ { RA_PATTERN_MMAP_AROUND, "around" }, \
+ { RA_PATTERN_FADVISE, "fadvise" }, \
+ { RA_PATTERN_RANDOM, "random" }, \
+ { RA_PATTERN_ALL, "all" })
+
+/*
+ * Tracepoint for page cache readahead.
+ */
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size,
+ ra_flags, start, size, async_size, actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->pattern = ra_pattern(ra_flags);
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ show_pattern_name(__entry->pattern),
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_READAHEAD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux.orig/mm/readahead.c 2010-02-14 11:19:25.000000000 +0800
+++ linux/mm/readahead.c 2010-02-14 11:24:13.000000000 +0800
@@ -19,6 +19,9 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/readahead.h>
+
/*
* Set async size to 1/# of the thrashing threshold.
*/
@@ -274,6 +277,11 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ trace_readahead(mapping, offset, nr_to_read,
+ RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+ offset, nr_to_read, 0, ret);
+
return ret;
}
@@ -301,6 +309,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ trace_readahead(mapping, offset, req_size, ra->ra_flags,
+ ra->start, ra->size, ra->async_size, actual);
+
return actual;
}
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH 08/11] readahead: add tracing event
@ 2010-02-14 3:56 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-14 3:56 UTC (permalink / raw)
To: Steven Rostedt
Cc: Andrew Morton, Jens Axboe, Ingo Molnar, Peter Zijlstra,
Linux Memory Management List, linux-fsdevel, LKML
On Sat, Feb 13, 2010 at 12:19:05AM +0800, Steven Rostedt wrote:
> On Tue, 2010-02-02 at 23:28 +0800, Wu Fengguang wrote:
> > plain text document attachment (readahead-tracer.patch)
> > Example output:
>
> > + TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
> > + "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
> > + ra_pattern_names[__entry->pattern],
>
> The above totally breaks any parsing by tools. We have already have a
> way to map values to strings with __print_symbolic():
>
> __print_symbolic(__entry->pattern,
> { RA_PATTERN_INITIAL, "initial" },
> { RA_PATTERN_SUBSEQUENT, "subsequent"},
> { RA_PATTERN_CONTEXT, "context"},
> { RA_PATTERN_THRASH, "thrash"},
> { RA_PATTERN_MMAP_AROUND, "around"},
> { RA_PATTERN_FADVISE, "fadvise" },
> { RA_PATTERN_RANDOM, "random"},
> { RA_PATTERN_ALL, "all" }),
>
> see include/trace/irq.h for another example.
Thank you! Updated patch as follows.
To avoid unnecessary dependency, EXTRACT_TRACE_SYMBOL() calls are
leaved out for now.
Thanks,
Fengguang
---
readahead: add tracing event
Example output:
# echo 1 > /debug/tracing/events/readahead/enable
# cp test-file /dev/null
# cat /debug/tracing/trace # trimmed output
readahead-initial(dev=0:15, ino=100177, req=0+2, ra=0+4-2, async=0) = 4
readahead-subsequent(dev=0:15, ino=100177, req=2+2, ra=4+8-8, async=1) = 8
readahead-subsequent(dev=0:15, ino=100177, req=4+2, ra=12+16-16, async=1) = 16
readahead-subsequent(dev=0:15, ino=100177, req=12+2, ra=28+32-32, async=1) = 32
readahead-subsequent(dev=0:15, ino=100177, req=28+2, ra=60+60-60, async=1) = 24
readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
CC: Ingo Molnar <mingo@elte.hu>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/readahead.h | 78 +++++++++++++++++++++++++++++
mm/readahead.c | 11 ++++
2 files changed, 89 insertions(+)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux/include/trace/events/readahead.h 2010-02-14 11:49:17.000000000 +0800
@@ -0,0 +1,78 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM readahead
+
+#if !defined(_TRACE_READAHEAD_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_READAHEAD_H
+
+#include <linux/tracepoint.h>
+
+#define show_pattern_name(val) \
+ __print_symbolic(val, \
+ { RA_PATTERN_INITIAL, "initial" }, \
+ { RA_PATTERN_SUBSEQUENT, "subsequent" }, \
+ { RA_PATTERN_CONTEXT, "context" }, \
+ { RA_PATTERN_THRASH, "thrash" }, \
+ { RA_PATTERN_MMAP_AROUND, "around" }, \
+ { RA_PATTERN_FADVISE, "fadvise" }, \
+ { RA_PATTERN_RANDOM, "random" }, \
+ { RA_PATTERN_ALL, "all" })
+
+/*
+ * Tracepoint for guest mode entry.
+ */
+TRACE_EVENT(readahead,
+ TP_PROTO(struct address_space *mapping,
+ pgoff_t offset,
+ unsigned long req_size,
+ unsigned int ra_flags,
+ pgoff_t start,
+ unsigned int size,
+ unsigned int async_size,
+ unsigned int actual),
+
+ TP_ARGS(mapping, offset, req_size,
+ ra_flags, start, size, async_size, actual),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( pgoff_t, offset )
+ __field( unsigned long, req_size )
+ __field( unsigned int, pattern )
+ __field( pgoff_t, start )
+ __field( unsigned int, size )
+ __field( unsigned int, async_size )
+ __field( unsigned int, actual )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = mapping->host->i_sb->s_dev;
+ __entry->ino = mapping->host->i_ino;
+ __entry->pattern = ra_pattern(ra_flags);
+ __entry->offset = offset;
+ __entry->req_size = req_size;
+ __entry->start = start;
+ __entry->size = size;
+ __entry->async_size = async_size;
+ __entry->actual = actual;
+ ),
+
+ TP_printk("readahead-%s(dev=%d:%d, ino=%lu, "
+ "req=%lu+%lu, ra=%lu+%d-%d, async=%d) = %d",
+ show_pattern_name(__entry->pattern),
+ MAJOR(__entry->dev),
+ MINOR(__entry->dev),
+ __entry->ino,
+ __entry->offset,
+ __entry->req_size,
+ __entry->start,
+ __entry->size,
+ __entry->async_size,
+ __entry->start > __entry->offset,
+ __entry->actual)
+);
+
+#endif /* _TRACE_READAHEAD_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- linux.orig/mm/readahead.c 2010-02-14 11:19:25.000000000 +0800
+++ linux/mm/readahead.c 2010-02-14 11:24:13.000000000 +0800
@@ -19,6 +19,9 @@
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+#define CREATE_TRACE_POINTS
+#include <trace/events/readahead.h>
+
/*
* Set async size to 1/# of the thrashing threshold.
*/
@@ -274,6 +277,11 @@ int force_page_cache_readahead(struct ad
offset += this_chunk;
nr_to_read -= this_chunk;
}
+
+ trace_readahead(mapping, offset, nr_to_read,
+ RA_PATTERN_FADVISE << READAHEAD_PATTERN_SHIFT,
+ offset, nr_to_read, 0, ret);
+
return ret;
}
@@ -301,6 +309,9 @@ unsigned long ra_submit(struct file_ra_s
actual = __do_page_cache_readahead(mapping, filp,
ra->start, ra->size, ra->async_size);
+ trace_readahead(mapping, offset, req_size, ra->ra_flags,
+ ra->start, ra->size, ra->async_size, actual);
+
return actual;
}
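With the patch applied, the new event would be consumed through the standard
ftrace interface. A hypothetical usage sketch (assumes a tracing-enabled kernel
and debugfs mounted at /sys/kernel/debug; the test file path is a placeholder):

```shell
# Enable the tracepoint: TRACE_SYSTEM "readahead", event name "readahead"
echo 1 > /sys/kernel/debug/tracing/events/readahead/readahead/enable

# Generate some sequential reads, then inspect the formatted events,
# which follow the TP_printk() format above, e.g.
#   readahead-subsequent(dev=0:15, ino=100177, req=60+2, ra=120+60-60, async=1) = 0
dd if=/tmp/testfile of=/dev/null bs=4k count=1024
grep 'readahead-' /sys/kernel/debug/tracing/trace

# Disable the event again
echo 0 > /sys/kernel/debug/tracing/events/readahead/readahead/enable
```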
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
* [PATCH 01/11] readahead: limit readahead size for small devices
2010-02-07 4:10 [PATCH 00/11] " Wu Fengguang
2010-02-07 4:10 ` Wu Fengguang
@ 2010-02-07 4:10 ` Wu Fengguang
0 siblings, 0 replies; 83+ messages in thread
From: Wu Fengguang @ 2010-02-07 4:10 UTC (permalink / raw)
To: Andrew Morton
Cc: Jens Axboe, Li Shaohua, Clemens Ladisch, Wu Fengguang,
Chris Mason, Peter Zijlstra, Olivier Galibert,
Linux Memory Management List, linux-fsdevel, LKML
[-- Attachment #1: readahead-size-for-tiny-device.patch --]
[-- Type: text/plain, Size: 6956 bytes --]
Linus reports a _really_ small & slow (505kB, 15kB/s) USB device,
on which blkid runs unpleasantly slowly. He managed to optimize the blkid
reads down to 1kB+16kB, but kernel readahead still turns that into 48kB:
lseek 0, read 1024 => readahead 4 pages (start of file)
lseek 1536, read 16384 => readahead 8 pages (page contiguous)
The readahead heuristics involved here are reasonable in general, so it is
right to fix blkid itself with fadvise(RANDOM), as Linus already did.
For the kernel part, Linus suggests:
So maybe we could be less aggressive about read-ahead when the size of
the device is small? Turning a 16kB read into a 64kB one is a big deal,
when it's about 15% of the whole device!
This looks reasonable: smaller devices tend to be slower (USB sticks as
well as micro/mobile/old hard disks).
Given that the non-rotational attribute is not always reported, we can
take disk size as a max readahead size hint. This patch uses a formula
that generates the following concrete limits:
disk size readahead size
(scale by 4) (scale by 2)
1M 8k
4M 16k
16M 32k
64M 64k
256M 128k
1G 256k
--------------------------- (*)
4G 512k
16G 1024k
64G 2048k
256G 4096k
(*) Since the default readahead size is 512k, this limit only takes
effect for devices whose size is less than 4G.
The formula was determined based on the following data, collected by this script:
#!/bin/sh
# please make sure BDEV is not mounted or opened by others
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
The principle is that the formula must not limit readahead size to a
degree that impacts any device's sequential read performance.
The Intel SSD is special in that its throughput increases steadily with
larger readahead sizes. However, it may take years for Linux to raise its
default readahead size to 2MB, so we do not weight it heavily in the
formula.
SSD 80G Intel x25-M SSDSA2M080 (reported by Li Shaohua)
rasize 1st run 2nd run
----------------------------------
4k 123 MB/s 122 MB/s
16k 153 MB/s 153 MB/s
32k 161 MB/s 162 MB/s
64k 167 MB/s 168 MB/s
128k 197 MB/s 197 MB/s
256k 217 MB/s 217 MB/s
512k 238 MB/s 234 MB/s
1M 251 MB/s 248 MB/s
2M 259 MB/s 257 MB/s
==> 4M 269 MB/s 264 MB/s
8M 266 MB/s 266 MB/s
Note that ==> points to the readahead size that yields plateau throughput.
SSD 22G MARVELL SD88SA02 MP1F (reported by Jens Axboe)
rasize 1st 2nd
--------------------------------
4k 41 MB/s 41 MB/s
16k 85 MB/s 81 MB/s
32k 102 MB/s 109 MB/s
64k 125 MB/s 144 MB/s
128k 183 MB/s 185 MB/s
256k 216 MB/s 216 MB/s
512k 216 MB/s 236 MB/s
1024k 251 MB/s 252 MB/s
2M 258 MB/s 258 MB/s
==> 4M 266 MB/s 266 MB/s
8M 266 MB/s 266 MB/s
SSD 30G SanDisk SATA 5000
4k 29.6 MB/s 29.6 MB/s 29.6 MB/s
16k 52.1 MB/s 52.1 MB/s 52.1 MB/s
32k 61.5 MB/s 61.5 MB/s 61.5 MB/s
64k 67.2 MB/s 67.2 MB/s 67.1 MB/s
128k 71.4 MB/s 71.3 MB/s 71.4 MB/s
256k 73.4 MB/s 73.4 MB/s 73.3 MB/s
==> 512k 74.6 MB/s 74.6 MB/s 74.6 MB/s
1M 74.7 MB/s 74.6 MB/s 74.7 MB/s
2M 76.1 MB/s 74.6 MB/s 74.6 MB/s
USB stick 32G Teclast CoolFlash idVendor=1307, idProduct=0165
4k 7.9 MB/s 7.9 MB/s 7.9 MB/s
16k 17.9 MB/s 17.9 MB/s 17.9 MB/s
32k 24.5 MB/s 24.5 MB/s 24.5 MB/s
64k 28.7 MB/s 28.7 MB/s 28.7 MB/s
128k 28.8 MB/s 28.9 MB/s 28.9 MB/s
==> 256k 30.5 MB/s 30.5 MB/s 30.5 MB/s
512k 30.9 MB/s 31.0 MB/s 30.9 MB/s
1M 31.0 MB/s 30.9 MB/s 30.9 MB/s
2M 30.9 MB/s 30.9 MB/s 30.9 MB/s
USB stick 4G SanDisk Cruzer idVendor=0781, idProduct=5151
4k 6.4 MB/s 6.4 MB/s 6.4 MB/s
16k 13.4 MB/s 13.4 MB/s 13.2 MB/s
32k 17.8 MB/s 17.9 MB/s 17.8 MB/s
64k 21.3 MB/s 21.3 MB/s 21.2 MB/s
128k 21.4 MB/s 21.4 MB/s 21.4 MB/s
==> 256k 23.3 MB/s 23.2 MB/s 23.2 MB/s
512k 23.3 MB/s 23.8 MB/s 23.4 MB/s
1M 23.8 MB/s 23.4 MB/s 23.3 MB/s
2M 23.4 MB/s 23.2 MB/s 23.4 MB/s
USB stick 2G idVendor=0204, idProduct=6025 SerialNumber: 08082005000113
4k 6.7 MB/s 6.9 MB/s 6.7 MB/s
16k 11.7 MB/s 11.7 MB/s 11.7 MB/s
32k 12.4 MB/s 12.4 MB/s 12.4 MB/s
64k 13.4 MB/s 13.4 MB/s 13.4 MB/s
128k 13.4 MB/s 13.4 MB/s 13.4 MB/s
==> 256k 13.6 MB/s 13.6 MB/s 13.6 MB/s
512k 13.7 MB/s 13.7 MB/s 13.7 MB/s
1M 13.7 MB/s 13.7 MB/s 13.7 MB/s
2M 13.7 MB/s 13.7 MB/s 13.7 MB/s
64 MB, USB full speed (collected by Clemens Ladisch)
Bus 003 Device 003: ID 08ec:0011 M-Systems Flash Disk Pioneers DiskOnKey
4KB: 139.339 s, 376 kB/s
16KB: 81.0427 s, 647 kB/s
32KB: 71.8513 s, 730 kB/s
==> 64KB: 67.3872 s, 778 kB/s
128KB: 67.5434 s, 776 kB/s
256KB: 65.9019 s, 796 kB/s
512KB: 66.2282 s, 792 kB/s
1024KB: 67.4632 s, 777 kB/s
2048KB: 69.9759 s, 749 kB/s
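One caveat with the script above: consecutive runs can be served partly from
the page cache, since block-device reads are cached too. A variant that drops
clean caches between runs may give steadier numbers; this is a sketch under
the assumption that BDEV is an unmounted, otherwise-idle device and the
script runs as root:

```shell
#!/bin/sh
# Assumption: BDEV is an unmounted, otherwise-idle block device; run as root.
BDEV=sdb
for rasize in 4 16 32 64 128 256 512 1024 2048 4096 8192
do
	echo $rasize > /sys/block/$BDEV/queue/read_ahead_kb
	sync
	echo 3 > /proc/sys/vm/drop_caches	# drop pagecache, dentries, inodes
	time dd if=/dev/$BDEV of=/dev/null bs=4k count=102400
done
```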
CC: Li Shaohua <shaohua.li@intel.com>
CC: Clemens Ladisch <clemens@ladisch.de>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Tested-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/genhd.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
--- linux.orig/block/genhd.c 2010-02-03 20:40:37.000000000 +0800
+++ linux/block/genhd.c 2010-02-04 21:19:07.000000000 +0800
@@ -518,6 +518,7 @@ void add_disk(struct gendisk *disk)
struct backing_dev_info *bdi;
dev_t devt;
int retval;
+ unsigned long size;
/* minors == 0 indicates to use ext devt from part0 and should
* be accompanied with EXT_DEVT flag. Make sure all
@@ -551,6 +552,29 @@ void add_disk(struct gendisk *disk)
retval = sysfs_create_link(&disk_to_dev(disk)->kobj, &bdi->dev->kobj,
"bdi");
WARN_ON(retval);
+
+ /*
+ * Limit default readahead size for small devices.
+ * disk size readahead size
+ * 1M 8k
+ * 4M 16k
+ * 16M 32k
+ * 64M 64k
+ * 256M 128k
+ * 1G 256k
+ * ---------------------------
+ * 4G 512k
+ * 16G 1024k
+ * 64G 2048k
+ * 256G 4096k
+ * Since the default readahead size is 512k, this limit
+ * only takes effect for devices whose size is less than 4G.
+ */
+ if (get_capacity(disk)) {
+ size = get_capacity(disk) >> 9;
+ size = 1UL << (ilog2(size) / 2);
+ bdi->ra_pages = min(bdi->ra_pages, size);
+ }
}
EXPORT_SYMBOL(add_disk);
end of thread, other threads:[~2010-02-14 3:57 UTC | newest]
Thread overview: 83+ messages (duplicate archive entries collapsed below)
2010-02-02 15:28 [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead Wu Fengguang
2010-02-02 15:28 ` [PATCH 01/11] readahead: limit readahead size for small devices Wu Fengguang
2010-02-02 19:38 ` Jens Axboe
2010-02-03 6:13 ` Wu Fengguang
2010-02-03 8:23 ` Jens Axboe
2010-02-04 8:24 ` Clemens Ladisch
2010-02-04 13:00 ` Wu Fengguang
2010-02-02 15:28 ` [PATCH 02/11] readahead: bump up the default readahead size Wu Fengguang
2010-02-02 15:28 ` [PATCH 03/11] readahead: introduce {MAX|MIN}_READAHEAD_PAGES macros for ease of use Wu Fengguang
2010-02-02 15:28 ` [PATCH 04/11] readahead: replace ra->mmap_miss with ra->ra_flags Wu Fengguang
2010-02-02 15:28 ` [PATCH 05/11] readahead: retain inactive lru pages to be accessed soon Wu Fengguang
2010-02-02 15:28 ` [PATCH 06/11] readahead: thrashing safe context readahead Wu Fengguang
2010-02-02 15:28 ` [PATCH 07/11] readahead: record readahead patterns Wu Fengguang
2010-02-02 15:28 ` [PATCH 08/11] readahead: add tracing event Wu Fengguang
2010-02-12 16:19 ` Steven Rostedt
2010-02-14 3:56 ` Wu Fengguang
2010-02-02 15:28 ` [PATCH 09/11] readahead: add /debug/readahead/stats Wu Fengguang
2010-02-02 15:28 ` [PATCH 10/11] readahead: dont do start-of-file readahead after lseek() Wu Fengguang
2010-02-02 17:39 ` Linus Torvalds
2010-02-02 18:13 ` Olivier Galibert
2010-02-02 18:40 ` Linus Torvalds
2010-02-02 18:48 ` Olivier Galibert
2010-02-02 19:14 ` Linus Torvalds
2010-02-02 19:59 ` david
2010-02-02 20:22 ` Linus Torvalds
2010-02-02 15:28 ` [PATCH 11/11] radixtree: speed up next/prev hole search Wu Fengguang
2010-02-02 22:38 ` [PATCH 00/11] [RFC] 512K readahead size with thrashing safe readahead Vivek Goyal
2010-02-02 23:17 ` Vivek Goyal
2010-02-03 6:27 ` Wu Fengguang
2010-02-03 15:24 ` Vivek Goyal
2010-02-03 15:58 ` Vivek Goyal
2010-02-03 16:55 ` Fwd: " Mike Snitzer
2010-02-04 13:21 ` Wu Fengguang
2010-02-04 15:52 ` Vivek Goyal
2010-02-04 13:44 ` Wu Fengguang
2010-02-07 4:10 [PATCH 00/11] " Wu Fengguang
2010-02-07 4:10 ` [PATCH 01/11] readahead: limit readahead size for small devices Wu Fengguang