Re: readahead

linux-kernel.vger.kernel.org archive mirror

* Re: readahead
@ 2002-04-16 20:21 Andries.Brouwer
  0 siblings, 0 replies; 21+ messages in thread
From: Andries.Brouwer @ 2002-04-16 20:21 UTC (permalink / raw)
  To: Andries.Brouwer, akpm; +Cc: linux-kernel

> if (max == 0)
>	goto out;       /* No readahead */

Very good. Thanks!

Andries

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Readahead
  2005-09-27  4:24 ` Readahead Andrew Morton
@ 2005-09-28 18:40   ` Alan Stern
  0 siblings, 0 replies; 21+ messages in thread
From: Alan Stern @ 2005-09-28 18:40 UTC (permalink / raw)
  To: Andrew Morton, Jens Axboe; +Cc: Kernel development list

On Mon, 26 Sep 2005, Andrew Morton wrote:

> Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> >  Can somebody please tell me where the code is that performs optimistic
> >  readahead when a process does sequential reads on a block device?
> 
> mm/readahead.c:__do_page_cache_readahead() is the main one.  Use
> dump_stack() to be sure.
> 
> >  And can someone explain why those readahead calls are allowed to extend 
> >  beyond the end of the device?
> 
> It has a check in there for reads past the blockdev mapping's i_size. 
> Maybe i_size is wrong, or maybe the code is wrong, or maybe it's a
> different caller.

Thanks for the tip.  The problem I was chasing down was the system's
attempts to read beyond the end of a CD disc.  It turns out the
cause is partly in the block layer and partly in the cdrom drivers.

Here's what happened.  CDs have 2048-byte blocks, and I've got a disc 
containing nothing but a single data track (written with cdrecord) of 
326533 blocks.  (The original .iso was 326518 blocks long and I added 15 
blocks of padding.)

Oddly enough, the values recorded in the disc's Table Of Contents indicate
that the track is 326535 blocks.  Maybe this is normal for cdrecord or for
CDROMs in general -- I don't know.  Anyway, the cdrom drivers believe this
value and report a capacity that is 2 blocks too high.

When I try using dd with bs=2048 to read the very last actual block,
number 326532, the block layer of course issues a read request for an
entire 4 KB memory page.  The drive returns the first 2 KB of data
successfully and reports an error reading the second 2 KB, which is beyond 
the actual end of the track.
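
The arithmetic behind that failing request can be sketched as follows. This is only an illustration in Python, not kernel code: it assumes 4 KB pages aligned to offset 0 of the device, and the helper name blocks_in_page is made up for the example.

```python
BLOCK_SIZE = 2048   # CD data block size in bytes
PAGE_SIZE = 4096    # assumed page size (4 KB, as in the description above)

def blocks_in_page(block):
    """Return the device blocks covered by the page-sized read containing `block`."""
    per_page = PAGE_SIZE // BLOCK_SIZE
    first = (block // per_page) * per_page
    return list(range(first, first + per_page))

actual_blocks = 326533      # blocks really written to the track
reported_blocks = 326535    # capacity derived from the Table Of Contents

wanted = actual_blocks - 1  # last readable block, number 326532
covered = blocks_in_page(wanted)
for b in covered:
    # A block can pass the capacity check and still be unreadable.
    print(b, "inside reported capacity:", b < reported_blocks,
          "actually readable:", b < actual_blocks)
```

Under these assumptions the page holding block 326532 also holds block 326533, which passes a check against the reported capacity but lies past the real end of the track.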

Now according to a comment in drivers/scsi/sr.c:

	/*
	 * The SCSI specification allows for the value
	 * returned by READ CAPACITY to be up to 75 2K
	 * sectors past the last readable block.
	 * Therefore, if we hit a medium error within the
	 * last 75 2K sectors, we decrease the saved size
	 * value.
	 */

The code to do this has some flaws, but I fixed them.  The result is that
the stored capacity is reduced to 326533 blocks, as it should be, the SCSI
driver calls end_that_request_chunk(req, 1, 2048), and then it requeues
the request in order to retry the remaining 2048 bytes.  This naturally
fails, and the driver calls end_that_request_chunk(req, 0, 2048).  The
upshot is that the dd process receives an error instead of getting the
2 KB of data as it should.
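
The rule stated in that comment can be written down in a few lines. The Python below is a hedged sketch of the idea only, working in 2K blocks; adjust_capacity and SLACK_BLOCKS are hypothetical names, and the real driver code is C and may differ in units and boundary conditions.

```python
SLACK_BLOCKS = 75  # READ CAPACITY may overshoot by up to 75 2K sectors (per the comment)

def adjust_capacity(capacity, error_block):
    """Return the capacity to store after a medium error at `error_block`.

    If the error lies inside the last SLACK_BLOCKS blocks of the reported
    capacity, assume the device over-reported and shrink the capacity so
    that `error_block` becomes the first block past the end.
    """
    if error_block < capacity and capacity - error_block <= SLACK_BLOCKS:
        return error_block
    return capacity

# Numbers from the disc described above: reported 326535, error at block 326533.
new_capacity = adjust_capacity(326535, 326533)
print(new_capacity)
```

An error far from the end (for example at block 1000) leaves the capacity untouched in this sketch, since it cannot be explained by READ CAPACITY slack.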

The _next_ time I use dd to read that block, it works perfectly.  The 
block layer only tries to read 2048 bytes and there's no problem.

So evidently the block layer doesn't like it when a transfer only
partially succeeds, even though that part includes everything up to the
(new) end of the device.  Can this be fixed?  I wouldn't know where to
begin.

It's also worth noting that the IDE cdrom driver does not fix up the 
capacity as the SCSI driver does.  It would be a good idea to copy over 
the code -- I can probably handle that.

Alan Stern

* Re: Readahead
  2005-09-27  2:38 Readahead Alan Stern
  2005-09-27  3:06 ` Readahead Randy.Dunlap
@ 2005-09-27  4:24 ` Andrew Morton
  2005-09-28 18:40   ` Readahead Alan Stern
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2005-09-27  4:24 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-kernel

Alan Stern <stern@rowland.harvard.edu> wrote:
>
>  Can somebody please tell me where the code is that performs optimistic
>  readahead when a process does sequential reads on a block device?

mm/readahead.c:__do_page_cache_readahead() is the main one.  Use
dump_stack() to be sure.

>  And can someone explain why those readahead calls are allowed to extend 
>  beyond the end of the device?

It has a check in there for reads past the blockdev mapping's i_size. 
Maybe i_size is wrong, or maybe the code is wrong, or maybe it's a
different caller.


* Re: Readahead
  2005-09-27  2:38 Readahead Alan Stern
@ 2005-09-27  3:06 ` Randy.Dunlap
  2005-09-27  4:24 ` Readahead Andrew Morton
  1 sibling, 0 replies; 21+ messages in thread
From: Randy.Dunlap @ 2005-09-27  3:06 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-kernel

On Mon, 26 Sep 2005 22:38:21 -0400 (EDT) Alan Stern wrote:

> Can somebody please tell me where the code is that performs optimistic
> readahead when a process does sequential reads on a block device?

There's filesystem readahead in fs/buffer.c (__breadahead()),
although I don't see any actual callers of it.

However, I'm guessing that you are actually thinking about the
anticipatory IO scheduler (as), which is found in
drivers/block/as-iosched.c.
drivers/block/ll_rw_blk.c also does some readahead (READA) work.

> And can someone explain why those readahead calls are allowed to extend 
> beyond the end of the device?

Nope.

---
~Randy

* Readahead
@ 2005-09-27  2:38 Alan Stern
  2005-09-27  3:06 ` Readahead Randy.Dunlap
  2005-09-27  4:24 ` Readahead Andrew Morton
  0 siblings, 2 replies; 21+ messages in thread
From: Alan Stern @ 2005-09-27  2:38 UTC (permalink / raw)
  To: Kernel development list

Can somebody please tell me where the code is that performs optimistic
readahead when a process does sequential reads on a block device?

And can someone explain why those readahead calls are allowed to extend 
beyond the end of the device?

Alan Stern

* Re: READAHEAD
  2003-10-31  9:29       ` READAHEAD Andrew Morton
  2003-11-01  9:15         ` READAHEAD age
@ 2003-11-03  0:15         ` Derek Foreman
  1 sibling, 0 replies; 21+ messages in thread
From: Derek Foreman @ 2003-11-03  0:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ahuisman, linux-kernel, nuno.silva

On Fri, 31 Oct 2003, Andrew Morton wrote:

> Andrew Morton <akpm@osdl.org> wrote:
> >
> > Please, just use time, cat, dd, etc.
> >
> >  	mount /dev/xxx /mnt/yyy
> >  	dd if=/dev/zero of=/mnt/yyy/x bs=1M count=1024
> >  	umount /dev/xxx
> >  	mount /dev/xxx /mnt/yyy
> >  	time cat /mnt/yyy/x > /dev/null
>
> And you can do the same against /dev/hdaN if you have a scratch
> partition; that would be interesting.

I don't have a scratch partition, but the effect is quite apparent when
reading from /dev/hd*

I have 384 MB of RAM, no swap, and ran echo 0 > /proc/sys/vm/swappiness.
The kernel is 2.6.0-test9.

hdparm -qa 0 /dev/hde ; dd if=/dev/hde of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 103.038178 seconds (10420815 bytes/sec)

hdparm -qa 128 /dev/hde ; dd if=/dev/hde of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 34.171719 seconds (31421943 bytes/sec)

hdparm -qa 256 /dev/hde ; dd if=/dev/hde of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 34.994348 seconds (30683293 bytes/sec)

hdparm -qa 4096 /dev/hde ; dd if=/dev/hde of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 22.268371 seconds (48218247 bytes/sec)

(hdparm -t /dev/hde opens /dev/hde and reads it for a few seconds, so
these numbers are much the same as my hdparm -t scores)
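
For comparing runs like these, the rate dd prints can be recomputed from the byte count and elapsed time. A small Python sketch, assuming the summary-line format shown above; parse_dd_line is a hypothetical helper, not part of any tool.

```python
import re

# Matches the dd summary format seen in the output above.
DD_SUMMARY = re.compile(r"(\d+) bytes transferred in ([\d.]+) seconds")

def parse_dd_line(line):
    """Extract (bytes, seconds) from a dd summary line."""
    m = DD_SUMMARY.search(line)
    if m is None:
        raise ValueError("not a dd summary line: %r" % line)
    return int(m.group(1)), float(m.group(2))

# Summary lines copied from the four runs, keyed by the hdparm -a value.
runs = {
    0: "1073741824 bytes transferred in 103.038178 seconds (10420815 bytes/sec)",
    128: "1073741824 bytes transferred in 34.171719 seconds (31421943 bytes/sec)",
    256: "1073741824 bytes transferred in 34.994348 seconds (30683293 bytes/sec)",
    4096: "1073741824 bytes transferred in 22.268371 seconds (48218247 bytes/sec)",
}

for readahead, line in runs.items():
    nbytes, seconds = parse_dd_line(line)
    print(f"-a {readahead}: {nbytes / seconds / 1e6:.1f} MB/s")
```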

I also get similar results from /dev/md0 in another machine (the only
non-IDE device I can test on... but it's still a bunch of IDE drives).

note...
For people playing with hdparm, hdparm -a 123 -t /dev/drive doesn't set
the readahead before running the test.

* Re: READAHEAD
@ 2003-11-01 17:22 Voluspa
  0 siblings, 0 replies; 21+ messages in thread
From: Voluspa @ 2003-11-01 17:22 UTC (permalink / raw)
  To: linux-kernel


On 2003-11-01 9:15:28 Age Huisman wrote:
>Andrew Morton wrote:
>> Andrew Morton <akpm@osdl.org> wrote:
>>
>>>Please, just use time, cat, dd, etc.
>>>
>>> mount /dev/xxx /mnt/yyy
>>> dd if=/dev/zero of=/mnt/yyy/x bs=1M count=1024
>>> umount /dev/xxx
>>> mount /dev/xxx /mnt/yyy
>>> time cat /mnt/yyy/x > /dev/null
[...]
>Here are the new test results.
[...]
>I think you were right  :-)

I see an improvement with 512 instead of the default 256, but no further
speedups with 1024 or 2048 - no point in trying 4096:

readahead = 256 (on)
real    0m39.494s
user    0m0.346s
sys     0m5.436s
Timing buffered disk reads:  64 MB in  2.80 seconds = 22.84 MB/sec

readahead = 512 (on)
real    0m34.418s
user    0m0.302s
sys     0m5.304s
Timing buffered disk reads:  64 MB in  2.16 seconds = 29.63 MB/sec

And for the nostalgic people out there, here's what "hdparm /dev/hdX"
has in its readahead slot under 2.5.X:

2.5.5-pre1
readahead = 8 (on)

2.5.5-pre1-final (AKA 2.5.5) to 2.5.8-pre2
BLKRAGET failed: Input/output error

2.5.8-pre3 to 2.5.9 don't compile.

2.5.10
readahead = 0 (off)

2.5.11 failed to boot and damaged the filesystem.

2.5.12 and onwards
readahead = 256 (on)

Best regards,
Mats Johannesson

* Re: READAHEAD
  2003-10-31  9:29       ` READAHEAD Andrew Morton
@ 2003-11-01  9:15         ` age
  2003-11-03  0:15         ` READAHEAD Derek Foreman
  1 sibling, 0 replies; 21+ messages in thread
From: age @ 2003-11-01  9:15 UTC (permalink / raw)
  To: linux-kernel

Andrew Morton wrote:
> Andrew Morton <akpm@osdl.org> wrote:
> 
>>Please, just use time, cat, dd, etc.
>>
>> 	mount /dev/xxx /mnt/yyy
>> 	dd if=/dev/zero of=/mnt/yyy/x bs=1M count=1024
>> 	umount /dev/xxx
>> 	mount /dev/xxx /mnt/yyy
>> 	time cat /mnt/yyy/x > /dev/null
> 
> 
> And you can do the same against /dev/hdaN if you have a scratch
> partition; that would be interesting.


Hi Andrew,

Here are the new test results.


hdparm -a0
Timing buffered disk reads:   52 MB in  3.07 seconds =  16.94 MB/sec

wuuk:~# dd if=/dev/zero of=/home/test/test bs=1M count=34000
34000+0 records in
34000+0 records out
35651584000 bytes transferred in 922.018299 seconds (38666894 bytes/sec)

wuuk:~# time cat /home/test/test > /dev/null

real    33m21.785s
user    0m19.263s
sys     16m31.853s
wuuk:~# time rm /home/test/test

real    1m35.676s
user    0m0.001s
sys     0m6.679s

hdparm -a16
Timing buffered disk reads:  120 MB in  3.05 seconds =  39.39 MB/sec

wuuk:~# dd if=/dev/zero of=/home/test/test bs=1M count=34000
34000+0 records in
34000+0 records out
35651584000 bytes transferred in 920.669609 seconds (38723537 bytes/sec)

wuuk:~# time cat /home/test/test > /dev/null

real    22m4.180s
user    0m18.464s
sys     10m42.722s
wuuk:~# time rm /home/test/test

real    1m35.642s
user    0m0.003s
sys     0m6.635s

hdparm -a256
Timing buffered disk reads:  134 MB in  3.00 seconds =  44.61 MB/sec

wuuk:~# dd if=/dev/zero of=/home/test/test bs=1M count=34000
34000+0 records in
34000+0 records out
35651584000 bytes transferred in 920.412114 seconds (38734371 bytes/sec)

wuuk:~# time cat /home/test/test > /dev/null

real    13m24.228s
user    0m10.306s
sys     3m3.256s
wuuk:~# time rm /home/test/test

real    1m35.900s
user    0m0.002s
sys     0m6.695s

hdparm -a4096
Timing buffered disk reads:  168 MB in  3.01 seconds =  55.82 MB/sec

wuuk:~# dd if=/dev/zero of=/home/test/test bs=1M count=34000
34000+0 records in
34000+0 records out
35651584000 bytes transferred in 920.198902 seconds (38743346 bytes/sec)

wuuk:~# time cat /home/test/test > /dev/null

real    15m25.848s
user    0m10.716s
sys     3m5.103s
wuuk:~# time rm /home/test/test

real    1m36.205s
user    0m0.003s
sys     0m6.743s

I think you were right  :-)

Greetings,

Age Huisman

* Re: READAHEAD
  2003-10-30 21:44 ` READAHEAD Andrew Morton
  2003-10-31  7:43   ` READAHEAD Nuno Silva
@ 2003-10-31 12:20   ` age
  2003-10-31  9:28     ` READAHEAD Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: age @ 2003-10-31 12:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, nuno.silva


Andrew Morton wrote:
 > age <ahuisman@cistron.nl> wrote:
 >
 >>I have a problem which i don`t understand and i hope that you
 >> will and can  help me. The problem is that i experience strange disk
 >> read performance. I have to set hdparm -m16 -u1 -c1 -d1 -a4096 /dev/hde
 >> to get  timing buffered disk reads of 56 MB/SEC.
 >> When i disable readahead i get 17 MB/SEC
 >> When i enable readahead with -a8 i get  17 MB/SEC
 >> When i enable readahead with -a16 i get 24,5 MB/SEC
 >> When i enable readahead with -a32 i get 30,5 MB/SEC
 >> When i enable readahead with -a64 i get 35 MB/SEC
 >> When i enable readahead with -a128 i get 39 MB/SEC
 >> When i enable readahead with -a256 i get 39 MB/SEC
 >> When i enable readahead with -a512 i get 41 MB/SEC
 >> When i enable readahead with -a1024 i get 50 MB/SEC
 >> When i enable readahead with -a2048 i get 50 MB/SEC
 >> When i enable readahead with -a4096 i get 56 MB/SEC
 >> With -a8192,-a16384 and -a32768 i get also 56MB/SEC
 >>
 >> Before, i never had to set readahead so high
 >> Please could you tell me, what is going on here ?
 >
 >
 > Lots of people have been reporting this.  It's rather weird.
 >
 > Is the same effect observable when reading a large file, or is it only
 > observable via `hdparm -t'?
 >

Hi Andrew,

Here are some tests with bonnie++.
The used command is : bonnie++ -d /home/test -s 1024 -n 10 -u root
TCQ and write cache enabled.


The first test with hdparm -a0 -c1 -m16 -u1 -d1 /dev/hde :

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
wuuk             1G  3610  98 56878  85 10010  35  3129  90 17355  55 181.2  2
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  10 11117  95 +++++ +++ 18414  98 11888  99 +++++ +++ 19189  98


The second test with hdparm -a16 -c1 -m16 -u1 -d1 /dev/hde :

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
wuuk             1G  3611  98 56404  85 12104  36  3260  92 27741  54 193.3   1
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  10 11571  99 +++++ +++ 18477  99 11901  99 +++++ +++ 19250  99

The third test with hdparm -a256 -c1 -m16 -u1 -d1 /dev/hde :

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
wuuk             1G  3614  98 57510  85 15597  28  3621  95 43004  45 188.5   1
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  10 11622  99 +++++ +++ 18439  99 11905  99 +++++ +++ 19378  99


The fourth test with hdparm -a4096 -c1 -m16 -u1 -d1 /dev/hde :

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
wuuk             1G  3610  98 57770  85 18518  32  3636  96 42748  39 186.9   1
                     ------Sequential Create------ --------Random Create--------
                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                  10 11653  98 +++++ +++ 18608  99 11908  99 +++++ +++ 19503  99

If you need more information, please tell me.

Greetings,

Age Huisman.

* Re: READAHEAD
  2003-10-31  9:28     ` READAHEAD Andrew Morton
@ 2003-10-31  9:29       ` Andrew Morton
  2003-11-01  9:15         ` READAHEAD age
  2003-11-03  0:15         ` READAHEAD Derek Foreman
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2003-10-31  9:29 UTC (permalink / raw)
  To: ahuisman, linux-kernel, nuno.silva

Andrew Morton <akpm@osdl.org> wrote:
>
> Please, just use time, cat, dd, etc.
> 
>  	mount /dev/xxx /mnt/yyy
>  	dd if=/dev/zero of=/mnt/yyy/x bs=1M count=1024
>  	umount /dev/xxx
>  	mount /dev/xxx /mnt/yyy
>  	time cat /mnt/yyy/x > /dev/null

And you can do the same against /dev/hdaN if you have a scratch
partition; that would be interesting.

* Re: READAHEAD
  2003-10-31 12:20   ` READAHEAD age
@ 2003-10-31  9:28     ` Andrew Morton
  2003-10-31  9:29       ` READAHEAD Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2003-10-31  9:28 UTC (permalink / raw)
  To: age; +Cc: linux-kernel, nuno.silva

age <ahuisman@cistron.nl> wrote:
>
> > Lots of people have been reporting this.  It's rather weird.
>   >
>   > Is the same effect observable when reading a large file, or is it only
>   > observable via `hdparm -t'?
>   >
> 
>  Hi Andrew,
> 
>  Here are some tests with bonnie++.

Like so many of these things, bonnie++ is generally far, far too complex
for kernel performance tuning.

Please, just use time, cat, dd, etc.

	mount /dev/xxx /mnt/yyy
	dd if=/dev/zero of=/mnt/yyy/x bs=1M count=1024
	umount /dev/xxx
	mount /dev/xxx /mnt/yyy
	time cat /mnt/yyy/x > /dev/null

nice'n'easy.
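
The same measurement can be scripted when throughput figures are wanted rather than wall-clock times. The Python below is a rough sketch, not a replacement for the recipe above: time_sequential_read is a made-up helper, and it only gives meaningful numbers if the file is not already in the page cache, which is what the umount/mount pair takes care of.

```python
import time

def time_sequential_read(path, chunk_size=1024 * 1024):
    """Read `path` front to back; return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids an extra user-space buffer on top of the page cache.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Called as time_sequential_read("/mnt/yyy/x") after the remount, dividing the first return value by the second gives bytes per second for the given readahead setting.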


* Re: READAHEAD
  2003-10-31  7:43   ` READAHEAD Nuno Silva
@ 2003-10-31  8:03     ` Andrew Morton
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2003-10-31  8:03 UTC (permalink / raw)
  To: Nuno Silva; +Cc: ahuisman, linux-kernel

Nuno Silva <nuno.silva@vgertech.com> wrote:
>
> >> Before, i never had to set readahead so high
> >> Please could you tell me, what is going on here ?
> > 
> > 
> > Lots of people have been reporting this.  It's rather weird.
> > 
> 
> I know nothing about this but, FWIW, I think that what changed where the 
> units. With 2.4 you specify sectors, with 2.6 you specify bytes.
> 
> So, having -a8, in 2.4, is the same as having -a$((8*512)) [it's 4096 
> :)], in 2.6.
> 

No, everything seems OK.  Both `hdparm -a' and `blockdev --setra' are
operating in units of 512 bytes.
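
Since the unit is 512 bytes, the settings discussed in this thread convert to sizes as sketched below (Python, for illustration only; readahead_kib is a made-up helper name).

```python
SECTOR_BYTES = 512  # unit used by `hdparm -a` and `blockdev --setra`, per the reply above

def readahead_kib(sectors):
    """Convert a readahead setting in 512-byte units to KiB."""
    return sectors * SECTOR_BYTES / 1024

for setting in (8, 256, 4096):
    print(f"-a{setting}: {readahead_kib(setting):g} KiB")
```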


* Re: READAHEAD
  2003-10-30 21:44 ` READAHEAD Andrew Morton
@ 2003-10-31  7:43   ` Nuno Silva
  2003-10-31  8:03     ` READAHEAD Andrew Morton
  2003-10-31 12:20   ` READAHEAD age
  1 sibling, 1 reply; 21+ messages in thread
From: Nuno Silva @ 2003-10-31  7:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: age, linux-kernel

Hi!!

Andrew Morton wrote:
> age <ahuisman@cistron.nl> wrote:
> 
>>I have a problem which i don`t understand and i hope that you
>> will and can  help me. The problem is that i experience strange disk
>> read performance. I have to set hdparm -m16 -u1 -c1 -d1 -a4096 /dev/hde
>> to get  timing buffered disk reads of 56 MB/SEC.
>> When i disable readahead i get 17 MB/SEC
>> When i enable readahead with -a8 i get  17 MB/SEC
>> When i enable readahead with -a16 i get 24,5 MB/SEC
>> When i enable readahead with -a32 i get 30,5 MB/SEC
>> When i enable readahead with -a64 i get 35 MB/SEC
>> When i enable readahead with -a128 i get 39 MB/SEC
>> When i enable readahead with -a256 i get 39 MB/SEC
>> When i enable readahead with -a512 i get 41 MB/SEC
>> When i enable readahead with -a1024 i get 50 MB/SEC
>> When i enable readahead with -a2048 i get 50 MB/SEC
>> When i enable readahead with -a4096 i get 56 MB/SEC
>> With -a8192,-a16384 and -a32768 i get also 56MB/SEC
>>
>> Before, i never had to set readahead so high
>> Please could you tell me, what is going on here ?
> 
> 
> Lots of people have been reporting this.  It's rather weird.
> 

I know nothing about this but, FWIW, I think that what changed were the 
units. With 2.4 you specify sectors, with 2.6 you specify bytes.

So, having -a8, in 2.4, is the same as having -a$((8*512)) [it's 4096 
:)], in 2.6.

Not sure if it's the case, but makes sense :-)

Regards,
Nuno Silva


* Re: READAHEAD
  2003-10-30 19:23 READAHEAD age
@ 2003-10-30 21:44 ` Andrew Morton
  2003-10-31  7:43   ` READAHEAD Nuno Silva
  2003-10-31 12:20   ` READAHEAD age
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2003-10-30 21:44 UTC (permalink / raw)
  To: age; +Cc: linux-kernel

age <ahuisman@cistron.nl> wrote:
>
> I have a problem which i don`t understand and i hope that you
>  will and can  help me. The problem is that i experience strange disk
>  read performance. I have to set hdparm -m16 -u1 -c1 -d1 -a4096 /dev/hde
>  to get  timing buffered disk reads of 56 MB/SEC.
>  When i disable readahead i get 17 MB/SEC
>  When i enable readahead with -a8 i get  17 MB/SEC
>  When i enable readahead with -a16 i get 24,5 MB/SEC
>  When i enable readahead with -a32 i get 30,5 MB/SEC
>  When i enable readahead with -a64 i get 35 MB/SEC
>  When i enable readahead with -a128 i get 39 MB/SEC
>  When i enable readahead with -a256 i get 39 MB/SEC
>  When i enable readahead with -a512 i get 41 MB/SEC
>  When i enable readahead with -a1024 i get 50 MB/SEC
>  When i enable readahead with -a2048 i get 50 MB/SEC
>  When i enable readahead with -a4096 i get 56 MB/SEC
>  With -a8192,-a16384 and -a32768 i get also 56MB/SEC
> 
>  Before, i never had to set readahead so high
>  Please could you tell me, what is going on here ?

Lots of people have been reporting this.  It's rather weird.

Is the same effect observable when reading a large file, or is it only
observable via `hdparm -t'?


* READAHEAD
@ 2003-10-30 19:23 age
  2003-10-30 21:44 ` READAHEAD Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: age @ 2003-10-30 19:23 UTC (permalink / raw)
  To: linux-kernel

Hi, Andre

Hi,

I have a problem which I don't understand, and I hope that you
can help me. The problem is that I see strange disk
read performance. I have to set hdparm -m16 -u1 -c1 -d1 -a4096 /dev/hde
to get timing buffered disk reads of 56 MB/sec.
When I disable readahead I get 17 MB/sec
When I enable readahead with -a8 I get 17 MB/sec
When I enable readahead with -a16 I get 24.5 MB/sec
When I enable readahead with -a32 I get 30.5 MB/sec
When I enable readahead with -a64 I get 35 MB/sec
When I enable readahead with -a128 I get 39 MB/sec
When I enable readahead with -a256 I get 39 MB/sec
When I enable readahead with -a512 I get 41 MB/sec
When I enable readahead with -a1024 I get 50 MB/sec
When I enable readahead with -a2048 I get 50 MB/sec
When I enable readahead with -a4096 I get 56 MB/sec
With -a8192, -a16384 and -a32768 I also get 56 MB/sec

Before, I never had to set readahead so high.
Could you please tell me what is going on here?

I use 2.6.0-test9 with TCQ enabled.
The hard disk is the new Hitachi 7K250 40 GB PATA.

PS: It doesn't matter whether I disable TCQ and/or multcount; I get
      the same results.

THX

Age huisman

/dev/hde:

Model=HDS722540VLAT20, FwRev=V31OA60A, SerialNo=VN321EC2R9X1ML
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=52
BuffType=DualPortCache, BuffSize=1794kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=80418240
IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes:  pio0 pio1 pio2 pio3 pio4
DMA modes:  mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=yes: disabled (255) WriteCache=enabled
Drive conforms to: ATA/ATAPI-6 T13 1410D revision 3a:

* signifies the current active mode


ATA device, with non-removable media
powers-up in standby; SET FEATURES subcmd spins-up.
        Model Number:       HDS722540VLAT20
        Serial Number:      VN321EC2R9X1ML
        Firmware Revision:  V31OA60A
Standards:
        Used: ATA/ATAPI-6 T13 1410D revision 3a
        Supported: 6 5 4 3
Configuration:
        Logical         max     current
        cylinders       16383   65535
        heads           16      1
        sectors/track   63      63
        --
        CHS current addressable sectors:    4128705
        LBA    user addressable sectors:   80418240
        LBA48  user addressable sectors:   80418240
        device size with M = 1024*1024:       39266 MBytes
        device size with M = 1000*1000:       41174 MBytes (41 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        bytes avail on r/w long: 52     Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Advanced power management level: unknown setting (0x0000)
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=240ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    NOP cmd
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
                Release interrupt
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
                SMART feature set
           *    FLUSH CACHE EXT command
           *    Mandatory FLUSH CACHE command
           *    Device Configuration Overlay feature set
           *    48-bit Address feature set
                Automatic Acoustic Management feature set
                SET MAX security extension
                Address Offset Reserved Area Boot
                SET FEATURES subcommand required to spinup after power up
                Power-Up In Standby feature set
                Advanced Power Management feature set
           *    READ/WRITE DMA QUEUED
           *    General Purpose Logging feature set
           *    SMART self-test
           *    SMART error logging
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
        22min for SECURITY ERASE UNIT.
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by the jumper
Checksum: correct




00:0e.0 Unknown mass storage controller: Promise Technology, Inc. 20269 (rev 02) (prog-if 85)
        Subsystem: Promise Technology, Inc. Ultra133TX2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (1000ns min, 4500ns max), Cache Line Size: 0x08 (32 bytes)
        Interrupt: pin A routed to IRQ 10
        Region 0: I/O ports at eff0 [size=8]
        Region 1: I/O ports at efe4 [size=4]
        Region 2: I/O ports at efa8 [size=8]
        Region 3: I/O ports at efe0 [size=4]
        Region 4: I/O ports at ef90 [size=16]
        Region 5: Memory at febfc000 (32-bit, non-prefetchable) [size=16K]
        Expansion ROM at febf8000 [disabled] [size=16K]
        Capabilities: [60] Power Management version 1
                Flags: PMEClk- DSI+ D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-



wuuk:/proc/ide/hde# cat settings
name                    value           min             max             mode
----                    -----           ---             ---             ----
acoustic                0               0               254             rw
address                 1               0               2               rw
bios_cyl                16383           0               65535           rw
bios_head               255             0               255             rw
bios_sect               63              0               63              rw
bswap                   0               0               1               r
current_speed           69              0               70              rw
failures                0               0               65535           rw
init_speed              69              0               70              rw
io_32bit                1               0               3               rw
keepsettings            0               0               1               rw
lun                     0               0               7               rw
max_failures            1               0               65535           rw
multcount               16              0               16              rw
nice1                   1               0               1               rw
nowerr                  0               0               1               rw
number                  0               0               3               rw
pio_mode                write-only      0               255             w
slow                    0               0               1               rw
unmaskirq               1               0               1               rw
using_dma               1               0               1               rw
using_tcq               1               0               32              rw
wcache                  0               0               1               rw



wuuk:~# dmesg
Linux version 2.6.0-test9 (root@wuuk) (gcc version 3.3.2 (Debian)) #1 Wed Oct 29 02:23:54 CET 2003
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000001ffe0000 (usable)
BIOS-e820: 000000001ffe0000 - 000000001fff8000 (ACPI data)
BIOS-e820: 000000001fff8000 - 0000000020000000 (ACPI NVS)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000fff80000 - 0000000100000000 (reserved)
511MB LOWMEM available.
On node 0 totalpages: 131040
  DMA zone: 4096 pages, LIFO batch:1
  Normal zone: 126944 pages, LIFO batch:16
  HighMem zone: 0 pages, LIFO batch:1
DMI 2.1 present.
Building zonelist for node : 0
Kernel command line: root=/dev/hde2 ro vga=3845
Local APIC disabled by BIOS -- reenabling.
Found and enabled local APIC!
Initializing CPU#0
PID hash table entries: 2048 (order 11: 16384 bytes)
Detected 448.128 MHz processor.
Console: colour VGA+ 80x30
Memory: 515404k/524160k available (1814k kernel code, 8004k reserved, 675k data, 124k init, 0k highmem)
Calibrating delay loop... 884.73 BogoMIPS
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
CPU:     After generic identify, caps: 0383fbff 00000000 00000000 00000000
CPU:     After vendor identify, caps: 0383fbff 00000000 00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
CPU:     After all inits, caps: 0383fbff 00000000 00000000 00000040
CPU: Intel Pentium III (Katmai) stepping 03
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
enabled ExtINT on CPU#0
ESR value before enabling vector: 00000000
ESR value after enabling vector: 00000000
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 447.0981 MHz.
..... host bus clock speed is 99.0551 MHz.
NET: Registered protocol family 16
PCI: PCI BIOS revision 2.10 entry at 0xfdb91, last bus=1
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
drivers/usb/core/usb.c: registered new driver hub
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Using IRQ router PIIX [8086/7110] at 0000:00:07.0
IA-32 Microcode Update Driver: v1.13 <tigran@veritas.com>
Limiting direct PCI/PCI transfers.
pty: 256 Unix98 ptys configured
Real Time Clock Driver v1.12
Linux agpgart interface v0.100 (c) Dave Jones
agpgart: Detected an Intel 440BX Chipset.
agpgart: Maximum main memory to use for agp memory: 439M
agpgart: AGP aperture is 64M @ 0xf8000000
Using anticipatory io scheduler
Floppy drive(s): fd0 is 1.44M
floppy0: no floppy controllers found
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
loop: loaded (max 8 devices)
8139too Fast Ethernet driver 0.9.26
PCI: Found IRQ 5 for device 0000:00:0f.0
eth0: RealTek RTL8139 at 0xe0823c00, 00:02:44:47:f8:69, IRQ 5
eth0:  Identified 8139 chip type 'RTL-8100B/8139D'
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PIIX4: IDE controller at PCI slot 0000:00:07.1
PIIX4: chipset revision 1
PIIX4: not 100% native mode: will probe irqs later
    ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:pio
hdc: JLMS XJ-HD166S, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
PDC20269: IDE controller at PCI slot 0000:00:0e.0
PCI: Found IRQ 10 for device 0000:00:0e.0
PDC20269: chipset revision 2
PDC20269: ROM enabled at 0xfebf8000
PDC20269: 100% native mode on irq 10
    ide2: BM-DMA at 0xef90-0xef97, BIOS settings: hde:pio, hdf:pio
    ide3: BM-DMA at 0xef98-0xef9f, BIOS settings: hdg:pio, hdh:pio
hde: HDS722540VLAT20, ATA DISK drive
ide2 at 0xeff0-0xeff7,0xefe6 on irq 10
hde: max request size: 1024KiB
hde: 80418240 sectors (41174 MB) w/1794KiB Cache, CHS=16383/255/63, UDMA(100)
hde: tagged command queueing enabled, command queue depth 32
hde: hde1 hde2 hde3
end_request: I/O error, dev hdc, sector 0
hdc: ATAPI 48X DVD-ROM drive, 512kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.12
ide-floppy driver 0.99.newide
drivers/usb/host/uhci-hcd.c: USB Universal Host Controller Interface driver v2.1
PCI: Found IRQ 9 for device 0000:00:07.2
uhci_hcd 0000:00:07.2: UHCI Host Controller
uhci_hcd 0000:00:07.2: irq 9, io base 0000ef40
uhci_hcd 0000:00:07.2: new USB bus registered, assigned bus number 1
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 2 ports detected
drivers/usb/core/usb.c: registered new driver hiddev
drivers/usb/core/usb.c: registered new driver hid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
mice: PS/2 mouse device common for all mice
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Translated Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
NET: Registered protocol family 2
IP: routing cache hash table of 4096 buckets, 32Kbytes
TCP: Hash tables configured (established 32768 bind 65536)
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 15
BIOS EDD facility v0.10 2003-Oct-11, 1 devices found
Please report your BIOS at http://domsch.com/linux/edd30/results.html
kjournald starting.  Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 124k freed
hub 1-0:1.0: new USB device on port 1, assigned address 2
input: USB HID v1.10 Mouse [Logitech USB Mouse] on usb-0000:00:07.2-1
Adding 124952k swap on /dev/hde1.  Priority:-1 extents:1
EXT3 FS on hde2, internal journal
eth0: link up, 100Mbps, full-duplex, lpa 0x45E1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
microcode: CPU0 updated from revision 0xa to 0xe, date = 09101999



/dev/hde:
multcount    = 16 (on)
IO_support   =  1 (32-bit)
unmaskirq    =  1 (on)
using_dma    =  1 (on)
keepsettings =  0 (off)
readonly     =  0 (off)
readahead    = 4096 (on)
geometry     = 16383/255/63, sectors = 80418240, start = 0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: readahead
  2002-04-16 19:23 ` readahead Andrew Morton
@ 2002-04-16 19:33   ` Jens Axboe
  0 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2002-04-16 19:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andries.Brouwer, linux-kernel

On Tue, Apr 16 2002, Andrew Morton wrote:
> Andries.Brouwer@cwi.nl wrote:
> > 
> > ...
> >     Do these cards not have a request queue?
> > 
> > The kernel views them as SCSI disks.
> > So yes, I can do
> > 
> >    blockdev --setra 0 /dev/sdc
> > 
> > Unfortunately that does not help in the least.
> > Indeed, the only user of the readahead info is
> > readahead.c: get_max_readahead() and it does
> > 
> >         blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
> >         if (blk_ra_kbytes < VM_MIN_READAHEAD)
> >                 blk_ra_kbytes = VM_MAX_READAHEAD;
> > 
> > We need to distinguish between undefined and explicitly zero.
> 
> Yup.  The below (untested) patch should fix it up.  Assuming
> that all drivers use blk_init_queue(), and heaven knows if
> that's the case.  If not, the readahead window will have to be

set it in blk_queue_make_request(), please.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: readahead
  2002-04-16 19:10 readahead Andries.Brouwer
@ 2002-04-16 19:23 ` Andrew Morton
  2002-04-16 19:33   ` readahead Jens Axboe
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2002-04-16 19:23 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: linux-kernel

Andries.Brouwer@cwi.nl wrote:
> 
> ...
>     Do these cards not have a request queue?
> 
> The kernel views them as SCSI disks.
> So yes, I can do
> 
>    blockdev --setra 0 /dev/sdc
> 
> Unfortunately that does not help in the least.
> Indeed, the only user of the readahead info is
> readahead.c: get_max_readahead() and it does
> 
>         blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
>         if (blk_ra_kbytes < VM_MIN_READAHEAD)
>                 blk_ra_kbytes = VM_MAX_READAHEAD;
> 
> We need to distinguish between undefined and explicitly zero.

Yup.  The below (untested) patch should fix it up.  Assuming
that all drivers use blk_init_queue(), and heaven knows if
that's the case.  If not, the readahead window will have to be
set from userspace for those devices.

> ...
> 
>     Yes, but things should be OK as-is.  If the readahead attempt
>     gets an I/O error, do_generic_file_read will notice the non-uptodate
>     page and will issue a single-page read.  So everything up to
>     a page's distance from the bad block should be recoverable.
>     That's the theory; can't say that I've tested it.
> 
> It is really important to be able to tell the kernel to read and
> write only the blocks it has been asked to read and write and
> not to touch anything else.

That is really hard when there is a filesystem mounted.
Even with readahead set to zero, the kernel will go and
read PAGE_CACHE_SIZE chunks.  That's not worth changing 
to address this problem.  In the last resort, the
user would need to perform a sector-by-sector read of
the dud device into a regular file, recover from that.


--- 2.5.8/mm/readahead.c~readahead-fixes	Tue Apr 16 12:07:42 2002
+++ 2.5.8-akpm/mm/readahead.c	Tue Apr 16 12:13:52 2002
@@ -25,9 +25,6 @@
  * has a zero value of ra_sectors.
  */
 
-#define VM_MAX_READAHEAD	128	/* kbytes */
-#define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
-
 /*
  * Return max readahead size for this inode in number-of-pages.
  */
@@ -36,9 +33,6 @@ static int get_max_readahead(struct inod
 	unsigned blk_ra_kbytes = 0;
 
 	blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
-	if (blk_ra_kbytes < VM_MIN_READAHEAD)
-		blk_ra_kbytes = VM_MAX_READAHEAD;
-
 	return blk_ra_kbytes >> (PAGE_CACHE_SHIFT - 10);
 }
 
@@ -96,10 +90,10 @@ static int get_min_readahead(struct inod
  * second page) then the window will build more slowly.
  *
  * On a readahead miss (the application seeked away) the readahead window is shrunk
- * by 25%.  We don't want to drop it too aggressively, because it's a good assumption
- * that an application which has built a good readahead window will continue to
- * perform linear reads.  Either at the new file position, or at the old one after
- * another seek.
+ * by 25%.  We don't want to drop it too aggressively, because it is a good
+ * assumption that an application which has built a good readahead window will
+ * continue to perform linear reads.  Either at the new file position, or at the
+ * old one after another seek.
  *
  * There is a special-case: if the first page which the application tries to read
  * happens to be the first page of the file, it is assumed that a linear read is
@@ -109,9 +103,9 @@ static int get_min_readahead(struct inod
  * sequential file reading.
  *
  * This function is to be called for every page which is read, rather than when
- * it is time to perform readahead.  This is so the readahead algorithm can centrally
- * work out the access patterns.  This could be costly with many tiny read()s, so
- * we specifically optimise for that case with prev_page.
+ * it is time to perform readahead.  This is so the readahead algorithm can
+ * centrally work out the access patterns.  This could be costly with many tiny
+ * read()s, so we specifically optimise for that case with prev_page.
  */
 
 /*
@@ -208,8 +202,10 @@ void page_cache_readahead(struct file *f
 			goto out;
 	}
 
-	min = get_min_readahead(inode);
 	max = get_max_readahead(inode);
+	if (max == 0)
+		goto out;	/* No readahead */
+	min = get_min_readahead(inode);
 
 	if (ra->next_size == 0 && offset == 0) {
 		/*
@@ -310,7 +306,8 @@ void page_cache_readaround(struct file *
 {
 	unsigned long target;
 	unsigned long backward;
-	const int min = get_min_readahead(file->f_dentry->d_inode->i_mapping->host) * 2;
+	const int min =
+		get_min_readahead(file->f_dentry->d_inode->i_mapping->host) * 2;
 
 	if (file->f_ra.next_size < min)
 		file->f_ra.next_size = min;
--- 2.5.8/include/linux/mm.h~readahead-fixes	Tue Apr 16 12:11:39 2002
+++ 2.5.8-akpm/include/linux/mm.h	Tue Apr 16 12:12:14 2002
@@ -539,6 +539,8 @@ extern int filemap_sync(struct vm_area_s
 extern struct page *filemap_nopage(struct vm_area_struct *, unsigned long, int);
 
 /* readahead.c */
+#define VM_MAX_READAHEAD	128	/* kbytes */
+#define VM_MIN_READAHEAD	16	/* kbytes (includes current page) */
 void do_page_cache_readahead(struct file *file,
 			unsigned long offset, unsigned long nr_to_read);
 void page_cache_readahead(struct file *file, unsigned long offset);
--- 2.5.8/drivers/block/ll_rw_blk.c~readahead-fixes	Tue Apr 16 12:12:19 2002
+++ 2.5.8-akpm/drivers/block/ll_rw_blk.c	Tue Apr 16 12:13:43 2002
@@ -851,7 +851,7 @@ int blk_init_queue(request_queue_t *q, r
 	q->plug_tq.data		= q;
 	q->queue_flags		= (1 << QUEUE_FLAG_CLUSTER);
 	q->queue_lock		= lock;
-	q->ra_sectors		= 0;		/* Use VM default */
+	q->ra_sectors		= VM_MAX_READAHEAD << (PAGE_CACHE_SHIFT - 9);
 
 	blk_queue_segment_boundary(q, 0xffffffff);
 

-
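
[Worked example, not part of the original message.] The unit conversions
in the patch above can be traced with a small user-space sketch. It
assumes 4 KiB pages (PAGE_CACHE_SHIFT == 12) and that blk_get_readahead()
returns the queue's ra_sectors in 512-byte sectors; the helper names are
hypothetical. Under those assumptions the shift used in the
blk_init_queue() hunk gives 1024 sectors (512 kbytes, 128 pages), whereas
a plain kbytes-to-sectors conversion of VM_MAX_READAHEAD would give 256
sectors (128 kbytes, 32 pages). The thread does not say which default was
intended.

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT	12	/* assumption: 4 KiB pages */
#define VM_MAX_READAHEAD	128	/* kbytes, from the patch */

/* Mirrors the patched get_max_readahead(): sectors -> kbytes -> pages. */
unsigned sectors_to_pages(unsigned ra_sectors)
{
	unsigned kbytes = ra_sectors / 2;	/* two 512-byte sectors per kbyte */

	return kbytes >> (PAGE_CACHE_SHIFT - 10);
}

/* Default as written in the blk_init_queue() hunk. */
unsigned ra_default_patch_sectors(void)
{
	return VM_MAX_READAHEAD << (PAGE_CACHE_SHIFT - 9);
}

/* Hypothetical alternative: treat VM_MAX_READAHEAD strictly as kbytes. */
unsigned ra_default_kb_sectors(void)
{
	return VM_MAX_READAHEAD << (10 - 9);
}

/* Usage: the window, in pages, that each default produces. */
void ra_units_demo(void)
{
	assert(sectors_to_pages(ra_default_patch_sectors()) == 128);
	assert(sectors_to_pages(ra_default_kb_sectors()) == 32);
	assert(sectors_to_pages(0) == 0);	/* blockdev --setra 0 */
}
```

With the clamp removed, a setting of 0 sectors maps to 0 pages, which is
what lets the new "max == 0" test in page_cache_readahead() skip
readahead entirely.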

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: readahead
@ 2002-04-16 19:10 Andries.Brouwer
  2002-04-16 19:23 ` readahead Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Andries.Brouwer @ 2002-04-16 19:10 UTC (permalink / raw)
  To: Andries.Brouwer, akpm; +Cc: linux-kernel

    From: Andrew Morton <akpm@zip.com.au>

    > In the good old days we had tunable readahead.
    > Very good, especially for special purposes.

    readahead is tunable, but the window size is stored
    at the request queue layer.  If it has never been
    set, or if the device doesn't have a request queue,
    you get the defaults.

    Do these cards not have a request queue?

The kernel views them as SCSI disks.
So yes, I can do

   blockdev --setra 0 /dev/sdc

Unfortunately that does not help in the least.
Indeed, the only user of the readahead info is
readahead.c: get_max_readahead() and it does

        blk_ra_kbytes = blk_get_readahead(inode->i_dev) / 2;
        if (blk_ra_kbytes < VM_MIN_READAHEAD)
                blk_ra_kbytes = VM_MAX_READAHEAD;

We need to distinguish between undefined and explicitly zero.
Also, overriding the value explicitly given by the user
is a bad idea.
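
[Illustration, not part of the original message.] A minimal user-space
sketch of the problem: with the quoted logic, any setting below
VM_MIN_READAHEAD, including an explicit 0, is replaced by
VM_MAX_READAHEAD. The struct and function names below are hypothetical,
and the "is_set" flag is just one possible way to keep "never set" apart
from "set to zero"; it is not what any kernel version does.

```c
#include <assert.h>

#define VM_MAX_READAHEAD	128	/* kbytes, as in the quoted source */
#define VM_MIN_READAHEAD	16	/* kbytes, as in the quoted source */

/* Stand-in for the quoted logic; input is in 512-byte sectors. */
unsigned ra_kbytes_quoted(unsigned ra_sectors)
{
	unsigned kb = ra_sectors / 2;

	if (kb < VM_MIN_READAHEAD)
		kb = VM_MAX_READAHEAD;	/* 0 and small values are overridden */
	return kb;
}

/* Hypothetical setting that records whether the user chose a value. */
struct ra_setting {
	int is_set;		/* 0: never set, use the default */
	unsigned sectors;	/* valid only when is_set != 0 */
};

/* Honours an explicit value, including 0; defaults only when unset. */
unsigned ra_kbytes_explicit(struct ra_setting s)
{
	if (!s.is_set)
		return VM_MAX_READAHEAD;
	return s.sectors / 2;
}

/* Usage: what "blockdev --setra 0" and "--setra 8" lead to. */
void ra_policy_demo(void)
{
	struct ra_setting unset = { 0, 0 };
	struct ra_setting zero = { 1, 0 };
	struct ra_setting small = { 1, 8 };

	assert(ra_kbytes_quoted(0) == 128);	/* request for 0 is ignored */
	assert(ra_kbytes_quoted(8) == 128);	/* request for 4 kbytes too */
	assert(ra_kbytes_explicit(unset) == 128);
	assert(ra_kbytes_explicit(zero) == 0);
	assert(ra_kbytes_explicit(small) == 4);
}
```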

    > I recall the days where I tried to get something off
    > a bad SCSI disk, and the kernel would die in the retries
    > trying to read a bad block, while the data I needed was
    > not in the block but just before. Set readahead to zero
    > and all was fine.

    Yes, but things should be OK as-is.  If the readahead attempt
    gets an I/O error, do_generic_file_read will notice the non-uptodate
    page and will issue a single-page read.  So everything up to
    a page's distance from the bad block should be recoverable.
    That's the theory; can't say that I've tested it.

It is really important to be able to tell the kernel to read and
write only the blocks it has been asked to read and write and
not to touch anything else.

In my SCSI example you go easily past "an I/O error", but what
this driver would do is retry a few times, reset the device,
retry again, reset the scsi bus, and then the kernel would crash
or hang forever. Maybe things are better today, but one does
not want to depend on complicated subsystems recovering
from their errors. There must just not be any errors.

In my situation yesterday night entirely different things play a role.
This card has a mapping from logical to physical blocks, but a
logical block only has a corresponding physical block when it has
been written at least once. So readahead will ask for blocks that
do not exist yet. (The driver that I put on ftp now recognizes this
situation and returns an all zero block, instead of an error.)

There are other situations where reading something has side effects.
A very common side effect is time delay.

So, for some devices I want to be able to kill read-ahead, even
before the kernel looks at the partition table.
Fortunately, I think that 2.5 will include the code that moves
partition table reading code out of the kernel, so this is
really possible.

    If the driver is actually dying over the bad block, well, foo.

    Yup.  Permitting a window size of zero is on my todo list,
    but it would require that the device have a request queue.

It has.

Andries

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: readahead
  2002-04-16 13:54 readahead Andries.Brouwer
  2002-04-16 16:08 ` readahead Steven Cole
@ 2002-04-16 18:25 ` Andrew Morton
  1 sibling, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2002-04-16 18:25 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: linux-kernel

Andries.Brouwer@cwi.nl wrote:
> 
> [readahead.c has badly readable comments, on a standard
> 80-column display: many lines have a size just slightly
> over 80 chars]

Sigh.  At least it has comments.  Agree with the 80-column
thing, but I find for the kernel coding style, 80 is just
5-10 columns too short, often.

> In the good old days we had tunable readahead.
> Very good, especially for special purposes.

readahead is tunable, but the window size is stored
at the request queue layer.  If it has never been
set, or if the device doesn't have a request queue,
you get the defaults.

Do these cards not have a request queue?  Suggestions
are sought.

> I recall the days where I tried to get something off
> a bad SCSI disk, and the kernel would die in the retries
> trying to read a bad block, while the data I needed was
> not in the block but just before. Set readahead to zero
> and all was fine.

Yes, but things should be OK as-is.  If the readahead attempt
gets an I/O error, do_generic_file_read will notice the non-uptodate
page and will issue a single-page read.  So everything up to
a page's distance from the bad block should be recoverable.
That's the theory; can't say that I've tested it.

If the driver is actually dying over the bad block, well, foo.
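
[Illustration, not part of the original message.] The fallback described
above can be modelled with a toy user-space simulation. It is only a
sketch: the device model and function names are hypothetical, the
readahead request is assumed to fail as a whole when it covers the bad
page, and the driver is assumed to return an error promptly rather than
hang.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy device with a single bad page.  A request succeeds (returns 1)
 * only if it does not cover the bad page.
 */
static int dev_read(size_t bad, size_t start, size_t n)
{
	return !(bad >= start && bad < start + n);
}

/*
 * Try one n-page readahead request; for every page left non-uptodate,
 * fall back to a single-page read.  Fills uptodate[] and returns the
 * number of pages that ended up readable.
 */
size_t read_with_fallback(size_t bad, size_t start, size_t n, int *uptodate)
{
	int batch_ok = dev_read(bad, start, n);	/* the readahead attempt */
	size_t i, ok = 0;

	for (i = 0; i < n; i++) {
		uptodate[i] = batch_ok;
		if (!uptodate[i])	/* non-uptodate: retry this page alone */
			uptodate[i] = dev_read(bad, start + i, 1);
		ok += uptodate[i];
	}
	return ok;
}

/* Usage: 8 pages with page 5 bad; only page 5 is lost. */
void fallback_demo(void)
{
	int uptodate[8];

	assert(read_with_fallback(5, 0, 8, uptodate) == 7);
	assert(uptodate[4] == 1 && uptodate[5] == 0 && uptodate[6] == 1);
}
```

In this model every page except the bad one is recovered. That result
depends entirely on the assumption that a failed read comes back as an
error.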

> Yesterday evening I was playing with my sddr09 driver,
> reading SmartMedia cards, and found to my dismay that
> the kernel wants to do a 128 block readahead.
> Not only is that bad on a slow medium, one is waiting
> a noticeable time for unwanted data, but it is worse
> that setting the readahead no longer works.
> 
> [Indeed, it is very desirable to be able to set readahead
> to zero. It is also desirable to be able to set it to a
> small value. Today on 2.5.8 both are impossible, readahead.c
> insists on a minimum readahead of 16 sectors.]

Yup.  Permitting a window size of zero is on my todo list,
but it would require that the device have a request queue.
Maybe the readahead size should be placed in struct blk_dev_struct,
and not in the request queue?

-

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: readahead
  2002-04-16 13:54 readahead Andries.Brouwer
@ 2002-04-16 16:08 ` Steven Cole
  2002-04-16 18:25 ` readahead Andrew Morton
  1 sibling, 0 replies; 21+ messages in thread
From: Steven Cole @ 2002-04-16 16:08 UTC (permalink / raw)
  To: Andries.Brouwer; +Cc: akpm, linux-kernel, Dave Jones

On Tue, 2002-04-16 at 07:54, Andries.Brouwer@cwi.nl wrote:
> [readahead.c has badly readable comments, on a standard
> 80-column display: many lines have a size just slightly
> over 80 chars]

[other stuff snipped]

How is this?  Now the longest comment line should end at column 80.
Patch is made against 2.5.8-dj1.

Steven

--- linux-2.5.8-dj1/mm/readahead.c.orig	Tue Apr 16 09:25:36 2002
+++ linux-2.5.8-dj1/mm/readahead.c	Tue Apr 16 09:54:31 2002
@@ -63,9 +63,10 @@
  *              Together, start and size represent the `readahead window'.
  * next_size:   The number of pages to read when we get the next readahead miss.
  * prev_page:   The page which the readahead algorithm most-recently inspected.
- *              prev_page is mainly an optimisation: if page_cache_readahead sees
- *              that it is again being called for a page which it just looked at,
- *              it can return immediately without making any state changes.
+ *              prev_page is mainly an optimisation: if page_cache_readahead
+ *              sees that it is again being called for a page which it just
+ *              looked at, it can return immediately without making any state
+ *              changes.
  * ahead_start,
  * ahead_size:  Together, these form the "ahead window".
  *
@@ -95,30 +96,31 @@
  * If readahead hits are more sparse (say, the application is only reading every
  * second page) then the window will build more slowly.
  *
- * On a readahead miss (the application seeked away) the readahead window is shrunk
- * by 25%.  We don't want to drop it too aggressively, because it's a good assumption
- * that an application which has built a good readahead window will continue to
- * perform linear reads.  Either at the new file position, or at the old one after
- * another seek.
- *
- * There is a special-case: if the first page which the application tries to read
- * happens to be the first page of the file, it is assumed that a linear read is
- * about to happen and the window is immediately set to half of the device maximum.
+ * On a readahead miss (the application seeked away) the readahead window is
+ * shrunk by 25%.  We don't want to drop it too aggressively, because it's a
+ * good assumption that an application which has built a good readahead window
+ * will continue to perform linear reads.  Either at the new file position,
+ * or at the old one after another seek.
+ *
+ * There is a special-case: if the first page which the application tries to
+ * read happens to be the first page of the file, it is assumed that a linear
+ * read is about to happen and the window is immediately set to half of the
+ * device maximum.
  * 
  * A page request at (start + size) is not a miss at all - it's just a part of
  * sequential file reading.
  *
  * This function is to be called for every page which is read, rather than when
- * it is time to perform readahead.  This is so the readahead algorithm can centrally
- * work out the access patterns.  This could be costly with many tiny read()s, so
- * we specifically optimise for that case with prev_page.
+ * it is time to perform readahead.  This is so the readahead algorithm can
+ * centrally work out the access patterns.  This could be costly with many tiny
+ * read()s, so we specifically optimise for that case with prev_page.
  */
 
 /*
  * do_page_cache_readahead actually reads a chunk of disk.  It allocates all the
- * pages first, then submits them all for I/O. This avoids the very bad behaviour
- * which would occur if page allocations are causing VM writeback.  We really don't
- * want to intermingle reads and writes like that.
+ * pages first, then submits them all for I/O. This avoids the very bad
+ * behaviour which would occur if page allocations are causing VM writeback.
+ * We really don't want to intermingle reads and writes like that.
  */
 void do_page_cache_readahead(struct file *file,
 			unsigned long offset, unsigned long nr_to_read)
@@ -231,7 +233,8 @@
 		ra->next_size += 2;
 	} else {
 		/*
-		 * A miss - lseek, pread, etc.  Shrink the readahead window by 25%.
+		 * A miss - lseek, pread, etc.
+		 * Shrink the readahead window by 25%.
 		 */
 		ra->next_size -= ra->next_size / 4;
 		if (ra->next_size < min)
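
[Illustration, not part of the original message.] The miss branch shown
in the last hunk can be tried out in user space. The helper name is
hypothetical, and because the hunk ends at the "if", the assignment that
clamps the window to min is an assumption about the line that follows.

```c
#include <assert.h>

/* Shrink the window by 25% (integer arithmetic), but not below min. */
unsigned long shrink_on_miss(unsigned long next_size, unsigned long min)
{
	next_size -= next_size / 4;
	if (next_size < min)
		next_size = min;	/* assumed clamp, not shown in the hunk */
	return next_size;
}

/* Usage: successive misses starting from a 32-page window, min = 4. */
void shrink_demo(void)
{
	assert(shrink_on_miss(32, 4) == 24);
	assert(shrink_on_miss(24, 4) == 18);
	assert(shrink_on_miss(18, 4) == 14);
	assert(shrink_on_miss(4, 4) == 4);	/* 3 would be below min */
}
```

Because the division truncates, each step removes slightly less than a
quarter for sizes that are not multiples of 4.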



^ permalink raw reply	[flat|nested] 21+ messages in thread

* readahead
@ 2002-04-16 13:54 Andries.Brouwer
  2002-04-16 16:08 ` readahead Steven Cole
  2002-04-16 18:25 ` readahead Andrew Morton
  0 siblings, 2 replies; 21+ messages in thread
From: Andries.Brouwer @ 2002-04-16 13:54 UTC (permalink / raw)
  To: akpm, linux-kernel

[readahead.c has badly readable comments, on a standard
80-column display: many lines have a size just slightly
over 80 chars]

In the good old days we had tunable readahead.
Very good, especially for special purposes.

I recall the days where I tried to get something off
a bad SCSI disk, and the kernel would die in the retries
trying to read a bad block, while the data I needed was
not in the block but just before. Set readahead to zero
and all was fine.

Yesterday evening I was playing with my sddr09 driver,
reading SmartMedia cards, and found to my dismay that
the kernel wants to do a 128 block readahead.
Not only is that bad on a slow medium, one is waiting
a noticeable time for unwanted data, but it is worse
that setting the readahead no longer works.

[Indeed, it is very desirable to be able to set readahead
to zero. It is also desirable to be able to set it to a
small value. Today on 2.5.8 both are impossible, readahead.c
insists on a minimum readahead of 16 sectors.]

Andries

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2005-09-28 18:40 UTC | newest]

Thread overview: 21+ messages
2002-04-16 20:21 readahead Andries.Brouwer
  -- strict thread matches above, loose matches on Subject: below --
2005-09-27  2:38 Readahead Alan Stern
2005-09-27  3:06 ` Readahead Randy.Dunlap
2005-09-27  4:24 ` Readahead Andrew Morton
2005-09-28 18:40   ` Readahead Alan Stern
2003-11-01 17:22 READAHEAD Voluspa
2003-10-30 19:23 READAHEAD age
2003-10-30 21:44 ` READAHEAD Andrew Morton
2003-10-31  7:43   ` READAHEAD Nuno Silva
2003-10-31  8:03     ` READAHEAD Andrew Morton
2003-10-31 12:20   ` READAHEAD age
2003-10-31  9:28     ` READAHEAD Andrew Morton
2003-10-31  9:29       ` READAHEAD Andrew Morton
2003-11-01  9:15         ` READAHEAD age
2003-11-03  0:15         ` READAHEAD Derek Foreman
2002-04-16 19:10 readahead Andries.Brouwer
2002-04-16 19:23 ` readahead Andrew Morton
2002-04-16 19:33   ` readahead Jens Axboe
2002-04-16 13:54 readahead Andries.Brouwer
2002-04-16 16:08 ` readahead Steven Cole
2002-04-16 18:25 ` readahead Andrew Morton
