linux-kernel.vger.kernel.org archive mirror
* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28  1:08 [resent PATCH] Re: very slow parallel read performance Dieter Nützel
@ 2001-08-28  0:05 ` Marcelo Tosatti
  2001-08-28  1:54   ` Daniel Phillips
  2001-08-28  5:01 ` Mike Galbraith
  2001-08-28 18:18 ` Andrew Morton
  2 siblings, 1 reply; 124+ messages in thread
From: Marcelo Tosatti @ 2001-08-28  0:05 UTC (permalink / raw)
  To: Dieter Nützel; +Cc: Linux Kernel List, Daniel Phillips, ReiserFS List



On Tue, 28 Aug 2001, Dieter Nützel wrote:

> [-]
> > In the real-world case we observed the readahead was actually being
> > throttled by the ftp clients.  IO request throttling on the file read
> > side would not have prevented cache from overfilling.  Once the cache
> > filled up, readahead pages started being dropped and reread, cutting
> > the server throughput by a factor of 2 or so.  On the other hand,
> > performance with no readahead was even worse.
> [-]
> 
> Would you like some numbers?

Note that increasing the readahead size on the -ac and stock trees will affect
the system in different ways, since their VMs have different drop-behind
logic.

Could you please try the same tests with the stock tree? (2.4.10-pre and
2.4.9)

> 
> I've generated some max-readahead numbers (dbench-1.1 32 clients) with 
> 2.4.8-ac11,  2.4.8-ac12 (+ memory.c fix) and 2.4.8-ac12 (+ memory.c fix + low 
> latency)
> 
> system:
> Athlon I 550
> MSI MS-6167 Rev 1.0B, AMD Irongate C4 (without bypass)
> 640 MB PC100-2-2-2 SDRAM
> AHA-2940UW
> IBM U160 DDYS 18 GB, 10,000 rpm (in UW mode)
> all filesystems ReiserFS 3.6.25
> 
> * readahead does not show dramatic differences
> * killall -STOP kupdated DOES
> 
> Yes, I know it is dangerous to stop kupdated, but my disk has shown heavy thrashing 
> (seeks like mad) since 2.4.7-ac4. killall -STOP kupdated makes it smooth and 
> fast again.
> 
> Could it be that kupdated and kreiserfsd do concurrent (double) work?
> The context switch numbers are more than double with kupdated running
> than without it. Without kupdated the system feels very smooth and snappy.
> The low latency patch pushes this even further.
> 
> Regards,
> 	Dieter
> 
> 
> 2.4.8-ac11
> 
> max-readahead 511
> Throughput 22.4059 MB/sec (NB=28.0073 MB/sec  224.059 MBit/sec)
> 24.780u 78.010s 3:09.54 54.2%   0+0k 0+0io 911pf+0w
> 
> max-readahead 31 (default)
> Throughput 19.7815 MB/sec (NB=24.7269 MB/sec  197.815 MBit/sec)
> 24.690u 73.630s 3:34.55 45.8%   0+0k 0+0io 911pf+0w
> 
> max-readahead 15
> Throughput 20.5266 MB/sec (NB=25.6583 MB/sec  205.266 MBit/sec)
> 25.430u 77.090s 3:26.79 49.5%   0+0k 0+0io 911pf+0w
> 
> max-readahead 7
> Throughput 19.7186 MB/sec (NB=24.6483 MB/sec  197.186 MBit/sec)
> 24.950u 77.820s 3:35.23 47.7%   0+0k 0+0io 911pf+0w
> 
> max-readahead 3
> Throughput 21.1795 MB/sec (NB=26.4743 MB/sec  211.795 MBit/sec)
> 26.020u 79.290s 3:20.45 52.5%   0+0k 0+0io 911pf+0w
> 
> max-readahead 0
> Throughput 19.3769 MB/sec (NB=24.2211 MB/sec  193.769 MBit/sec)
> 25.070u 77.550s 3:39.00 46.8%   0+0k 0+0io 911pf+0w
> 
> killall -STOP kupdated
> retry with the 2 best cases
> 
> max-readahead 3
> Throughput 34.6985 MB/sec (NB=43.3731 MB/sec  346.985 MBit/sec)
> 24.230u 81.930s 2:02.75 86.4%   0+0k 0+0io 911pf+0w
> 
> max-readahead 511 (it is repeatable, see below)
> Throughput 32.3584 MB/sec (NB=40.448 MB/sec  323.584 MBit/sec)
> 24.190u 86.130s 2:11.55 83.8%   0+0k 0+0io 911pf+0w
> 
> Throughput 33.28 MB/sec (NB=41.6 MB/sec  332.8 MBit/sec)
> 25.220u 84.260s 2:07.93 85.5%   0+0k 0+0io 911pf+0w
> 
> After (heavy) work:
> Throughput 25.3106 MB/sec (NB=31.6383 MB/sec  253.106 MBit/sec)
> 25.370u 84.420s 2:47.91 65.3%   0+0k 0+0io 911pf+0w
> 
> After reboot:
> Throughput 31.2373 MB/sec (NB=39.0466 MB/sec  312.373 MBit/sec)
> 25.500u 83.810s 2:16.26 80.2%   0+0k 0+0io 911pf+0w
> 
> After reboot:
> Throughput 30.0666 MB/sec (NB=37.5833 MB/sec  300.666 MBit/sec)
> 25.020u 83.770s 2:21.50 76.8%   0+0k 0+0io 911pf+0w
> 
> 
> 
> 2.4.8-ac12
> 
> max-readahead 31 (default)
> Throughput 19.4526 MB/sec (NB=24.3157 MB/sec  194.526 MBit/sec)
> 24.840u 79.490s 3:38.16 47.8%   0+0k 0+0io 911pf+0w
> 
> max-readahead 511
> Throughput 21.5307 MB/sec (NB=26.9134 MB/sec  215.307 MBit/sec)
> 25.000u 77.520s 3:17.20 51.9%   0+0k 0+0io 911pf+0w
> 
> killall -STOP kupdated
> 
> max-readahead 31 (default)
> Throughput 28.5728 MB/sec (NB=35.7159 MB/sec  285.728 MBit/sec)
> 24.750u 88.250s 2:28.85 75.9%   0+0k 0+0io 911pf+0w
> 
> max-readahead 511
> Throughput 29.5127 MB/sec (NB=36.8908 MB/sec  295.127 MBit/sec)
> 25.610u 86.730s 2:24.14 77.9%   0+0k 0+0io 911pf+0w
> 
> 
> 
> 2.4.8-ac12 + The Right memory.c fix
> 
> max-readahead 31 (default)
> Throughput 22.0905 MB/sec (NB=27.6131 MB/sec  220.905 MBit/sec)
> 25.340u 77.700s 3:12.24 53.5%   0+0k 0+0io 911pf+0w
> 
> killall -STOP kupdated
> 
> max-readahead 31 (default)
> Throughput 29.2189 MB/sec (NB=36.5236 MB/sec  292.189 MBit/sec)
> 25.750u 82.090s 2:25.57 74.0%   0+0k 0+0io 911pf+0w
> 
> 
> 
> 2.4.8-ac12 + The Right memory.c fix + low latency patch
> 
> max-readahead 31 (default)
> Throughput 20.3505 MB/sec (NB=25.4381 MB/sec  203.505 MBit/sec)
> 25.430u 75.250s 3:28.58 48.2%   0+0k 0+0io 911pf+0w
> 
> killall -STOP kupdated
> 
> max-readahead 31 (default)
> Throughput 29.25 MB/sec (NB=36.5625 MB/sec  292.5 MBit/sec)
> 24.600u 86.370s 2:25.42 76.3%   0+0k 0+0io 911pf+0w
> 
> max-readahead 511
> Throughput 30.0372 MB/sec (NB=37.5465 MB/sec  300.372 MBit/sec)
> 25.590u 75.910s 2:21.64 71.6%   0+0k 0+0io 911pf+0w


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
@ 2001-08-28  1:08 Dieter Nützel
  2001-08-28  0:05 ` Marcelo Tosatti
                   ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Dieter Nützel @ 2001-08-28  1:08 UTC (permalink / raw)
  To: Linux Kernel List; +Cc: Daniel Phillips, ReiserFS List

[-]
> In the real-world case we observed the readahead was actually being
> throttled by the ftp clients.  IO request throttling on the file read
> side would not have prevented cache from overfilling.  Once the cache
> filled up, readahead pages started being dropped and reread, cutting
> the server throughput by a factor of 2 or so.  On the other hand,
> performance with no readahead was even worse.
[-]

Would you like some numbers?

I've generated some max-readahead numbers (dbench-1.1 32 clients) with 
2.4.8-ac11,  2.4.8-ac12 (+ memory.c fix) and 2.4.8-ac12 (+ memory.c fix + low 
latency)

system:
Athlon I 550
MSI MS-6167 Rev 1.0B, AMD Irongate C4 (without bypass)
640 MB PC100-2-2-2 SDRAM
AHA-2940UW
IBM U160 DDYS 18 GB, 10,000 rpm (in UW mode)
all filesystems ReiserFS 3.6.25

* readahead does not show dramatic differences
* killall -STOP kupdated DOES

Yes, I know it is dangerous to stop kupdated, but my disk has shown heavy thrashing 
(seeks like mad) since 2.4.7-ac4. killall -STOP kupdated makes it smooth and 
fast again.

Could it be that kupdated and kreiserfsd do concurrent (double) work?
The context switch numbers are more than double with kupdated running
than without it. Without kupdated the system feels very smooth and snappy.
The low latency patch pushes this even further.

Regards,
	Dieter


2.4.8-ac11

max-readahead 511
Throughput 22.4059 MB/sec (NB=28.0073 MB/sec  224.059 MBit/sec)
24.780u 78.010s 3:09.54 54.2%   0+0k 0+0io 911pf+0w

max-readahead 31 (default)
Throughput 19.7815 MB/sec (NB=24.7269 MB/sec  197.815 MBit/sec)
24.690u 73.630s 3:34.55 45.8%   0+0k 0+0io 911pf+0w

max-readahead 15
Throughput 20.5266 MB/sec (NB=25.6583 MB/sec  205.266 MBit/sec)
25.430u 77.090s 3:26.79 49.5%   0+0k 0+0io 911pf+0w

max-readahead 7
Throughput 19.7186 MB/sec (NB=24.6483 MB/sec  197.186 MBit/sec)
24.950u 77.820s 3:35.23 47.7%   0+0k 0+0io 911pf+0w

max-readahead 3
Throughput 21.1795 MB/sec (NB=26.4743 MB/sec  211.795 MBit/sec)
26.020u 79.290s 3:20.45 52.5%   0+0k 0+0io 911pf+0w

max-readahead 0
Throughput 19.3769 MB/sec (NB=24.2211 MB/sec  193.769 MBit/sec)
25.070u 77.550s 3:39.00 46.8%   0+0k 0+0io 911pf+0w

killall -STOP kupdated
retry with the 2 best cases

max-readahead 3
Throughput 34.6985 MB/sec (NB=43.3731 MB/sec  346.985 MBit/sec)
24.230u 81.930s 2:02.75 86.4%   0+0k 0+0io 911pf+0w

max-readahead 511 (it is repeatable, see below)
Throughput 32.3584 MB/sec (NB=40.448 MB/sec  323.584 MBit/sec)
24.190u 86.130s 2:11.55 83.8%   0+0k 0+0io 911pf+0w

Throughput 33.28 MB/sec (NB=41.6 MB/sec  332.8 MBit/sec)
25.220u 84.260s 2:07.93 85.5%   0+0k 0+0io 911pf+0w

After (heavy) work:
Throughput 25.3106 MB/sec (NB=31.6383 MB/sec  253.106 MBit/sec)
25.370u 84.420s 2:47.91 65.3%   0+0k 0+0io 911pf+0w

After reboot:
Throughput 31.2373 MB/sec (NB=39.0466 MB/sec  312.373 MBit/sec)
25.500u 83.810s 2:16.26 80.2%   0+0k 0+0io 911pf+0w

After reboot:
Throughput 30.0666 MB/sec (NB=37.5833 MB/sec  300.666 MBit/sec)
25.020u 83.770s 2:21.50 76.8%   0+0k 0+0io 911pf+0w



2.4.8-ac12

max-readahead 31 (default)
Throughput 19.4526 MB/sec (NB=24.3157 MB/sec  194.526 MBit/sec)
24.840u 79.490s 3:38.16 47.8%   0+0k 0+0io 911pf+0w

max-readahead 511
Throughput 21.5307 MB/sec (NB=26.9134 MB/sec  215.307 MBit/sec)
25.000u 77.520s 3:17.20 51.9%   0+0k 0+0io 911pf+0w

killall -STOP kupdated

max-readahead 31 (default)
Throughput 28.5728 MB/sec (NB=35.7159 MB/sec  285.728 MBit/sec)
24.750u 88.250s 2:28.85 75.9%   0+0k 0+0io 911pf+0w

max-readahead 511
Throughput 29.5127 MB/sec (NB=36.8908 MB/sec  295.127 MBit/sec)
25.610u 86.730s 2:24.14 77.9%   0+0k 0+0io 911pf+0w



2.4.8-ac12 + The Right memory.c fix

max-readahead 31 (default)
Throughput 22.0905 MB/sec (NB=27.6131 MB/sec  220.905 MBit/sec)
25.340u 77.700s 3:12.24 53.5%   0+0k 0+0io 911pf+0w

killall -STOP kupdated

max-readahead 31 (default)
Throughput 29.2189 MB/sec (NB=36.5236 MB/sec  292.189 MBit/sec)
25.750u 82.090s 2:25.57 74.0%   0+0k 0+0io 911pf+0w



2.4.8-ac12 + The Right memory.c fix + low latency patch

max-readahead 31 (default)
Throughput 20.3505 MB/sec (NB=25.4381 MB/sec  203.505 MBit/sec)
25.430u 75.250s 3:28.58 48.2%   0+0k 0+0io 911pf+0w

killall -STOP kupdated

max-readahead 31 (default)
Throughput 29.25 MB/sec (NB=36.5625 MB/sec  292.5 MBit/sec)
24.600u 86.370s 2:25.42 76.3%   0+0k 0+0io 911pf+0w

max-readahead 511
Throughput 30.0372 MB/sec (NB=37.5465 MB/sec  300.372 MBit/sec)
25.590u 75.910s 2:21.64 71.6%   0+0k 0+0io 911pf+0w

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28  0:05 ` Marcelo Tosatti
@ 2001-08-28  1:54   ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-28  1:54 UTC (permalink / raw)
  To: Marcelo Tosatti, Dieter Nützel; +Cc: Linux Kernel List, ReiserFS List

On August 28, 2001 02:05 am, Marcelo Tosatti wrote:
> On Tue, 28 Aug 2001, Dieter Nützel wrote:
> 
> > [-]
> > > In the real-world case we observed the readahead was actually being
> > > throttled by the ftp clients.  IO request throttling on the file read
> > > side would not have prevented cache from overfilling.  Once the cache
> > > filled up, readahead pages started being dropped and reread, cutting
> > > the server throughput by a factor of 2 or so.  On the other hand,
> > > performance with no readahead was even worse.
> > [-]
> > 
> > Would you like some numbers?
> 
> Note that increasing the readahead size on the -ac and stock trees will affect
> the system in different ways, since their VMs have different drop-behind
> logic.
> 
> Could you please try the same tests with the stock tree? (2.4.10-pre and
> 2.4.9)

He'll need the proc max-readahead patch posted by Craig I. Hagan on Sunday 
under the subject "Re: [resent PATCH] Re: very slow parallel read 
performance".

There are two other big variables here: Reiserfs and dbench.  Personally, I 
question the value of doing this testing with dbench; it's too erratic.
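
For whoever repeats the tests on a stock tree, here is a minimal sketch of
driving that tunable once the patch is applied.  The proc path
/proc/sys/vm/max-readahead is an assumption on my part, inferred from the
knob name in Dieter's report rather than taken from Craig's patch:

#include <stdio.h>
#include <stdlib.h>

/* Write a new max-readahead value (in pages) to the assumed proc tunable. */
int main(int argc, char **argv)
{
	const char *path = "/proc/sys/vm/max-readahead";	/* assumed path */
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return EXIT_FAILURE;
	}
	fprintf(f, "%s\n", argc > 1 ? argv[1] : "511");
	return fclose(f) ? EXIT_FAILURE : EXIT_SUCCESS;
}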

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28  1:08 [resent PATCH] Re: very slow parallel read performance Dieter Nützel
  2001-08-28  0:05 ` Marcelo Tosatti
@ 2001-08-28  5:01 ` Mike Galbraith
  2001-08-28  8:46   ` [reiserfs-list] " Hans Reiser
  2001-08-28 18:18 ` Andrew Morton
  2 siblings, 1 reply; 124+ messages in thread
From: Mike Galbraith @ 2001-08-28  5:01 UTC (permalink / raw)
  To: Dieter Nützel; +Cc: Linux Kernel List, Daniel Phillips, ReiserFS List

On Tue, 28 Aug 2001, Dieter Nützel wrote:

> * readahead does not show dramatic differences
> * killall -STOP kupdated DOES
>
> Yes, I know it is dangerous to stop kupdated, but my disk has shown heavy thrashing
> (seeks like mad) since 2.4.7-ac4. killall -STOP kupdated makes it smooth and
> fast again.

Interesting.

A while back, I twiddled the flush logic in buffer.c a little and made
kupdated only handle light flushing.. stay out of the way when bdflush
is running.  This and some dynamic adjustment of bdflush flushsize and
not stopping flushing right _at_ (biggie) the trigger level produced
very interesting improvements.  (very marked reduction in system time
for heavy IO jobs, and large improvement in file rewrite throughput)
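
For the record, the flavour of the change, as a sketch only (this is not the
actual patch, and every identifier below is invented for illustration):

/* Sketch: kupdated's periodic pass backs off while bdflush is actively
 * flushing, so the two daemons do not fight over the same dirty buffers. */
static void sketch_kupdate_pass(void)
{
	if (sketch_bdflush_is_running())	/* invented predicate */
		return;				/* stay out of bdflush's way */

	/* light flushing only: a small fixed batch of old dirty buffers */
	sketch_flush_old_dirty_buffers(32);	/* invented helper */
}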

	-Mike


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [reiserfs-list] Re: [resent PATCH] Re: very slow parallel read  performance
  2001-08-28  5:01 ` Mike Galbraith
@ 2001-08-28  8:46   ` Hans Reiser
  2001-08-28 19:17     ` Mike Galbraith
  0 siblings, 1 reply; 124+ messages in thread
From: Hans Reiser @ 2001-08-28  8:46 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Dieter Nützel, Linux Kernel List, Daniel Phillips, ReiserFS List,
	Gryaznova E.

Mike Galbraith wrote:
> 
> On Tue, 28 Aug 2001, Dieter Nützel wrote:
> 
> > * readahead does not show dramatic differences
> > * killall -STOP kupdated DOES
> >
> > Yes, I know it is dangerous to stop kupdated, but my disk has shown heavy thrashing
> > (seeks like mad) since 2.4.7-ac4. killall -STOP kupdated makes it smooth and
> > fast again.
> 
> Interesting.
> 
> A while back, I twiddled the flush logic in buffer.c a little and made
> kupdated only handle light flushing.. stay out of the way when bdflush
> is running.  This and some dynamic adjustment of bdflush flushsize and
> not stopping flushing right _at_ (biggie) the trigger level produced
> very interesting improvements.  (very marked reduction in system time
> for heavy IO jobs, and large improvement in file rewrite throughput)
> 
>         -Mike


Can you send us the patch, and Elena will run some tests on it?

Hans

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28  1:08 [resent PATCH] Re: very slow parallel read performance Dieter Nützel
  2001-08-28  0:05 ` Marcelo Tosatti
  2001-08-28  5:01 ` Mike Galbraith
@ 2001-08-28 18:18 ` Andrew Morton
  2001-08-28 18:45   ` Hans Reiser
  2 siblings, 1 reply; 124+ messages in thread
From: Andrew Morton @ 2001-08-28 18:18 UTC (permalink / raw)
  To: Dieter Nützel; +Cc: Linux Kernel List, Daniel Phillips, ReiserFS List

Dieter Nützel wrote:
> 
> ...
> (dbench-1.1 32 clients)
> ...
> 640 MB PC100-2-2-2 SDRAM
> ...
> * readahead does not show dramatic differences
> * killall -STOP kupdated DOES

dbench is a poor tool for evaluating VM or filesystem performance.
It deletes its own files.

If you want really good dbench numbers, you can simply install lots
of RAM and tweak the bdflush parameters thusly:

1: set nfract and nfract_sync really high, so you can use all your
   RAM for buffering dirty data.

2: Set the kupdate interval to 1000000000 to prevent kupdate from
   kicking in.

And guess what?  You can perform an entire dbench run without
touching the disk at all!  dbench deletes its own files, and
they never hit disk.
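
To make that concrete, a small sketch that just dumps the current tuple;
which fields are nfract, nfract_sync and the kupdate interval varies between
2.4 releases, so check Documentation/sysctl/vm.txt for your kernel before
echoing a modified line back:

#include <stdio.h>

/* Print the current bdflush parameters.  Applying the tuning described
 * above means writing a modified line back to the same file; the field
 * order is version-dependent, so no write is attempted here. */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/sys/vm/bdflush", "r");

	if (!f) {
		perror("/proc/sys/vm/bdflush");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}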


It gets more complex - if you leave the bdflush parameters at
default, and increase the number of dbench clients you'll reach
a point where bdflush starts kicking in to reduce the amount
of buffercache memory.  This slows the dbench clients down,
so they have less opportunity to delete data before kupdate and
bdflush write them out.  So the net effect is that the slower
you go, the more I/O you end up doing - a *lot* more.  This slows
you down further, which causes more I/O, etc.

dbench is not a benchmark.  It is really complex and really
misleading.  It is a fine stress-tester, though.

The original netbench test which dbench emulates has three phases:
startup, run and cleanup.  Throughput numbers are only quoted for
the "run" phase.  dbench is incomplete in that it reports on throughput
for all three phases.  Apparently Tridge and friends are working on
changing this, but it will still be the case that the entire test
can be optimised away, and that it is subject to the regenerative
feedback phenomenon described above.

For tuning and measuring fs and VM efficiency we need to use
simpler, more specific tools.

-

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28 18:18 ` Andrew Morton
@ 2001-08-28 18:45   ` Hans Reiser
  0 siblings, 0 replies; 124+ messages in thread
From: Hans Reiser @ 2001-08-28 18:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dieter Nützel, Linux Kernel List, Daniel Phillips, ReiserFS List

Andrew Morton wrote:
> 
> Dieter Nützel wrote:
> >
> > ...
> > (dbench-1.1 32 clients)
> > ...
> > 640 MB PC100-2-2-2 SDRAM
> > ...
> > * readahead does not show dramatic differences
> > * killall -STOP kupdated DOES
> 
> dbench is a poor tool for evaluating VM or filesystem performance.
> It deletes its own files.
> 
> If you want really good dbench numbers, you can simply install lots
> of RAM and tweak the bdflush parameters thusly:
> 
> 1: set nfract and nfract_sync really high, so you can use all your
>    RAM for buffering dirty data.
> 
> 2: Set the kupdate interval to 1000000000 to prevent kupdate from
>    kicking in.
> 
> And guess what?  You can perform an entire dbench run without
> touching the disk at all!  dbench deletes its own files, and
> they never hit disk.
> 
> It gets more complex - if you leave the bdflush parameters at
> default, and increase the number of dbench clients you'll reach
> a point where bdflush starts kicking in to reduce the amount
> of buffercache memory.  This slows the dbench clients down,
> so they have less opportunity to delete data before kupdate and
> bdflush write them out.  So the net effect is that the slower
> you go, the more I/O you end up doing - a *lot* more.  This slows
> you down further, which causes more I/O, etc.
> 
> dbench is not a benchmark.  It is really complex and really
> misleading.  It is a fine stress-tester, though.
> 
> The original netbench test which dbench emulates has three phases:
> startup, run and cleanup.  Throughput numbers are only quoted for
> the "run" phase.  dbench is incomplete in that it reports on throughput
> for all three phases.  Apparently Tridge and friends are working on
> changing this, but it will still be the case that the entire test
> can be optimised away, and that it is subject to the regenerative
> feedback phenomenon described above.
> 
> For tuning and measuring fs and VM efficiency we need to use
> simpler, more specific tools.
> 
> -

I would encourage you to check out the reiser_fract_tree program, which is at
the heart of mongo.pl.  It generates lots of files using a fractal algorithm for
size and number per directory, and I think it reflects real-world statistics
decently.  You can specify mean file size, max file size, mean number of files
per directory, max number of files per directory, and so on.  Check it out:
www.namesys.com/benchmarks.html

It just generates file sets, which is fine for write performance testing like
you are doing.  Mongo.pl can do reads and copies and stuff using those file
sets.

Hans

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [reiserfs-list] Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28  8:46   ` [reiserfs-list] " Hans Reiser
@ 2001-08-28 19:17     ` Mike Galbraith
  0 siblings, 0 replies; 124+ messages in thread
From: Mike Galbraith @ 2001-08-28 19:17 UTC (permalink / raw)
  To: Hans Reiser
  Cc: Dieter Nützel, Linux Kernel List, Daniel Phillips, ReiserFS List,
	Gryaznova E.

On Tue, 28 Aug 2001, Hans Reiser wrote:

> Mike Galbraith wrote:
> >
> > On Tue, 28 Aug 2001, Dieter Nützel wrote:
> >
> > > * readahead does not show dramatic differences
> > > * killall -STOP kupdated DOES
> > >
> > > Yes, I know it is dangerous to stop kupdated, but my disk has shown heavy thrashing
> > > (seeks like mad) since 2.4.7-ac4. killall -STOP kupdated makes it smooth and
> > > fast again.
> >
> > Interesting.
> >
> > A while back, I twiddled the flush logic in buffer.c a little and made
> > kupdated only handle light flushing.. stay out of the way when bdflush
> > is running.  This and some dynamic adjustment of bdflush flushsize and
> > not stopping flushing right _at_ (biggie) the trigger level produced
> > very interesting improvements.  (very marked reduction in system time
> > for heavy IO jobs, and large improvement in file rewrite throughput)
> >
> >         -Mike
>
>
> Can you send us the patch, and Elena will run some tests on it?

I think I posted the patch once (including a dumb typo), and I know I sent
it to a couple of folks to try if they wanted, but I don't save such things.

The specific patch is no longer germane.. there was a large (more sensible)
change to the flush logic recently.  What is interesting is the kupdated/vm
interaction.. I saw it getting in the way here (so I whittled it down to
size.. made it small), and some posts I've seen seem to indicate the same.

("biggie" thing is what leads to rewrite throughput increase.  Whacking
kupdated only removes a noise source)

	-Mike


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-28 17:52 ` Kai Henningsen
@ 2001-08-28 21:54   ` Matthew M
  0 siblings, 0 replies; 124+ messages in thread
From: Matthew M @ 2001-08-28 21:54 UTC (permalink / raw)
  To: linux-kernel

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

#ifdef CONFIG_PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS

	cough();
	panic("*wheeze* I don't feel so good....");
	
#endif


- -- 
*matt* 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE7jBMtzhSxTQTEoE0RAjLZAJ9bBUIAKo71Zv7OnJi274+jWzsB4QCfT7eW
hj0jcQCbeP5etZ4kWtjDvio=
=USl2
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27  2:03 Rick Hohensee
  2001-08-27  2:52 ` Keith Owens
@ 2001-08-28 17:52 ` Kai Henningsen
  2001-08-28 21:54   ` Matthew M
  1 sibling, 1 reply; 124+ messages in thread
From: Kai Henningsen @ 2001-08-28 17:52 UTC (permalink / raw)
  To: linux-kernel

Hi Keith,
kaos@ocs.com.au (Keith Owens)  wrote on 27.08.01 in <20824.998880763@kao2.melbourne.sgi.com>:

> That reminds me, I have to add this config entry to kbuild.
>
> CONFIG_LLANFAIRPWLLGWYNGYLLGOGERYCHWYRNDROBWLLLLANTYSILIOGOGOGOCH
>   Use Welsh

Don't forget

CONFIG_KUMARREKSITUTESKENTELEENTUVAISEHKOLLAISMAISEKKUUDELLISENNESKENTELUTTELEMATTOMAMMUUKSISSANSAKKAANKOPAHAN
  Use Finnish

(Assuming I spelled that at least approximately right)

And while we're at it,

CONFIG_DONAUDAMPFSCHIFFARTSKAPITAENSWITWENRENTENANSPRUCHSFORMULAR
  Use German

CONFIG_PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS
  Use English

MfG Kai

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
@ 2001-08-28 15:28 Dieter Nützel
  0 siblings, 0 replies; 124+ messages in thread
From: Dieter Nützel @ 2001-08-28 15:28 UTC (permalink / raw)
  To: Linux Kernel List
  Cc: Marcelo Tosatti, Daniel Phillips, Mike Galbraith, Robert Love,
	ReiserFS List

On Tuesday, 28 August 2001 at 03:08, Dieter Nützel wrote:
> Would you like some numbers?
>
> I've generated some max-readahead numbers (dbench-1.1 32 clients) with
> 2.4.8-ac11,  2.4.8-ac12 (+ memory.c fix) and 2.4.8-ac12 (+ memory.c fix +
> low latency)
>
> system:
> Athlon I 550
> MSI MS-6167 Rev 1.0B, AMD Irongate C4 (without bypass)
> 640 MB PC100-2-2-2 SDRAM
> AHA-2940UW
> IBM U160 DDYS 18 GB, 10,000 rpm (in UW mode)
> all filesystems ReiserFS 3.6.25
>
> * readahead does not show dramatic differences
> * killall -STOP kupdated DOES

> 2.4.8-ac12 + The Right memory.c fix + low latency patch
>
> max-readahead 31 (default)
> Throughput 20.3505 MB/sec (NB=25.4381 MB/sec  203.505 MBit/sec)
> 25.430u 75.250s 3:28.58 48.2%   0+0k 0+0io 911pf+0w
>
> killall -STOP kupdated
>
> max-readahead 31 (default)
> Throughput 29.25 MB/sec (NB=36.5625 MB/sec  292.5 MBit/sec)
> 24.600u 86.370s 2:25.42 76.3%   0+0k 0+0io 911pf+0w
>
> max-readahead 511
> Throughput 30.0372 MB/sec (NB=37.5465 MB/sec  300.372 MBit/sec)
> 25.590u 75.910s 2:21.64 71.6%   0+0k 0+0io 911pf+0w

Argh, everyone has a weak moment from time to time...
I had patched in the low latency patch (patch-rml-2.4.8-ac12-preempt-kernel-1),
compiled and run it (see above), but hadn't actually enabled it....

So here are the right numbers now.
They are not much different, but it is good to see that low latency does not
harm disk throughput for this test.

* I never saw such low numbers for context switches (GREAT).
* load barely reaches half the value (~16, for 32 processes)
   compared to the normal kernel
* system is very smooth and snappy

Drawbacks:
Several (most/all) modules are missing the preempt_schedule symbol.
See the end of this mail.
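
Presumably the preempt patch defines preempt_schedule without exporting it to
modules; a one-line export along these lines (an untested guess on my part)
would be the usual cure for that kind of unresolved-symbol error:

/* sketch only: add next to wherever the preempt patch defines
 * preempt_schedule(), so that modules can resolve the symbol */
#include <linux/module.h>

EXPORT_SYMBOL(preempt_schedule);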

Regards,
	Dieter

PS Should I redo my tests with 2.4.9/2.4.10-pre1 + the max-readahead patch, or is
that pointless, as Daniel suggested?

2.4.8-ac12 + The Right memory.c fix + low latency patch
The Real Test (CONFIG_PREEMPT=y)
 
max-readahead 31 (default)
Throughput 19.9584 MB/sec (NB=24.948 MB/sec  199.584 MBit/sec)
26.480u 79.570s 3:32.66 49.8%   0+0k 0+0io 911pf+0w
 
killall -STOP kupdated
 
max-readahead 31 (default)
Throughput 29.6149 MB/sec (NB=37.0186 MB/sec  296.149 MBit/sec)
26.590u 78.830s 2:23.65 73.3%   0+0k 0+0io 911pf+0w
 
max-readahead 511
Throughput 30.3902 MB/sec (NB=37.9878 MB/sec  303.902 MBit/sec)
26.510u 78.430s 2:20.00 74.9%   0+0k 0+0io 911pf+0w

depmod -aev
[-]
xftw starting at /lib/modules/2.4 lstat on /lib/modules/2.4 failed
xftw starting at /lib/modules/kernel lstat on /lib/modules/kernel failed
xftw starting at /lib/modules/fs lstat on /lib/modules/fs failed
xftw starting at /lib/modules/net lstat on /lib/modules/net failed
xftw starting at /lib/modules/scsi lstat on /lib/modules/scsi failed
xftw starting at /lib/modules/block lstat on /lib/modules/block failed
xftw starting at /lib/modules/cdrom lstat on /lib/modules/cdrom failed
xftw starting at /lib/modules/ipv4 lstat on /lib/modules/ipv4 failed
xftw starting at /lib/modules/ipv6 lstat on /lib/modules/ipv6 failed
xftw starting at /lib/modules/sound lstat on /lib/modules/sound failed
xftw starting at /lib/modules/fc4 lstat on /lib/modules/fc4 failed
xftw starting at /lib/modules/video lstat on /lib/modules/video failed
xftw starting at /lib/modules/misc lstat on /lib/modules/misc failed
xftw starting at /lib/modules/pcmcia lstat on /lib/modules/pcmcia failed
xftw starting at /lib/modules/atm lstat on /lib/modules/atm failed
xftw starting at /lib/modules/usb lstat on /lib/modules/usb failed
xftw starting at /lib/modules/ide lstat on /lib/modules/ide failed
xftw starting at /lib/modules/ieee1394 lstat on /lib/modules/ieee1394 failed
xftw starting at /lib/modules/mtd lstat on /lib/modules/mtd failed
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/block/floppy.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/block/floppy.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/block/loop.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/block/loop.o
/lib/modules/2.4.8-ac12/kernel/drivers/cdrom/cdrom.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/char/drm/tdfx.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/char/drm/tdfx.o
/lib/modules/2.4.8-ac12/kernel/drivers/char/joystick/analog.o
/lib/modules/2.4.8-ac12/kernel/drivers/char/joystick/emu10k1-gp.o
/lib/modules/2.4.8-ac12/kernel/drivers/char/joystick/gameport.o
/lib/modules/2.4.8-ac12/kernel/drivers/char/joystick/sidewinder.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/char/lp.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/char/lp.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/char/ppdev.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/char/ppdev.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/char/serial.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/char/serial.o
/lib/modules/2.4.8-ac12/kernel/drivers/i2c/i2c-core.o
/lib/modules/2.4.8-ac12/kernel/drivers/i2c/i2c-dev.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/ide/ide-disk.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/ide/ide-disk.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/ide/ide-mod.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/ide/ide-mod.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/ide/ide-probe-mod.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/ide/ide-probe-mod.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/input/evdev.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/input/evdev.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/input/input.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/input/input.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/input/joydev.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/input/joydev.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/input/mousedev.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/input/mousedev.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/net/3c509.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/net/3c509.o
/lib/modules/2.4.8-ac12/kernel/drivers/net/bsd_comp.o
/lib/modules/2.4.8-ac12/kernel/drivers/net/dummy.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/net/eepro100.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/net/eepro100.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/net/ppp_async.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/net/ppp_async.o
/lib/modules/2.4.8-ac12/kernel/drivers/net/ppp_deflate.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/net/ppp_generic.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/net/ppp_generic.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/net/pppoe.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/net/pppoe.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/net/pppox.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/net/pppox.o
/lib/modules/2.4.8-ac12/kernel/drivers/net/slhc.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/parport/parport.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/parport/parport.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/parport/parport_pc.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/parport/parport_pc.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/pnp/isa-pnp.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/pnp/isa-pnp.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/scsi/sg.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/scsi/sg.o
/lib/modules/2.4.8-ac12/kernel/drivers/scsi/sr_mod.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/scsi/st.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/scsi/st.o
/lib/modules/2.4.8-ac12/kernel/drivers/sound/ac97_codec.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/sound/emu10k1/emu10k1.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/sound/emu10k1/emu10k1.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/sound/soundcore.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/sound/soundcore.o
/lib/modules/2.4.8-ac12/kernel/drivers/usb/dc2xx.o
/lib/modules/2.4.8-ac12/kernel/drivers/usb/hid.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/usb/scanner.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/usb/scanner.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/usb/usb-ohci.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/usb/usb-ohci.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/drivers/usb/usbcore.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/drivers/usb/usbcore.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/autofs/autofs.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/autofs/autofs.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/autofs4/autofs4.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/autofs4/autofs4.o
/lib/modules/2.4.8-ac12/kernel/fs/binfmt_aout.o
depmod: *** Unresolved symbols in /lib/modules/2.4.8-ac12/kernel/fs/fat/fat.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/fat/fat.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/isofs/isofs.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/isofs/isofs.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/lockd/lockd.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/lockd/lockd.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/minix/minix.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/minix/minix.o
/lib/modules/2.4.8-ac12/kernel/fs/msdos/msdos.o
depmod: *** Unresolved symbols in /lib/modules/2.4.8-ac12/kernel/fs/nfs/nfs.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/nfs/nfs.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/nfsd/nfsd.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/nfsd/nfsd.o
/lib/modules/2.4.8-ac12/kernel/fs/nls/nls_cp437.o
/lib/modules/2.4.8-ac12/kernel/fs/nls/nls_cp850.o
/lib/modules/2.4.8-ac12/kernel/fs/nls/nls_iso8859-1.o
/lib/modules/2.4.8-ac12/kernel/fs/nls/nls_iso8859-15.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/romfs/romfs.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/romfs/romfs.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/smbfs/smbfs.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/smbfs/smbfs.o
depmod: *** Unresolved symbols in /lib/modules/2.4.8-ac12/kernel/fs/udf/udf.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/udf/udf.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/fs/vfat/vfat.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/fs/vfat/vfat.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_conntrack.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_conntrack.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_conntrack_ftp.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_conntrack_ftp.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_nat_ftp.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_nat_ftp.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_tables.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ip_tables.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipchains.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipchains.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_LOG.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_LOG.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_MASQUERADE.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_MASQUERADE.o
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_REDIRECT.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_limit.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_limit.o
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_state.o
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/ipt_tos.o
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/iptable_filter.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/iptable_nat.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv4/netfilter/iptable_nat.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv6/ipv6.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv6/ipv6.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv6/netfilter/ip6_tables.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv6/netfilter/ip6_tables.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/ipv6/netfilter/ip6t_limit.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/ipv6/netfilter/ip6t_limit.o
/lib/modules/2.4.8-ac12/kernel/net/ipv6/netfilter/ip6table_filter.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/khttpd/khttpd.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/khttpd/khttpd.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/packet/af_packet.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/packet/af_packet.o
depmod: *** Unresolved symbols in 
/lib/modules/2.4.8-ac12/kernel/net/sunrpc/sunrpc.o
depmod:         preempt_schedule
/lib/modules/2.4.8-ac12/kernel/net/sunrpc/sunrpc.o
/lib/modules/2.4.8-ac12/misc/mssclampfw.o

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 23:00                       ` Marcelo Tosatti
@ 2001-08-28  3:10                         ` Linus Torvalds
  0 siblings, 0 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-08-28  3:10 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: phillips, linux-kernel


On Mon, 27 Aug 2001, Marcelo Tosatti wrote:
>
> We would need to pass a "READ or READA" flag to the get_block_t calls
> too, which in turn would end up in a lot of code changes in the low-level
> filesystems.

No, that's over-doing it. It's not worth it - you are better off just
taking the (rather smallish) risk that the meta-data isn't already cached.

In the long run, if you _really_ want to be clever, then yes. But in the
short run I doubt it's all that noticeable.

> I'm looking forward to "re-implementing" the READA/WRITEA logic for 2.5.
>
> Do you have any ideas/comments on how to do that with the least amount of
> pain?

Just worry about data, not meta-data. That simplifies the whole issue a
_lot_, and means that you really only need to change readpage().

		Linus


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 21:44                     ` Linus Torvalds
  2001-08-27 22:30                       ` Daniel Phillips
@ 2001-08-27 23:00                       ` Marcelo Tosatti
  2001-08-28  3:10                         ` Linus Torvalds
  1 sibling, 1 reply; 124+ messages in thread
From: Marcelo Tosatti @ 2001-08-27 23:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: phillips, linux-kernel



On Mon, 27 Aug 2001, Linus Torvalds wrote:

> In article <20010827203125Z16070-32383+1731@humbolt.nl.linux.org> you write:
> >On August 27, 2001 09:43 pm, Oliver Neukum wrote:
> >> 
> >> If we are optimising for streaming (which readahead is made for) dropping 
> >> only one page will buy you almost nothing in seek time. You might just as 
> >> well drop them all and correct your error in one larger read if necessary.
> >> Dropping the oldest page is possibly the worst you can do, as you will need 
> >> it soonest.
> >
> >Yes, good point.  OK, I'll re-examine the dropping logic.  Bear in mind, 
> >dropping readahead pages is not supposed to happen frequently under 
> >steady-state operation, so it's not that critical what we do here, it's going 
> >to be hard to create a load that shows the impact.  The really big benefit 
> >comes from not overdoing the readahead in the first place, and not underdoing 
> >it either.
> 
> Note that the big reason why I did _not_ end up just increasing the
> read-ahead value from 31 to 511 (it was there for a short while) is that
> large read-ahead does not necessarily improve performance AT ALL,
> regardless of memory pressure. 
> 
> Why? Because if the IO request queue fills up, the read-ahead actually
> ends up waiting for requests, and ends up being synchronous. Which
> totally destroys the whole point of doing read-ahead in the first place.
> And a large read-ahead only makes this more likely.
> 
> Also note that doing tons of parallel reads _also_ makes this more
> likely, and actually ends up also mixing the read-ahead streams which is
> exactly what you do not want to do.
> 
> The solution to both problems is to make the read-ahead not wait
> synchronously on requests - that way the request allocation itself ends
> up being a partial throttle on memory usage too, so that you actually
> probably end up fixing the problem of memory pressure _too_.
> 
> This requires that the read-ahead code would start submitting the blocks
> using READA, which in turn requires that the readpage() function get a
> "READ vs READA" argument.  And the ll_rw_block code would obviously have
> to honour the rw_ahead hint and submit_bh() would have to return an
> error code - which it currently doesn't do, but which should be trivial
> to implement. 

It's not easy to do.

We would need to pass a "READ or READA" flag to the get_block_t calls
too, which in turn would end up in a lot of code changes in the low-level
filesystems.

Reading metadata to map data which you're not going to read is pretty
stupid.

I'm looking forward to "re-implementing" the READA/WRITEA logic for 2.5.

Do you have any ideas/comments on how to do that with the least amount of
pain?




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 21:44                     ` Linus Torvalds
@ 2001-08-27 22:30                       ` Daniel Phillips
  2001-08-27 23:00                       ` Marcelo Tosatti
  1 sibling, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 22:30 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

On August 27, 2001 11:44 pm, Linus Torvalds wrote:
> In article <20010827203125Z16070-32383+1731@humbolt.nl.linux.org> you write:
> >On August 27, 2001 09:43 pm, Oliver Neukum wrote:
> >> 
> >> If we are optimising for streaming (which readahead is made for) dropping 
> >> only one page will buy you almost nothing in seek time. You might just as 
> >> well drop them all and correct your error in one larger read if necessary.
> >> Dropping the oldest page is possibly the worst you can do, as you will need 
> >> it soonest.
> >
> >Yes, good point.  OK, I'll re-examine the dropping logic.  Bear in mind, 
> >dropping readahead pages is not supposed to happen frequently under 
> >steady-state operation, so it's not that critical what we do here, it's going 
> >to be hard to create a load that shows the impact.  The really big benefit 
> >comes from not overdoing the readahead in the first place, and not underdoing 
> >it either.
> 
> Note that the big reason why I did _not_ end up just increasing the
> read-ahead value from 31 to 511 (it was there for a short while) is that
> large read-ahead does not necessarily improve performance AT ALL,
> regardless of memory pressure. 
> 
> Why? Because if the IO request queue fills up, the read-ahead actually
> ends up waiting for requests, and ends up being synchronous. Which
> totally destroys the whole point of doing read-ahead in the first place.
> And a large read-ahead only makes this more likely.
> 
> Also note that doing tons of parallel reads _also_ makes this more
> likely, and actually ends up also mixing the read-ahead streams which is
> exactly what you do not want to do.
> 
> The solution to both problems is to make the read-ahead not wait
> synchronously on requests - that way the request allocation itself ends
> up being a partial throttle on memory usage too, so that you actually
> probably end up fixing the problem of memory pressure _too_.
> 
> This requires that the read-ahead code would start submitting the blocks
> using READA, which in turn requires that the readpage() function get a
> "READ vs READA" argument.  And the ll_rw_block code would obviously have
> to honour the rw_ahead hint and submit_bh() would have to return an
> error code - which it currently doesn't do, but which should be trivial
> to implement. 
> 
> I really think that doing anything else is (a) stupid and (b) wrong.
> Trying to come up with a complex algorithm on how to change read-ahead
> based on memory pressure is just bound to be extremely fragile and have
> strange performance effects. While letting the IO layer throttle the
> read-ahead on its own is the natural and high-performance approach.

In the real-world case we observed the readahead was actually being
throttled by the ftp clients.  IO request throttling on the file read
side would not have prevented cache from overfilling.  Once the cache
filled up, readahead pages started being dropped and reread, cutting
the server throughput by a factor of 2 or so.  On the other hand,
performance with no readahead was even worse.

The solution was to set readahead down to a low enough number so that
the maximum number of clients allowed would not overfill the cache.
This is clearly fragile since an extra load on the machine from any
source could send the machine back into readahead-thrash mode.
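
To put illustrative numbers on that (these are made-up figures, not the ones
from the real case): with 4 KB pages, max-readahead 31 pins at most
31 * 4 KB = 124 KB of readahead per stream, so 1000 clients hold about 121 MB;
at max-readahead 511 each stream can pin up to 511 * 4 KB = 2044 KB, so the
same 1000 clients can tie up roughly 2 GB, which overruns any reasonable cache
and forces exactly the drop-and-reread behaviour described above.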

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 21:38                           ` Oliver.Neukum
@ 2001-08-27 22:26                             ` Alex Bligh - linux-kernel
  0 siblings, 0 replies; 124+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-08-27 22:26 UTC (permalink / raw)
  To: Oliver.Neukum, Alex Bligh - linux-kernel
  Cc: Rik van Riel, Daniel Phillips, Helge Hafting, linux-kernel,
	Alex Bligh - linux-kernel



--On Monday, 27 August, 2001 11:38 PM +0200 
Oliver.Neukum@lrz.uni-muenchen.de wrote:

> How do you measure cost of
> replacement ?

See previous mail with cost of E. However, this
is broken w.r.t. differential speed of consumption
by tasks, and as Daniel points out, if the system
works as he's designed/designing, the key will be
sizing readahead sufficiently intelligently that
drops happen infrequently anyway. If it works like
TCP windows, it need not be particularly intelligent
(very punitive, but infrequent), and we can do
the tweaks of what packets to drop (cf RED in TCP)
later.

My point w.r.t. penalizing fast consumers is that
negative feedback in a control system isn't necessarily
bad.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 20:37                     ` Daniel Phillips
@ 2001-08-27 22:10                       ` Oliver.Neukum
  0 siblings, 0 replies; 124+ messages in thread
From: Oliver.Neukum @ 2001-08-27 22:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Neukum, linux-kernel

Hi,

> > If we are optimising for streaming (which readahead is made for) dropping
> > only one page will buy you almost nothing in seek time. You might just as
> > well drop them all and correct your error in one larger read if necessary.
> > Dropping the oldest page is possibly the worst you can do, as you will need
> > it soonest.
>
> Yes, good point.  OK, I'll re-examine the dropping logic.  Bear in mind,
> dropping readahead pages is not supposed to happen frequently under
> steady-state operation, so it's not that critical what we do here, it's going
> to be hard to create a load that shows the impact.  The really big benefit
> comes from not overdoing the readahead in the first place, and not underdoing
> it either.

OK, so you have four reasons for dropping a readahead page

1) File was closed - trivial I'd say
2) Memory needed for other readahead
3) Memory needed for anything but readahead
4) The page won't be needed - some kind of timeout ?

Cases 3 and 2 are hard. Should replacement policy vary ?

> > > > If you are streaming dropping all should be no great loss.
> > >
> > > The question is, how do you know you're streaming?  Some files are
> > > read/written many times and some files are accessed randomly.  I'm trying
> > > to avoid penalizing these admittedly rarer, but still important cases.
> >
> > For those other case we have pages on the active list which are properly
> > aged, haven't we ?
>
> Not if they never get a chance to be moved to the active list.  We have to be
> careful to provide that opportunity.

I see no reason a change to readahead would affect that. Maybe I am dense.
Relevant seems to be rather whether you use read-once or put them on the
active list right away.

Could information on the use pattern be stored in the dentry ?
E.g. If we opened it twice in the last 30 seconds we put all referenced
pages into the active list unaged to ensure that they'll stay in core.

> > And readahead won't help you for random access unless you can cache it all.
> > You might throw in a few extra blocks if you have free memory, but then you
> > have free memory anyway.
>
> Readahead definitely can help in some types of random access loads, e.g.,
> when access size is larger than a page, or when the majority of pages of a
> file are eventually accessed, but in random order.  On the other hand, it

Only if they fit into RAM. There's nothing wrong with reading the whole
accessed file if it's small enough and you have low memory pressure.
It might cut down on netscape launch times drastically on large machines.
But if there's memory pressure you can hardly evict pages on the active
list for readahead, can you ?

> will only hurt for short, sparse, random reads.  Such a pattern needs to be
> detected in generic_file_readahead.

Interesting; how do you do that, keep a list of recent reads?

	Regards
		Oliver



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:43                   ` Oliver Neukum
  2001-08-27 20:37                     ` Daniel Phillips
@ 2001-08-27 21:44                     ` Linus Torvalds
  2001-08-27 22:30                       ` Daniel Phillips
  2001-08-27 23:00                       ` Marcelo Tosatti
  1 sibling, 2 replies; 124+ messages in thread
From: Linus Torvalds @ 2001-08-27 21:44 UTC (permalink / raw)
  To: phillips, linux-kernel

In article <20010827203125Z16070-32383+1731@humbolt.nl.linux.org> you write:
>On August 27, 2001 09:43 pm, Oliver Neukum wrote:
>> 
>> If we are optimising for streaming (which readahead is made for) dropping 
>> only one page will buy you almost nothing in seek time. You might just as 
>> well drop them all and correct your error in one larger read if necessary.
>> Dropping the oldest page is possibly the worst you can do, as you will need 
>> it soonest.
>
>Yes, good point.  OK, I'll re-examine the dropping logic.  Bear in mind, 
>dropping readahead pages is not supposed to happen frequently under 
>steady-state operation, so it's not that critical what we do here, it's going 
>to be hard to create a load that shows the impact.  The really big benefit 
>comes from not overdoing the readahead in the first place, and not underdoing 
>it either.

Note that the big reason why I did _not_ end up just increasing the
read-ahead value from 31 to 511 (it was there for a short while) is that
large read-ahead does not necessarily improve performance AT ALL,
regardless of memory pressure. 

Why? Because if the IO request queue fills up, the read-ahead actually
ends up waiting for requests, and ends up being synchronous. Which
totally destroys the whole point of doing read-ahead in the first place.
And a large read-ahead only makes this more likely.

Also note that doing tons of parallel reads _also_ makes this more
likely, and actually ends up also mixing the read-ahead streams which is
exactly what you do not want to do.

The solution to both problems is to make the read-ahead not wait
synchronously on requests - that way the request allocation itself ends
up being a partial throttle on memory usage too, so that you actually
probably end up fixing the problem of memory pressure _too_.

This requires that the read-ahead code would start submitting the blocks
using READA, which in turn requires that the readpage() function get a
"READ vs READA" argument.  And the ll_rw_block code would obviously have
to honour the rw_ahead hint and submit_bh() would have to return an
error code - which it currently doesn't do, but which should be trivial
to implement. 

I really think that doing anything else is (a) stupid and (b) wrong.
Trying to come up with a complex algorithm on how to change read-ahead
based on memory pressure is just bound to be extremely fragile and have
strange performance effects. While letting the IO layer throttle the
read-ahead on its own is the natural and high-performance approach.
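
To make the shape of that concrete, an illustrative sketch only (this is not
code from any tree; the rw-aware readpage and the error return from the
submission path are exactly the parts that do not exist today, and the helper
name is invented):

/* Sketch of the proposed flow: readahead submits with READA, and if the
 * block layer has no free request it refuses instead of sleeping, so the
 * readahead page is simply dropped rather than read synchronously. */
static void sketch_readahead_one_page(struct file *file, struct page *page)
{
	/* sketch_readpage_rw() stands in for the hypothetical rw-aware
	 * readpage() described above; today readpage() takes no
	 * READ/READA argument at all. */
	if (sketch_readpage_rw(file, page, READA) == -EAGAIN) {
		/* request queue full: give up on this page quietly */
		UnlockPage(page);
		page_cache_release(page);
	}
}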

		Linus

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 20:19                         ` Alex Bligh - linux-kernel
@ 2001-08-27 21:38                           ` Oliver.Neukum
  2001-08-27 22:26                             ` Alex Bligh - linux-kernel
  0 siblings, 1 reply; 124+ messages in thread
From: Oliver.Neukum @ 2001-08-27 21:38 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel
  Cc: Oliver Neukum, Rik van Riel, Daniel Phillips, Helge Hafting,
	linux-kernel

On Mon, 27 Aug 2001, Alex Bligh - linux-kernel wrote:

> Oliver,
>
> --On Monday, 27 August, 2001 10:03 PM +0200 Oliver Neukum
> <Oliver.Neukum@lrz.uni-muenchen.de> wrote:
>
> > what leads you to this conclusion ?
> > A task that needs little time to process data it reads in is hurt much
> > more  by added latency due to a disk read.
>
> I meant that dropping readahed pages from dd from a floppy (or
> slow network connection) is going to cost more to replace
> than dropping the same number of readahead pages from dd from
> a fast HD. By fast, I meant fast to read in from the file.

There you are perfectly right. I misunderstood. How do you measure cost of
replacement ?

> If the task is slow, because it's CPU bound (or bound by
> other I/O), and /that/ causes the stream to be slow to
> empty, then as you say, we have the opposite problem.
> On the other hand, it might only be a fast reading task
> compared to others as other tasks are blocking on stuff
> requiring memory, and all the memory is allocated to that
> stream's readahead buffer. So penalizing slow tasks and
> prioritizing fast ones may cause an avalanche effect.
>
> Complicated.

Do we need a maximum readahead based on reasonable latency of the device
in question ?
If on the other hand a task is very fast in processing its buffers the
readahead queue will _not_ be long. The task will however use a lot of IO
bandwidth. Strictly speaking this is a question of IO scheduling.
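To put rough numbers on the first question, the cap could be something like
"what the device can deliver within an acceptable latency" -- a
back-of-the-envelope sketch, all figures invented:

    #include <stdio.h>

    /* pages of readahead a device can justify within max_latency seconds */
    static long max_ra_pages(long bytes_per_sec, double max_latency)
    {
            return (long)(bytes_per_sec * max_latency) / 4096;
    }

    int main(void)
    {
            printf("disk:   %ld pages\n", max_ra_pages(20 * 1000 * 1000, 0.1));
            printf("floppy: %ld pages\n", max_ra_pages(50 * 1000, 0.1));
            return 0;
    }

which would give a modern disk a few hundred pages of readahead and a
floppy essentially one.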

	Regards
		Oliver



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:34                     ` Alex Bligh - linux-kernel
  2001-08-27 20:03                       ` Oliver Neukum
@ 2001-08-27 21:29                       ` Daniel Phillips
  1 sibling, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 21:29 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Rik van Riel
  Cc: Helge Hafting, linux-kernel, Alex Bligh - linux-kernel

On August 27, 2001 09:34 pm, Alex Bligh - linux-kernel wrote:
> As another optimization, we may need to think of pages used
> by multiple streams. Think, for instance, of 'make -j' and
> header files, or many users ftp'ing down the same file.
> Just because one gcc process has read past
> a block in a header file, I submit that we are less keen to
> drop it if it is in the readahead chain for another.

This is supposed to be handled by putting the page on the active list and 
aging it up, i.e., the current behaviour.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:55                 ` Richard Gooch
  2001-08-27 20:09                   ` Oliver Neukum
@ 2001-08-27 21:06                   ` Daniel Phillips
  1 sibling, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 21:06 UTC (permalink / raw)
  To: Richard Gooch; +Cc: Oliver Neukum, linux-kernel

On August 27, 2001 09:55 pm, Richard Gooch wrote:
> Daniel Phillips writes:
> > The question is, how do you know you're streaming?  Some files are
> > read/written many times and some files are accessed randomly.  I'm
> > trying to avoid penalizing these admittedly rarer, but still
> > important cases.
> 
> I wonder if we're trying to do the impossible: an algorithm that works
> great for very different workloads, without hints from the process.

By nature, it's impossible to do optimal page replacement without being 
prescient.  Nonetheless, it is possible to spot some patterns and take 
advantage of them.

Look at bzip if you need inspiration.  I've seen it do a few bytes worse 
occasionally, but on average it does the job about 20% better than gzip, 
amazing.

> Shouldn't we encourage use of madvise(2) more? And if needed, add
> O_DROPBEHIND and similar flags for open(2).
> 
> The application knows how it's going to use data/memory. It should
> tell the kernel so the kernel can choose the best algorithm.

The hooks are there but it's unlikely very many people will ever use them, 
even if encouraged.  Also, the kernel has information available to it that 
the application programmer does not.  For example, the kernel knows about the 
current, system-wide load.

The ideal arrangement is for madvise to complement the kernel's automagic 
heuristics.  Bearing that in mind, I'll take care not to break it.

--
Daniel


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:43                   ` Oliver Neukum
@ 2001-08-27 20:37                     ` Daniel Phillips
  2001-08-27 22:10                       ` Oliver.Neukum
  2001-08-27 21:44                     ` Linus Torvalds
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 20:37 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: linux-kernel

On August 27, 2001 09:43 pm, Oliver Neukum wrote:
> Am Montag, 27. August 2001 21:04 schrieb Daniel Phillips:
> > On August 27, 2001 08:37 pm, Oliver Neukum wrote:
> > > Hi,
> > >
> > > >   - Readahead cache is naturally a fifo - new chunks of readahead
> > > >     are added at the head and unused readahead is (eventually)
> > > >     culled from the tail.
> > >
> > > do you really want to do this based on pages ? Should you not drop all
> > > pages associated with the inode that wasn't touched for the longest
> > > time ?
> >
> > Isn't that very much the same as dropping pages from the end of the
> > readahead queue?
> 
> No, the end of the readahead queue will usually have pages from many 
> inodes(or perhaps it should be attached to open files as two tasks may read 
> different parts of one file).
> 
> If we are optimising for streaming (which readahead is made for) dropping 
> only one page will buy you almost nothing in seek time. You might just as 
> well drop them all and correct your error in one larger read if necessary.
> Dropping the oldest page is possibly the worst you can do, as you will need 
> it soonest.

Yes, good point.  OK, I'll re-examine the dropping logic.  Bear in mind, 
dropping readahead pages is not supposed to happen frequently under 
steady-state operation, so it's not that critical what we do here, it's going 
to be hard to create a load that shows the impact.  The really big benefit 
comes from not overdoing the readahead in the first place, and not underdoing 
it either.

> > > If you are streaming dropping all should be no great loss.
> >
> > The question is, how do you know you're streaming?  Some files are
> > read/written many times and some files are accessed randomly.  I'm trying
> > to avoid penalizing these admittedly rarer, but still important cases.
> 
> For those other cases we have pages on the active list which are properly 
> aged, haven't we ?

Not if they never get a chance to be moved to the active list.  We have to be 
careful to provide that opportunity.

> And readahead won't help you for random access unless you can cache it all.
> You might throw in a few extra blocks if you have free memory, but then you 
> have free memory anyway.

Readahead definitely can help in some types of random access loads, e.g., 
when access size is larger than a page, or when the majority of pages of a 
file are eventually accessed, but in random order.  On the other hand, it 
will only hurt for short, sparse, random reads.  Such a pattern needs to be 
detected in generic_file_readahead.
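Purely for the sake of discussion, a toy sketch of what such detection could
look like -- this is not the actual generic_file_readahead() logic, and every
name and constant below is invented:

    #include <stdio.h>

    struct ra_state {
            long next_expected;  /* page index a sequential read would hit next */
            int  window;         /* current readahead window, in pages */
    };

    #define RA_MAX 32

    static int update_window(struct ra_state *ra, long index, int nr_pages)
    {
            if (index == ra->next_expected || nr_pages > 1) {
                    /* sequential, or large enough that readahead still pays off */
                    ra->window = ra->window ? ra->window * 2 : 1;
                    if (ra->window > RA_MAX)
                            ra->window = RA_MAX;
            } else {
                    /* short, sparse, random read: back the window off */
                    ra->window /= 2;
            }
            ra->next_expected = index + nr_pages;
            return ra->window;
    }

    int main(void)
    {
            struct ra_state ra = { 0, 0 };

            printf("%d\n", update_window(&ra, 0, 1));   /* sequential: grows to 1 */
            printf("%d\n", update_window(&ra, 1, 1));   /* still sequential: 2 */
            printf("%d\n", update_window(&ra, 500, 1)); /* random jump: back to 1 */
            return 0;
    }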

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:36                   ` Alex Bligh - linux-kernel
@ 2001-08-27 20:24                     ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 20:24 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Helge Hafting, linux-kernel
  Cc: Alex Bligh - linux-kernel

On August 27, 2001 09:36 pm, Alex Bligh - linux-kernel wrote:
> --On Monday, 27 August, 2001 6:02 PM +0200 Daniel Phillips 
> <phillips@bonn-fries.net> wrote:
> 
> > On the other hand, we
> > will penalize faster streams that way
> 
> Penalizing faster streams for the same number
> of pages is probably a good thing
> as they cost less time to replace.

Let me clarify, the stream is fast because its client is fast.  The disk will 
service the reads at the same speed for all the streams.  (Let's not go into 
the multi-disk case just now, ok?)

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 20:03                       ` Oliver Neukum
@ 2001-08-27 20:19                         ` Alex Bligh - linux-kernel
  2001-08-27 21:38                           ` Oliver.Neukum
  0 siblings, 1 reply; 124+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-08-27 20:19 UTC (permalink / raw)
  To: Oliver Neukum, Alex Bligh - linux-kernel, Rik van Riel
  Cc: Daniel Phillips, Helge Hafting, linux-kernel, Alex Bligh - linux-kernel

Oliver,

--On Monday, 27 August, 2001 10:03 PM +0200 Oliver Neukum 
<Oliver.Neukum@lrz.uni-muenchen.de> wrote:

> what leads you to this conclusion ?
> A task that needs little time to process data it reads in is hurt much
> more  by added latency due to a disk read.

I meant that dropping readahead pages from dd from a floppy (or
slow network connection) is going to cost more to replace
than dropping the same number of readahead pages from dd from
a fast HD. By fast, I meant fast to read in from the file.

If the task is slow, because it's CPU bound (or bound by
other I/O), and /that/ causes the stream to be slow to
empty, then as you say, we have the opposite problem.
On the other hand, it might only be a fast reading task
compared to others as other tasks are blocking on stuff
requiring memory, and all the memory is allocated to that
stream's readahead buffer. So penalizing slow tasks and
prioritizing fast ones may cause an avalanche effect.

Complicated.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:55                 ` Richard Gooch
@ 2001-08-27 20:09                   ` Oliver Neukum
  2001-08-27 21:06                   ` Daniel Phillips
  1 sibling, 0 replies; 124+ messages in thread
From: Oliver Neukum @ 2001-08-27 20:09 UTC (permalink / raw)
  To: Richard Gooch, Daniel Phillips; +Cc: linux-kernel

Am Montag, 27. August 2001 21:55 schrieb Richard Gooch:
> Daniel Phillips writes:
> > The question is, how do you know you're streaming?  Some files are
> > read/written many times and some files are accessed randomly.  I'm
> > trying to avoid penalizing these admittedly rarer, but still
> > important cases.
>
> I wonder if we're trying to do the impossible: an algorithm that works
> great for very different workloads, without hints from the process.

For streaming we should be able to detect consecutive reads.
If it's not that easy could we not measure hit/miss ratios ?
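A back-of-the-envelope version of the hit/miss idea, just to show what I
mean -- the thresholds and numbers are invented:

    #include <stdio.h>

    struct ra_stats {
            long hits;      /* readahead pages later read by the application */
            long misses;    /* readahead pages dropped unused */
    };

    static int adjust_window(const struct ra_stats *s, int window)
    {
            long total = s->hits + s->misses;

            if (total < 16)
                    return window;                  /* not enough data yet */
            if (s->misses * 4 > total)              /* more than 25% wasted */
                    return window > 1 ? window / 2 : 1;
            return window < 32 ? window * 2 : 32;
    }

    int main(void)
    {
            struct ra_stats good = { 100, 4 }, bad = { 10, 90 };

            printf("%d %d\n", adjust_window(&good, 8), adjust_window(&bad, 8));
            return 0;
    }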

> Shouldn't we encourage use of madvise(2) more? And if needed, add
> O_DROPBEHIND and similar flags for open(2).

For symmetry I'd rather have an fadvise. Besides, usage patterns may change.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:34                     ` Alex Bligh - linux-kernel
@ 2001-08-27 20:03                       ` Oliver Neukum
  2001-08-27 20:19                         ` Alex Bligh - linux-kernel
  2001-08-27 21:29                       ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Oliver Neukum @ 2001-08-27 20:03 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Rik van Riel
  Cc: Daniel Phillips, Helge Hafting, linux-kernel

Hi,

> Thinking about it a bit more, we also want to drop pages
> from fast streams faster, to an extent, than we drop
> them from slow streams (as well as dropping quite
> a few pages at once), as these 'cost' more to replace.

what leads you to this conclusion ?
A task that needs little time to process data it reads in is hurt much more 
by added latency due to a disk read.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 18:37               ` Oliver Neukum
  2001-08-27 19:04                 ` Daniel Phillips
@ 2001-08-27 19:55                 ` Richard Gooch
  2001-08-27 20:09                   ` Oliver Neukum
  2001-08-27 21:06                   ` Daniel Phillips
  1 sibling, 2 replies; 124+ messages in thread
From: Richard Gooch @ 2001-08-27 19:55 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Oliver Neukum, linux-kernel

Daniel Phillips writes:
> The question is, how do you know you're streaming?  Some files are
> read/written many times and some files are accessed randomly.  I'm
> trying to avoid penalizing these admittedly rarer, but still
> important cases.

I wonder if we're trying to do the impossible: an algorithm that works
great for very different workloads, without hints from the process.

Shouldn't we encourage use of madvise(2) more? And if needed, add
O_DROPBEHIND and similar flags for open(2).

The application knows how it's going to use data/memory. It should
tell the kernel so the kernel can choose the best algorithm.
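As a concrete illustration of the kind of hint meant here, using the
madvise(2) call that already exists (O_DROPBEHIND above is only a proposal,
so it is not shown):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd;
            struct stat st;
            void *map;

            if (argc != 2)
                    return 1;
            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || fstat(fd, &st) < 0)
                    return 1;

            map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED)
                    return 1;

            /* tell the VM we will stream through this mapping sequentially */
            madvise(map, st.st_size, MADV_SEQUENTIAL);

            /* ... read through the data here, front to back ... */

            munmap(map, st.st_size);
            close(fd);
            return 0;
    }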

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 19:04                 ` Daniel Phillips
@ 2001-08-27 19:43                   ` Oliver Neukum
  2001-08-27 20:37                     ` Daniel Phillips
  2001-08-27 21:44                     ` Linus Torvalds
  0 siblings, 2 replies; 124+ messages in thread
From: Oliver Neukum @ 2001-08-27 19:43 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Am Montag, 27. August 2001 21:04 schrieb Daniel Phillips:
> On August 27, 2001 08:37 pm, Oliver Neukum wrote:
> > Hi,
> >
> > >   - Readahead cache is naturally a fifo - new chunks of readahead
> > >     are added at the head and unused readahead is (eventually)
> > >     culled from the tail.
> >
> > do you really want to do this based on pages ? Should you not drop all
> > pages associated with the inode that wasn't touched for the longest
> > time ?
>
> Isn't that very much the same as dropping pages from the end of the
> readahead queue?

No, the end of the readahead queue will usually have pages from many 
inodes(or perhaps it should be attached to open files as two tasks may read 
different parts of one file).

If we are optimising for streaming (which readahead is made for) dropping 
only one page will buy you almost nothing in seek time. You might just as 
well drop them all and correct your error in one larger read if necessary.
Dropping the oldest page is possibly the worst you can do, as you will need 
it soonest.

> > If you are streaming dropping all should be no great loss.
>
> The question is, how do you know you're streaming?  Some files are
> read/written many times and some files are accessed randomly.  I'm trying
> to avoid penalizing these admittedly rarer, but still important cases.

For those other cases we have pages on the active list which are properly 
aged, haven't we ?
And readahead won't help you for random access unless you can cache it all.
You might throw in a few extra blocks if you have free memory, but then you 
have free memory anyway.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 16:02                 ` Daniel Phillips
@ 2001-08-27 19:36                   ` Alex Bligh - linux-kernel
  2001-08-27 20:24                     ` Daniel Phillips
  0 siblings, 1 reply; 124+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-08-27 19:36 UTC (permalink / raw)
  To: Daniel Phillips, Alex Bligh - linux-kernel, Helge Hafting, linux-kernel
  Cc: Alex Bligh - linux-kernel

--On Monday, 27 August, 2001 6:02 PM +0200 Daniel Phillips 
<phillips@bonn-fries.net> wrote:

> On the other hand, we
> will penalize faster streams that way

Penalizing faster streams for the same number
of pages is probably a good thing
as they cost less time to replace.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
       [not found]                   ` <Pine.LNX.4.33L.0108271213370.5646-100000@imladris.rielhome.conectiva>
@ 2001-08-27 19:34                     ` Alex Bligh - linux-kernel
  2001-08-27 20:03                       ` Oliver Neukum
  2001-08-27 21:29                       ` Daniel Phillips
  0 siblings, 2 replies; 124+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-08-27 19:34 UTC (permalink / raw)
  To: Rik van Riel, Alex Bligh - linux-kernel
  Cc: Daniel Phillips, Helge Hafting, linux-kernel, Alex Bligh - linux-kernel

Rik,

Terminological confusion - see below.

--On Monday, 27 August, 2001 12:14 PM -0300 Rik van Riel 
<riel@conectiva.com.br> wrote:

>> If you are reading
>> ahead (let's have caps for a page that has been used for reading,
>> as well as read from the disk, and lowercase for read-ahead that
>> has not been used):
>> 	ABCDefghijklmnopq
>>              |            |
>>             read         disk
>> 	   ptr          head
>> and you want to reclaim memory, you want to drop (say) 'pq'
>> to get
>> 	ABCDefghijklmno
>> for two reasons: firstly because 'efg' etc. are most likely
>> to be used NEXT, and secondly because the diskhead is nearer
>> 'pq' when you (inevitably) have to read it again.
>
> This is NOT MRU, since p and q have not been used yet.
> In this example you really want to drop D and C instead.

Daniel was (I think) suggesting that readahead blocks that hadn't
been read go on a different list entirely from other blocks, which
under his scheme are managed by the used-once logic; he wrote:

>   - A new readahead page starts on the readahead queue.  When used
>     (by generic_file_read) the readahead page moves to the inactive
>     queue and becomes a used-once page (i.e., low priority).  If a
>     readahead page reaches the tail of the readahead queue it may
>     be culled by moving it to the inactive queue.

and:

>   - Readahead pages have higher priority than inactive pages, lower
>     than active.

So, in the context of this list, A-D are not even candidates
for dropping /in terms of this list/, because they'd already
have become inactive pages, and have been dropped first.
My point was (after ABCD have been dropped), the pages
should be dropped in the order qponm... etc.

All the pages on this list have been read from disk, but
not read by generic_file_read. I guess I meant 'drop the
most recently /read from disk/'.

Thinking about it a bit more, we also want to drop pages
from fast streams faster, to an extent, than we drop
them from slow streams (as well as dropping quite
a few pages at once), as these 'cost' more to replace.

Let's assume we keep one queue per stream (this may have some
other advantages, like making it dead easy to make all
the pages inactive if the stream suddenly closes)

If we timestamp pages as they are read, we can approximate
the ease of dropping one page by QueueLength/(T(tail)-T(head))
(that's 1/TimeToReadAPage), and the ease of dropping the entire
queue by (roughly)

                  2
       QueueLength
  E = ------------------
   q   T       - T
        q,tail    q,head


So a possible heuristic is to drop HALF the pages in the
queue (q) with the highest E(q) value. This will halve the queue
length, and approximately halve the difference in T, halving
the value of E, and will ensure a long fast queue frees
up a fair number of pages. Repeat until you've reaped all
the pages you need.
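In code, the heuristic would look something like this rough user-space
illustration -- the streams, timestamps and numbers are all made up:

    #include <stdio.h>

    struct ra_queue {
            const char *name;
            int len;                        /* pages currently queued */
            double t_oldest, t_newest;      /* timestamps of the two ends */
    };

    /* magnitude of E(q) = len(q)^2 / (T(q,tail) - T(q,head)) above */
    static double ease(const struct ra_queue *q)
    {
            double dt = q->t_newest - q->t_oldest;

            return dt > 0 ? (double)q->len * q->len / dt : 0;
    }

    int main(void)
    {
            struct ra_queue q[] = {
                    { "slow stream", 32, 0.0, 8.0 },    /*  4 pages/sec */
                    { "fast stream", 64, 0.0, 2.0 },    /* 32 pages/sec */
            };
            int i, victim = 0;

            for (i = 1; i < (int)(sizeof(q) / sizeof(q[0])); i++)
                    if (ease(&q[i]) > ease(&q[victim]))
                            victim = i;

            q[victim].len /= 2;  /* halve the queue that is cheapest to refill */
            printf("halved %s, now %d pages\n", q[victim].name, q[victim].len);
            return 0;
    }

(As expected it picks the fast stream, whose pages are the cheapest to read
back in.)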

As another optimization, we may need to think of pages used
by multiple streams. Think, for instance, of 'make -j' and
header files, or many users ftp'ing down the same file.
Just because one gcc process has read past
a block in a header file, I submit that we are less keen to
drop it if it is in the readahead chain for another. This
would imply that what we actually want to do is keep one
readahead queue per stream, AND keep blocks in (any) readahead
queue on the normal used-once list. However, keep a count
in this list of how many readahead queues the page is in.
Increment this count when the page is read from disk, and
decrement it when the page is read by generic_file_read.
When considering page aging, drop, in order

  inactive pages first that are on no readahead queues
  inactive pages on exactly 1 readahead queue, using above
  inactive pages on >1 readahead queue (not sure what the
    best heuristic is here - perhaps just in order of
    number of queues)
  active pages

[Actually, if you don't care /which/ readahead pages get
dropped, you could do all of this a hell of a lot more
simply, JUST by using the counter to say how many readahead
queues it's in. Increment and Decrement the same way as before,
and use this value to age out inactive pages on no
queues first, then inactive pages on queues, then active
pages - no extra LRU/MRU lists; if you reverse the age
for things on more than one readahead queue you stand a
good chance of dropping recent pages too.]

--
Alex Bligh

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 18:37               ` Oliver Neukum
@ 2001-08-27 19:04                 ` Daniel Phillips
  2001-08-27 19:43                   ` Oliver Neukum
  2001-08-27 19:55                 ` Richard Gooch
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 19:04 UTC (permalink / raw)
  To: Oliver Neukum; +Cc: linux-kernel

On August 27, 2001 08:37 pm, Oliver Neukum wrote:
> Hi,
> 
> >   - Readahead cache is naturally a fifo - new chunks of readahead
> >     are added at the head and unused readahead is (eventually)
> >     culled from the tail.
> 
> do you really want to do this based on pages ? Should you not drop all 
> pages associated with the inode that wasn't touched for the longest
> time ?

Isn't that very much the same as dropping pages from the end of the readahead 
queue?

> If you are streaming dropping all should be no great loss.

The question is, how do you know you're streaming?  Some files are 
read/written many times and some files are accessed randomly.  I'm trying to 
avoid penalizing these admittedly rarer, but still important cases.

--
Daniel



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 16:55               ` David Lang
@ 2001-08-27 18:54                 ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 18:54 UTC (permalink / raw)
  To: David Lang; +Cc: Helge Hafting, linux-kernel

On August 27, 2001 06:55 pm, David Lang wrote:
> with you moving things to the inactive queue both when they are used and
> when they spill over from the readahead queue I think you end up putting too
> much pressure on the inactive queue.

Note: the whole point is to avoid having the readahead queue spill over, by 
throttling readahead.  So the readahead queue should only need to be culled 
due to a rise in page activations, enlarging the active list, and requiring 
the system to rebalance itself.  I don't think you can really talk about 
pressure being exerted on the inactive queue by the readahead queue.  The 
size of the inactive queue doesn't really matter as long as it isn't too 
short to provide a good test of short-term page activity.  (If it gets very 
long then it will automatically shorten itself because the probability of a 
given page on the queue being referenced and rescued goes up.)

> given that the readahead queue will fill almost all memory when things
> start spilling off of it you are needing to free memory, so if you just
> put it on the inactive queue you then have to free an equivalent amount of
> space from the inactive queue to actually make any progress on freeing
> space.

Sure, there is an argument for stripping buffers immediately from culled 
readahead pages and moving them straight to the inactive_clean list instead 
of the inactive_dirty list.  In effect, culled readahead pages would then 
rank below aged-to-zero and used-once pages.  It's too subtle a difference 
for me to see any clear advantage one way or the other.  Doing it your way 
would save one queue-move in the lifetime of every culled readahead page.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 14:31             ` Daniel Phillips
  2001-08-27 14:42               ` Alex Bligh - linux-kernel
  2001-08-27 16:55               ` David Lang
@ 2001-08-27 18:37               ` Oliver Neukum
  2001-08-27 19:04                 ` Daniel Phillips
  2001-08-27 19:55                 ` Richard Gooch
  2 siblings, 2 replies; 124+ messages in thread
From: Oliver Neukum @ 2001-08-27 18:37 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

Hi,

>   - Readahead cache is naturally a fifo - new chunks of readahead
>     are added at the head and unused readahead is (eventually)
>     culled from the tail.

do you really want to do this based on pages ? Should you not drop all pages 
associated with the inode that wasn't touched for the longest time ?
If you are streaming dropping all should be no great loss.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 14:31             ` Daniel Phillips
  2001-08-27 14:42               ` Alex Bligh - linux-kernel
@ 2001-08-27 16:55               ` David Lang
  2001-08-27 18:54                 ` Daniel Phillips
  2001-08-27 18:37               ` Oliver Neukum
  2 siblings, 1 reply; 124+ messages in thread
From: David Lang @ 2001-08-27 16:55 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Helge Hafting, linux-kernel

with you moving things to the inactive queue both when they are used and
when they spill over from the readahead queue I think you end up putting too
much pressure on the inactive queue.

given that the readahead queue will fill almost all memory when things
start spilling off of it you are needing to free memory, so if you just
put it on the inactive queue you then have to free an equivalent amount of
space from the inactive queue to actually make any progress on freeing
space.

David Lang



 On Mon, 27 Aug 2001, Daniel Phillips wrote:

> Date: Mon, 27 Aug 2001 16:31:21 +0200
> From: Daniel Phillips <phillips@bonn-fries.net>
> To: Helge Hafting <helgehaf@idb.hist.no>, linux-kernel@vger.kernel.org
> Subject: Re: [resent PATCH] Re: very slow parallel read performance
>
> On August 27, 2001 09:08 am, Helge Hafting wrote:
> > Daniel Phillips wrote:
> > >
> > > On August 24, 2001 09:02 pm, Rik van Riel wrote:
> > > > I guess in the long run we should have automatic collapse
> > > > of the readahead window when we find that readahead window
> > > > thrashing is going on, in the short term I think it is
> > > > enough to have the maximum readahead size tunable in /proc,
> > > > like what is happening in the -ac kernels.
> > >
> > > Yes, and the most effective way to detect that the readahead window is too
> > > high is by keeping a history of recently evicted pages.  When we find
> > > ourselves re-reading pages that were evicted before ever being used we
> > > know exactly what the problem is.
> >
> > Counting how much we are reading ahead and comparing with total RAM
> > (or total cache) might also be an idea.  We may then read ahead
> > a lot for those who run a handful of processes, and
> > do smaller readahead for those that run thousands of processes.
>
> Yes.  In fact I was just sitting down to write up a design for a new
> readahead-handling strategy that incorporates this idea.  Here are my design
> notes so far:
>
>   - Readahead cache should be able to expand to fill (almost) all
>     memory in the absence of other activity.
>
>   - Readahead pages have higher priority than inactive pages, lower
>     than active.
>
>   - Readahead cache is naturally a fifo - new chunks of readahead
>     are added at the head and unused readahead is (eventually)
>     culled from the tail.
>
>   - Readahead cache is important enough to get its own lru list.
>     We know it's a fifo so don't have to waste cycles scanning/aging.
>     Having a distinct list makes the accounting trivial, vs keeping
>     readahead on the active list for example.
>
>   - A new readahead page starts on the readahead queue.  When used
>     (by generic_file_read) the readahead page moves to the inactive
>     queue and becomes a used-once page (i.e., low priority).  If a
>     readahead page reaches the tail of the readahead queue it may
>     be culled by moving it to the inactive queue.
>
>   - When the readahead cache fills up past its falloff limit we
>     will reduce amount of readahead submitted proportionally by the
>     amount the readahead cache exceeds the falloff limit.  At the
>     cutoff limit, no new readahead is submitted.
>
>   - At each try_to_free_pages step the readahead queue is culled
>     proportionally by the amount it exceeds its falloff limit.  A
>     tuning parameter controls the rate at which readahead is
>     culled vs new readahead submissions (is there a better way?).
>
>   - The cutoff limit is adjusted periodically according to the size
>     of the active list, implementing the idea that active pages
>     take priority over readahead pages.
>
>   - The falloff limit is set proportionally to the cutoff limit.
>
>   - The mechanism operates without user intervention, though there
>     are several points at which proportional factors could be
>     exposed as tuning parameters.
>
> The overarching idea here is that we can pack more readahead into memory by
> managing it carefully, in such a way that we do not often discard unused
> readahead pages.  In other words, we do as much readahead as possible but
> avoid thrashing.
>
> The advantages seem clear enough that I'll proceed to an implementation
> without further ado.
>
> --
> Daniel
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 15:14                 ` Rik van Riel
@ 2001-08-27 16:04                   ` Daniel Phillips
       [not found]                   ` <Pine.LNX.4.33L.0108271213370.5646-100000@imladris.rielhome.conectiva>
  1 sibling, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 16:04 UTC (permalink / raw)
  To: Rik van Riel, Alex Bligh - linux-kernel; +Cc: Helge Hafting, linux-kernel

On August 27, 2001 05:14 pm, Rik van Riel wrote:
> On Mon, 27 Aug 2001, Alex Bligh - linux-kernel wrote:
> 
> > A nit: I think it's a MRU list you want.
> 
> Absolutely, however ...
> 
> > If you are reading
> > ahead (let's have caps for a page that has been used for reading,
> > as well as read from the disk, and lowercase for read-ahead that
> > has not been used):
> > 	ABCDefghijklmnopq
> >              |            |
> >             read         disk
> > 	   ptr          head
> > and you want to reclaim memory, you want to drop (say) 'pq'
> > to get
> > 	ABCDefghijklmno
> > for two reasons: firstly because 'efg' etc. are most likely
> > to be used NEXT, and secondly because the diskhead is nearer
> > 'pq' when you (inevitably) have to read it again.
> 
> This is NOT MRU, since p and q have not been used yet.
> In this example you really want to drop D and C instead.

What we mean by "drop" is "deactivate".

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 14:42               ` Alex Bligh - linux-kernel
  2001-08-27 15:14                 ` Rik van Riel
@ 2001-08-27 16:02                 ` Daniel Phillips
  2001-08-27 19:36                   ` Alex Bligh - linux-kernel
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 16:02 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel, Helge Hafting, linux-kernel
  Cc: Alex Bligh - linux-kernel

On August 27, 2001 04:42 pm, Alex Bligh - linux-kernel wrote:
> --On Monday, 27 August, 2001 4:31 PM +0200 Daniel Phillips 
> <phillips@bonn-fries.net> wrote:
> 
> >   - Readahead cache is important enough to get its own lru list.
> >     We know it's a fifo so don't have to waste cycles scanning/aging.
> >     Having a distinct list makes the accounting trivial, vs keeping
> >     readahead on the active list for example.
> 
> A nit: I think it's a MRU list you want. If you are reading
> ahead (let's have caps for a page that has been used for reading,
> as well as read from the disk, and lowercase for read-ahead that
> has not been used):
> 	ABCDefghijklmnopq
>              |            |
>             read         disk
> 	   ptr          head
> and you want to reclaim memory, you want to drop (say) 'pq'
> to get
> 	ABCDefghijklmno
> for two reasons: firstly because 'efg' etc. are most likely
> to be used NEXT, and secondly because the diskhead is nearer
> 'pq' when you (inevitably) have to read it again.

Good point.  Even with a fifo queue we can deal with this nicely by modifying 
the insertion step to scan forward past other pages of the same file.  So the 
readahead pages end up being inserted in reverse order locally, while 
chunkwise we still have a fifo.

> This seems even more important when considering multiple streams,
> as if you drop the least recently 'used' (i.e. read in from disk),
> you will instantly create a thrashing storm.

The object is to avoid getting into the position of having to drop readahead 
pages in the first place, by properly throttling the readahead.  When we do 
have to drop readahead it's because the active list expanded.  Hopefully we 
will stabilize soon with a shorter readahead list.  Yes, it may well be 
better to drop from the head of the queue instead of the tail because the 
dropped pages will come from a smaller set of files.  On the other hand, we 
will penalize faster streams that way.  Furthermore, sometimes readahead 
pages may never be used in which case we would keep them forever.

> And an idea: when dropping read-ahead pages, you might be better
> dropping many readahead pages for a single stream, rather than
> hitting them all equally, else they will tend to run out of
> readahead in sync.

Yes, this requirement is satisfied by the arrangement described.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 14:42               ` Alex Bligh - linux-kernel
@ 2001-08-27 15:14                 ` Rik van Riel
  2001-08-27 16:04                   ` Daniel Phillips
       [not found]                   ` <Pine.LNX.4.33L.0108271213370.5646-100000@imladris.rielhome.conectiva>
  2001-08-27 16:02                 ` Daniel Phillips
  1 sibling, 2 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-27 15:14 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel; +Cc: Daniel Phillips, Helge Hafting, linux-kernel

On Mon, 27 Aug 2001, Alex Bligh - linux-kernel wrote:

> A nit: I think it's a MRU list you want.

Absolutely, however ...

> If you are reading
> ahead (let's have caps for a page that has been used for reading,
> as well as read from the disk, and lowercase for read-ahead that
> has not been used):
> 	ABCDefghijklmnopq
>              |            |
>             read         disk
> 	   ptr          head
> and you want to reclaim memory, you want to drop (say) 'pq'
> to get
> 	ABCDefghijklmno
> for two reasons: firstly because 'efg' etc. are most likely
> to be used NEXT, and secondly because the diskhead is nearer
> 'pq' when you (inevitably) have to read it again.

This is NOT MRU, since p and q have not been used yet.
In this example you really want to drop D and C instead.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27 14:31             ` Daniel Phillips
@ 2001-08-27 14:42               ` Alex Bligh - linux-kernel
  2001-08-27 15:14                 ` Rik van Riel
  2001-08-27 16:02                 ` Daniel Phillips
  2001-08-27 16:55               ` David Lang
  2001-08-27 18:37               ` Oliver Neukum
  2 siblings, 2 replies; 124+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-08-27 14:42 UTC (permalink / raw)
  To: Daniel Phillips, Helge Hafting, linux-kernel; +Cc: Alex Bligh - linux-kernel

--On Monday, 27 August, 2001 4:31 PM +0200 Daniel Phillips 
<phillips@bonn-fries.net> wrote:

>   - Readahead cache is important enough to get its own lru list.
>     We know it's a fifo so don't have to waste cycles scanning/aging.
>     Having a distinct list makes the accounting trivial, vs keeping
>     readahead on the active list for example.

A nit: I think it's a MRU list you want. If you are reading
ahead (let's have caps for a page that has been used for reading,
as well as read from the disk, and lowercase for read-ahead that
has not been used):
	ABCDefghijklmnopq
             |            |
            read         disk
	   ptr          head
and you want to reclaim memory, you want to drop (say) 'pq'
to get
	ABCDefghijklmno
for two reasons: firstly because 'efg' etc. are most likely
to be used NEXT, and secondly because the diskhead is nearer
'pq' when you (inevitably) have to read it again.

This seems even more important when considering multiple streams,
as if you drop the least recently 'used' (i.e. read in from disk),
you will instantly create a thrashing storm.

And an idea: when dropping read-ahead pages, you might be better
dropping many readahead pages for a single stream, rather than
hitting them all equally, else they will tend to run out of
readahead in sync.

--
Alex Bligh

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27  7:08           ` Helge Hafting
@ 2001-08-27 14:31             ` Daniel Phillips
  2001-08-27 14:42               ` Alex Bligh - linux-kernel
                                 ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27 14:31 UTC (permalink / raw)
  To: Helge Hafting, linux-kernel

On August 27, 2001 09:08 am, Helge Hafting wrote:
> Daniel Phillips wrote:
> > 
> > On August 24, 2001 09:02 pm, Rik van Riel wrote:
> > > I guess in the long run we should have automatic collapse
> > > of the readahead window when we find that readahead window
> > > thrashing is going on, in the short term I think it is
> > > enough to have the maximum readahead size tunable in /proc,
> > > like what is happening in the -ac kernels.
> > 
> > Yes, and the most effective way to detect that the readahead window is too
> > high is by keeping a history of recently evicted pages.  When we find
> > ourselves re-reading pages that were evicted before ever being used we 
> > know exactly what the problem is.
> 
> Counting how much we are reading ahead and comparing with total RAM
> (or total cache) might also be an idea.  We may then read ahead
> a lot for those who run a handful of processes, and
> do smaller readahead for those that run thousands of processes.

Yes.  In fact I was just sitting down to write up a design for a new 
readahead-handling strategy that incorporates this idea.  Here are my design 
notes so far:

  - Readahead cache should be able to expand to fill (almost) all
    memory in the absence of other activity.

  - Readahead pages have higher priority than inactive pages, lower
    than active.

  - Readahead cache is naturally a fifo - new chunks of readahead
    are added at the head and unused readahead is (eventually)
    culled from the tail.

  - Readahead cache is important enough to get its own lru list.
    We know it's a fifo so don't have to waste cycles scanning/aging.
    Having a distinct list makes the accounting trivial, vs keeping
    readahead on the active list for example.

  - A new readahead page starts on the readahead queue.  When used
    (by generic_file_read) the readahead page moves to the inactive
    queue and becomes a used-once page (i.e., low priority).  If a
    readahead page reaches the tail of the readahead queue it may
    be culled by moving it to the inactive queue.

  - When the readahead cache fills up past its falloff limit we
    will reduce amount of readahead submitted proportionally by the
    amount the readahead cache exceeds the falloff limit.  At the
    cutoff limit, no new readahead is submitted.

  - At each try_to_free_pages step the readahead queue is culled
    proportionally by the amount it exceeds its falloff limit.  A
    tuning parameter controls the rate at which readahead is 
    culled vs new readahead submissions (is there a better way?).

  - The cutoff limit is adjusted periodically according to the size
    of the active list, implementing the idea that active pages
    take priority over readahead pages.

  - The falloff limit is set proportionally to the cutoff limit.

  - The mechanism operates without user intervention, though there
    are several points at which proportional factors could be
    exposed as tuning parameters.

The overarching idea here is that we can pack more readahead into memory by 
managing it carefully, in such a way that we do not often discard unused 
readahead pages.  In other words, we do as much readahead as possible but 
avoid thrashing.
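To pin down the falloff/cutoff throttling in the notes above, the submission
side could be as simple as this -- purely illustrative, limits and units
invented:

    #include <stdio.h>

    /*
     * Full readahead below the falloff limit, none at the cutoff limit,
     * scaled linearly in between.
     */
    static int throttled_readahead(int want, long ra_pages,
                                   long falloff, long cutoff)
    {
            if (ra_pages <= falloff)
                    return want;    /* plenty of room */
            if (ra_pages >= cutoff)
                    return 0;       /* readahead cache full: submit nothing */
            return (int)(want * (cutoff - ra_pages) / (cutoff - falloff));
    }

    int main(void)
    {
            printf("%d\n", throttled_readahead(32, 1000, 2000, 4000));  /* 32 */
            printf("%d\n", throttled_readahead(32, 3000, 2000, 4000));  /* 16 */
            printf("%d\n", throttled_readahead(32, 4000, 2000, 4000));  /*  0 */
            return 0;
    }

The culling side would use the same proportions in the other direction.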

The advantages seem clear enough that I'll proceed to an implementation
without further ado.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 22:29         ` Daniel Phillips
  2001-08-24 23:10           ` Rik van Riel
@ 2001-08-27  7:08           ` Helge Hafting
  2001-08-27 14:31             ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Helge Hafting @ 2001-08-27  7:08 UTC (permalink / raw)
  To: Daniel Phillips, linux-kernel

Daniel Phillips wrote:
> 
> On August 24, 2001 09:02 pm, Rik van Riel wrote:
> > I guess in the long run we should have automatic collapse
> > of the readahead window when we find that readahead window
> > thrashing is going on, in the short term I think it is
> > enough to have the maximum readahead size tunable in /proc,
> > like what is happening in the -ac kernels.
> 
> Yes, and the most effective way to detect that the readahead window is too
> high is by keeping a history of recently evicted pages.  When we find
> ourselves re-reading pages that were evicted before ever being used we know
> exactly what the problem is.

Counting how much we are reading ahead and comparing with total RAM
(or total cache) might also be an idea.  We may then read ahead
a lot for those who run a handful of processes, and
do smaller readahead for those that run thousands of processes.
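Something as crude as the following would already capture that -- purely
illustrative, the divisor and the cap are invented:

    #include <stdio.h>

    /* per-stream readahead budget, scaled by available cache and by how
     * many streams are competing for it */
    static long per_stream_readahead(long cache_pages, int streams, long max_ra)
    {
            long ra = cache_pages / (4 * (streams > 0 ? streams : 1));

            return ra > max_ra ? max_ra : ra;
    }

    int main(void)
    {
            printf("%ld\n", per_stream_readahead(100000, 5, 512));     /* a handful */
            printf("%ld\n", per_stream_readahead(100000, 2000, 512));  /* thousands */
            return 0;
    }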

Helge Hafting

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27  2:03 Rick Hohensee
@ 2001-08-27  2:52 ` Keith Owens
  2001-08-28 17:52 ` Kai Henningsen
  1 sibling, 0 replies; 124+ messages in thread
From: Keith Owens @ 2001-08-27  2:52 UTC (permalink / raw)
  To: Rick Hohensee; +Cc: linux-kernel

On Sun, 26 Aug 2001 22:03:54 -0400 (EDT), 
Rick Hohensee <humbubba@smarty.smart.net> wrote:
>I believe the Committee for the Preservation of Welsh Poetry are pretty
>settled on the -ac tree.

That reminds me, I have to add this config entry to kbuild.

CONFIG_LLANFAIRPWLLGWYNGYLLGOGERYCHWYRNDROBWLLLLANTYSILIOGOGOGOCH
  Use Welsh


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
@ 2001-08-27  2:03 Rick Hohensee
  2001-08-27  2:52 ` Keith Owens
  2001-08-28 17:52 ` Kai Henningsen
  0 siblings, 2 replies; 124+ messages in thread
From: Rick Hohensee @ 2001-08-27  2:03 UTC (permalink / raw)
  To: yodaiken; +Cc: linux-kernel

Yodaiken
I'll have to wait to display more ignorance (on this subject)
until next week. Off to
LinuxWorld SF - rushing in where Alan Cox is afraid to go!

And OT: the Embedded Linux Consortium is considering standards
for Embedded Linux, as is the Emblix Consortium (Japan), the Open Group,
and, for all I know, the UN, the NRA, and the Committee for the
Preservation Welsh Poetry. I'd be
interested in any suggestions, comments, proposals, or witty remarks
I could convey to  the first three of  these august organizations.

moi
I believe the Committee for the Preservation of Welsh Poetry are pretty
settled on the -ac tree. Aren't they doing an audio CD of Alan reciting
the TCP/IP stack sources?

You might mention that embedded Linux is in many ways a reenactment of the
history of Forth, which has many of the same advantages as Linux, foremost
in embedded being  low (nil) unit-cost. Then there's robustness,
mutability, completeness...

Rick Hohensee
                www.
                           cLIeNUX
                                          .com
                                                        humbubba@smart.net

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-27  0:02                                 ` Rik van Riel
@ 2001-08-27  0:42                                   ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-27  0:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: pcg, Roger Larsson, linux-kernel

On August 27, 2001 02:02 am, Rik van Riel wrote:
> On Sun, 26 Aug 2001, Daniel Phillips wrote:
> 
> > > His kernel is running completely out of memory, with no
> > > swap space configured.
> >
> > No, he's streaming mp3's:
> 
> 1) these two are not exclusive
> 2) he clearly wrote that he was running out of memory,
>    though this was in a different email thread

So you're confident there's no problem here, even though all he's doing is a 
kernel build and playing mp3's with 24 meg available for the job.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 23:24                                   ` Daniel Phillips
  2001-08-26 23:24                                     ` Russell King
@ 2001-08-27  0:07                                     ` Rik van Riel
  1 sibling, 0 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-27  0:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Russell King, pcg, Roger Larsson, linux-kernel

On Mon, 27 Aug 2001, Daniel Phillips wrote:

> This is a nice dump format.  One thing that would be very helpful is
> the page executable flag, another would be the writable flag.  The
> 4687 anonymous pages are the elephant under the rug, but we don't know
> how they break down between executable (evictable) and otherwise.

Anonymous pages and executable pages are mutually exclusive.

Executable pages are ALWAYS mapped from a file, and thus
never anonymous.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 20:08                               ` Daniel Phillips
  2001-08-26 22:33                                 ` Russell King
@ 2001-08-27  0:02                                 ` Rik van Riel
  2001-08-27  0:42                                   ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-27  0:02 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Daniel Phillips wrote:

> > His kernel is running completely out of memory, with no
> > swap space configured.
>
> No, he's streaming mp3's:

1) these two are not exclusive
2) he clearly wrote that he was running out of memory,
   though this was in a different email thread

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 21:07                                     ` Daniel Phillips
  2001-08-26 22:12                                       ` Rik van Riel
@ 2001-08-26 23:24                                       ` Lehmann 
  1 sibling, 0 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-26 23:24 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel

On Sun, Aug 26, 2001 at 11:07:07PM +0200, Daniel Phillips <phillips@bonn-fries.net> wrote:
> To recap, you made these changes:
> 
>   - Changed to -ac
>   - Set max-readahead through proc
> 
> Anything else?  Did you change MAX_SECTORS?

no, I tinkered a lot with my server (freeing up memory), which obviously
helped somewhat, and increased socket buffers to 256k max., making i/o
more chunky and thus more efficient. I also use more than one reader
thread: under Linus' kernels more than one thread made the situation
worse; on the -ac kernels they seem to slightly improve throughput, up to a
point.

> > no longer thrashing. And linux does the job nicely ;)
> Good, but should we rest on our laurels now?

well, linux-2.4 has a lot of dark corners that need improving, but the
machine no longer totally misbehaves (4 or 5 mb/s is debatable, but
2mb/s with more memory and higher load is not ;). it's now acceptable to
me ;)


-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 23:24                                   ` Daniel Phillips
@ 2001-08-26 23:24                                     ` Russell King
  2001-08-27  0:07                                     ` Rik van Riel
  1 sibling, 0 replies; 124+ messages in thread
From: Russell King @ 2001-08-26 23:24 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, pcg, Roger Larsson, linux-kernel

On Mon, Aug 27, 2001 at 01:24:23AM +0200, Daniel Phillips wrote:
> This is a nice dump format.  One thing that would be very helpful is the page 
> executable flag, another would be the writable flag.  The 4687 anonymous 
> pages are the elephant under the rug, but we don't know how they break down 
> between executable (evictable) and otherwise.

You can't get at that information without doing a complete scan for each
page table in all tasks, unless you want to dump out the page tables.
This could be done separately, of course.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 22:33                                 ` Russell King
@ 2001-08-26 23:24                                   ` Daniel Phillips
  2001-08-26 23:24                                     ` Russell King
  2001-08-27  0:07                                     ` Rik van Riel
  0 siblings, 2 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 23:24 UTC (permalink / raw)
  To: Russell King; +Cc: Rik van Riel, pcg, Roger Larsson, linux-kernel

On August 27, 2001 12:33 am, Russell King wrote:
> Of the 8192 pages in his system, there are 4687 anonymous pages like the
> above, 466 slab pages, 636 reserved pages, 376 unused pages (for reserved
> allocation) and 2000 ramdisk pages.  That leaves 27 pages, which are the
> 26 inactive clean pages (from the above), plus one page cache page.
> 
> Feel free to use this information to formulate new strategies on improving
> the VM.
> 
> [note that it takes around 1 minute to get 32MB-worth of information out
> of a serial console at 38400 baud.  You don't want to do this on GB boxes].

This is a nice dump format.  One thing that would be very helpful is the page 
executable flag, another would be the writable flag.  The 4687 anonymous 
pages are the elephant under the rug, but we don't know how they break down 
between executable (evictable) and otherwise.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 20:08                               ` Daniel Phillips
@ 2001-08-26 22:33                                 ` Russell King
  2001-08-26 23:24                                   ` Daniel Phillips
  2001-08-27  0:02                                 ` Rik van Riel
  1 sibling, 1 reply; 124+ messages in thread
From: Russell King @ 2001-08-26 22:33 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, pcg, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 10:08:07PM +0200, Daniel Phillips wrote:
> No, he's streaming mp3's:

I'd like to take the time to share some extra information on the VM
state of Nico's system.  It consists of a log of the state of every
single page in his system, which was taken after he hit the point of
no progress, and consists of the following output:

<sysrq-m output>
Zone: free 255, inactive clean 26, inactive dirty 0
      min 255, low 510, high 765
<lots of pages>
c04fd000:  0 00000002 00000000 [----] -- [---]
c04fe000:  1 00000002 00000000 [--s-] -- [---]
c04ff000:  1 00000000 c180f0f8 [----] -- [-c-]
c0500000:  1 00000002 00000000 [--s-] -- [---]
c0501000:  1 00000005 00000000 [----] -- [---]
c0502000:  1 00000002 00000000 [--s-] -- [---]
<lots of pages>
c05af000:  1 00000005 00000000 [----] -- [---]
c05b0000:  1 00000005 00000000 [----] -- [---]
c05b1000:  1 00000005 00000000 [----] -- [---]
c05b2000:  1 00000011 00000000 [----] -- [---]
c05b3000:  1 00000005 00000000 [----] -- [---]
c05b4000:  1 00000008 00000000 [----] -- [---]
<lots of pages>

In order, that is:
  virtual page address
  page->count
  page->age
  page->mapping

Then 4 flags:
  R - reserved
  S - swap cache
  s - slab page
  r - (not really a flag) ramdisk page

Then 2 flags:
  r - referenced
  D - dirty

Then 3 list flags:
  a - active list
  d - inactive dirty list
  c - inactive clean list

It's a little big (380K), so I won't post it here, but I'll provide a
summary of the state:

Of the 8192 pages in his system, there are 4687 anonymous pages like the
above, 466 slab pages, 636 reserved pages, 376 unused pages (for reserved
allocation) and 2000 ramdisk pages.  That leaves 27 pages, which are the
26 inactive clean pages (from the above), plus one page cache page.

Feel free to use this information to formulate new strategies on improving
the VM.

[note that it takes around 1 minute to get 32MB-worth of information out
of a serial console at 38400 baud.  You don't want to do this on GB boxes].

Thanks.

--
Russell King (rmk@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 21:07                                     ` Daniel Phillips
@ 2001-08-26 22:12                                       ` Rik van Riel
  2001-08-26 23:24                                       ` Lehmann 
  1 sibling, 0 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 22:12 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Alan Cox, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Daniel Phillips wrote:

> Good, but should we rest on our laurels now?

If you have an idea on how to improve things or want to
implement one of the proposed ideas, feel free. I'll
happily help think about the stuff and/or test patches.

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 20:26                                   ` Gérard Roudier
@ 2001-08-26 21:20                                     ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 21:20 UTC (permalink / raw)
  To: Gérard Roudier 
  Cc: pcg, Rik van Riel, Alan Cox, Roger Larsson, linux-kernel

On August 26, 2001 10:26 pm, Gérard Roudier wrote:
> On Sun, 26 Aug 2001, Daniel Phillips wrote:
> 
> [...]
> 
> > It should not be being ignored.  This needs to be looked into.  In any event,
> > the max-readahead proc setting is clearly good and needs to be in Linus's
> > tree, otherwise changing the default MAX_READAHEAD requires a recompile.
> > Worse, there is no way at all to specify the kernel's max-readahead for scsi
> > disks - regardless of the fact that scsi disks do their own readahead, the
> > kernel will do its own as well, with no way for the user to turn it off.
> 
> For SCSI disks prefetch tuning you may look into the CACHING page.
> 
> For example, you can tell the drive to stop prefetching as soon a command
> is ready by setting MINIMUM PRE-FETCH to zero.
> 
> Unfortunately, not all SCSI disks allow to tune all the configuration
> parameters of the caching page.

In this case he's using ide.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 19:18                                   ` Lehmann 
@ 2001-08-26 21:07                                     ` Daniel Phillips
  2001-08-26 22:12                                       ` Rik van Riel
  2001-08-26 23:24                                       ` Lehmann 
  0 siblings, 2 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 21:07 UTC (permalink / raw)
  To: pcg; +Cc: Rik van Riel, Alan Cox, Roger Larsson, linux-kernel

On August 26, 2001 09:18 pm, pcg@goof.com ( Marc) (A.) (Lehmann ) wrote:
> On Sun, Aug 26, 2001 at 07:29:43PM +0200, Daniel Phillips wrote:

> > > Now, a question: how does the per-block-device read-ahead fit into
> > > this picture?  Is it being ignored? I fiddled with it (under
> > > 2.4.8pre4) but couldn't see any difference.
> > 
> > It should not be being ignored.  This needs to be looked into.
> 
> so the effective read-ahead is "min(per-block-device, kernel-global)"? if
> yes, then this would be ideal for my case, as usage patterns are strictly
> separated by disk in the machine, so I could get the best of all worlds.

Yes.

> > tree, otherwise changing the default MAX_READAHEAD requires a recompile.  
> 
> I like the proposed ideas of making it autotune ;) I think very large
> readaheads make a lot of sense under normal loads with modern harddisks,
> but not always.

Well, autotuning will require some r&d, plus a settling in period.  In the 
meantime it's quite safe and useful to use a user-set global as you showed, 
and also will be helpful during the development of the automagic version.

> [...]
> Also, this whole thread was very fruitful for that server:
> 
>    929 (929) connections
>    4084385150 bytes written in the last 1044.23861896992 seconds
>    (3911352.3 bytes/s)
> 
> which is almost twice as fast as with my old setup/config. Thanks to all
> who analyzed it, sent patches and gave hints (thttpd gave about 700kb/s
> on the same setup, with only 250 connections, but admittedly only 4k
> read-requests).

To recap, you made these changes:

  - Changed to -ac
  - Set max-readahead through proc

Anything else?  Did you change MAX_SECTORS?

> I believe that, with some tweaking (more memory dedicated to buffers), I
> could go to 4.5 and maybe 5mb/s, but certainly not much higher. And it's
> no longer thrashing. And linux does the job nicely ;)

Good, but should we rest on our laurels now?

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 20:45                     ` Victor Yodaiken
@ 2001-08-26 21:00                       ` Alan Cox
  0 siblings, 0 replies; 124+ messages in thread
From: Alan Cox @ 2001-08-26 21:00 UTC (permalink / raw)
  To: Victor Yodaiken; +Cc: Rik van Riel, Victor Yodaiken, linux-kernel

> And OT: the Embedded Linux Consortium is considering standards
> for Embedded Linux, as is the Emblix Consortium (Japan), the Open Group,
> and, for all I know, the UN, the NRA, and the Committee for the 
> Preservation of Welsh Poetry. I'd be
> interested in any suggestions, comments, proposals, or witty remarks
> I could convey to  the first three of  these august organizations.

With regards to the standards proposed
It is worth stating as everyone knows
That no standard for Linux will succeed
Unless Linus Torvalds has agreed
The community will need to be told
Its designed by the groups as a whole
And everyone will need to be heard
If they seek to build more than a turd

		[submission of the welsh embedded linux poetry project
			Gnome #819]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 20:34                   ` Rik van Riel
@ 2001-08-26 20:45                     ` Victor Yodaiken
  2001-08-26 21:00                       ` Alan Cox
  0 siblings, 1 reply; 124+ messages in thread
From: Victor Yodaiken @ 2001-08-26 20:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Victor Yodaiken, linux-kernel

I'll have to wait to display more ignorance (on this subject)
until next week. Off to
LinuxWorld SF - rushing in where Alan Cox is afraid to go!

And OT: the Embedded Linux Consortium is considering standards
for Embedded Linux, as is the Emblix Consortium (Japan), the Open Group,
and, for all I know, the UN, the NRA, and the Committee for the 
Preservation of Welsh Poetry. I'd be
interested in any suggestions, comments, proposals, or witty remarks
I could convey to  the first three of  these august organizations.


On Sun, Aug 26, 2001 at 05:34:24PM -0300, Rik van Riel wrote:
> On Sun, 26 Aug 2001, Victor Yodaiken wrote:
> > On Sun, Aug 26, 2001 at 04:38:55PM -0300, Rik van Riel wrote:
> > > On Sun, 26 Aug 2001, Victor Yodaiken wrote:
> > >
> > Daniel was suggesting a readahead thread, if I'm not mistaken.
> 
> Ouch, that's about as insane as it gets ;)
> 
> 
> > > > BTW: maybe I'm oversimplifying, but since read-ahead is an optimization
> > > > trading memory space for time, why doesn't it just turn off when there's
> > > > a shortage of free memory?
> > > > 		num_pages = num_requested_pages + (there_is_a_boatload_of_free_space ? readahead : 0)
> > >
> > > When the VM load is high, the last thing you want to do is
> > > shrink the size of your IO operations, this would only lead
> > > to more disk seeks and possibly thrashing.
> >
> > Doesn't this very much depend on why VM load is high and on the
> > kind of I/O load? For example, if your I/O load is already in
> > big chunks or if VM stress is being caused by a bunch of big
> > threads hammering shared data that is in page cache already.
> 
> Processes accessing stuff already in RAM aren't causing
> any VM stress, since all the stuff they need is already
> in RAM.
> 
> As for I/O already being done in big chunks, I'm not sure
> if readahead would have any influence on this situation.
> 
> > At least to me, "thrashing" where the OS is shuffling pages in and
> > out without work getting done is different from "thrashing" where
> > user processes run with suboptimal I/O.
> 
> Actually, "suboptimal I/O" and "shuffling pages without getting
> work done" are pretty similar.
> 
> > > It would be nice to do something similar to TCP window
> > > collapse for readahead, though...
> 
> > That is, failure to use readahead may be caused by memory pressure,
> > scheduling delays, etc - how do you tell the difference between a
> > process that would profit from readahead if the scheduler would let it
> > and one that would not?
> 
> I don't think we'd need to know the difference at all times.
> After all, TCP manages fine without knowing the reason for
> packet loss ;)
> 
> 
> > > IA64: a worthy successor to i860.
> >
> > Not the 432?
> 
> ;)
> 
> Rik
> -- 
> IA64: a worthy successor to i860.
> 
> http://www.surriel.com/		http://distro.conectiva.com/
> 
> Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 18:59             ` Victor Yodaiken
  2001-08-26 19:38               ` Rik van Riel
@ 2001-08-26 20:42               ` Daniel Phillips
  1 sibling, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 20:42 UTC (permalink / raw)
  To: Victor Yodaiken, linux-kernel

Hi, Victor

On August 26, 2001 08:59 pm, Victor Yodaiken wrote:
> On Sun, Aug 26, 2001 at 06:54:55PM +0200, Daniel Phillips wrote:
> > But it's a very interesting idea: instead of performing readahead in 
> > generic_file_read the user thread would calculate the readahead window
> > information and pass it off to a kernel thread dedicated to readahead.
> > This thread can make an informed, global decision on how much IO to
> > submit.  The user thread benefits by avoiding some stalls due to
> > readahead->readpage, as well as avoiding thrashing due to excessive
> > readahead.
> 
> And scheduling gets even more complex as we try to account for work done
> in this thread on behalf of other processes.

We already have kernel threads doing IO work on behalf of other processes, 
bdflush is an example.  Granted, it's output, not input, but is there a 
difference as far as accounting goes?

> And, of course, we have all sorts of wacky merge problems
> 	Process		Kthread
> 	----------------------------
> 	read block 1
> 			schedules to read block 2 readahead
> 	read block 2 
> 	not in cache so
> 	send to ll_rw
> 	get it.
> 	exit
> 			getting through the backlog, don't see block 2 anywhere
> 			so do the readahead not knowing that it's already been
> 			read, used, and discarded

Very unlikely: having been used, the block (page) is simply left on the
inactive queue, not freed.  In any event, the object of the exercise is for 
readahead to run ahead of demand - your example shows what happens when we 
get a traffic jam.

> Sound like it could keep you busy for a while.

The existing code handles this situation just fine.

> BTW: maybe I'm oversimplifying, but since read-ahead is an optimization
> trading memory space for time, why doesn't it just turn off when there's
> a shortage of free memory?
> 		num_pages = num_requested_pages + (there_is_a_boatload_of_free_space ? readahead : 0)

When the system is running under load there's *always* a shortage of free 
memory.  Yes, for sure we need automatic throttling on readahead.  First we 
need a good way of estimating the amount of memory we can reasonably devote 
to readahead.  It's not completely obvious how to do that.  (Look at all the 
difficulty coming up with an accurate way of determining memory_full, a 
similar problem.)

On the way towards coming up with a reliable automatic readahead throttling 
mechanism we can do two really easy, useful things:

  1) Let the user set the per-file limit manually
  2) Automagically cap readahead-in-flight as some user-supplied fraction
     of memory

A port to the linus tree of an -ac patch for (1) was obligingly supplied by 
Craig Hagan in the "very slow parallel read performance" thread.  For (2) 
there's some slight difficulty in accounting accurately for 
readahead-in-flight.  What I'm considering doing at the moment is creating a 
separate lru_list dedicated to readahead pages, then the accounting becomes 
trivial - it's just the length of the list.  At the same time, this provides 
a simple mechanism for elevating the priority of as-yet-unused readahead 
pages over used-once pages, which as Rik helpfully pointed out, allows us to 
pack as much as twice as many readahead pages into cache before we hit the 
thrash point.
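
To make the accounting idea concrete, here is a minimal userspace C sketch
(all names are hypothetical, this is not the actual kernel code): readahead
pages live on their own list, the in-flight count is simply the list length,
and a page drops off the list the first time it is actually used.

#include <stdio.h>

struct ra_page {
	unsigned long index;		/* file page index */
	struct ra_page *next, *prev;
};

static struct ra_page ra_list = { 0, &ra_list, &ra_list };	/* dummy head */
static int ra_count;			/* accounting == list length */

static void ra_add(struct ra_page *p)
{
	p->next = ra_list.next;
	p->prev = &ra_list;
	ra_list.next->prev = p;
	ra_list.next = p;
	ra_count++;
}

/* Called when the reader actually consumes the page: drop it from the
 * readahead list (it would move to the normal used-once lists). */
static void ra_used(struct ra_page *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
	ra_count--;
}

/* Throttle check: readahead may proceed only while the in-flight total
 * stays under a user-supplied page budget. */
static int ra_may_submit(int budget_pages)
{
	return ra_count < budget_pages;
}

int main(void)
{
	struct ra_page p1 = { 1 }, p2 = { 2 };

	ra_add(&p1);
	ra_add(&p2);
	printf("in flight: %d, may submit: %d\n", ra_count, ra_may_submit(2));
	ra_used(&p1);
	printf("in flight: %d, may submit: %d\n", ra_count, ra_may_submit(2));
	return 0;
}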

Welcome to the vm tag-team mudwrestling championships ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 20:05                 ` Victor Yodaiken
@ 2001-08-26 20:34                   ` Rik van Riel
  2001-08-26 20:45                     ` Victor Yodaiken
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 20:34 UTC (permalink / raw)
  To: Victor Yodaiken; +Cc: linux-kernel

On Sun, 26 Aug 2001, Victor Yodaiken wrote:
> On Sun, Aug 26, 2001 at 04:38:55PM -0300, Rik van Riel wrote:
> > On Sun, 26 Aug 2001, Victor Yodaiken wrote:
> >
> Daniel was suggesting a readahead thread, if I'm not mistaken.

Ouch, that's about as insane as it gets ;)


> > > BTW: maybe I'm oversimplifying, but since read-ahead is an optimization
> > > trading memory space for time, why doesn't it just turn off when there's
> > > a shortage of free memory?
> > > 		num_pages = num_requested_pages + (there_is_a_boatload_of_free_space ? readahead : 0)
> >
> > When the VM load is high, the last thing you want to do is
> > shrink the size of your IO operations, this would only lead
> > to more disk seeks and possibly thrashing.
>
> Doesn't this very much depend on why VM load is high and on the
> kind of I/O load? For example, if your I/O load is already in
> big chunks or if VM stress is being caused by a bunch of big
> threads hammering shared data that is in page cache already.

Processes accessing stuff already in RAM aren't causing
any VM stress, since all the stuff they need is already
in RAM.

As for I/O already being done in big chunks, I'm not sure
if readahead would have any influence on this situation.

> At least to me, "thrashing" where the OS is shuffling pages in and
> out without work getting done is different from "thrashing" where
> user processes run with suboptimal I/O.

Actually, "suboptimal I/O" and "shuffling pages without getting
work done" are pretty similar.

> > It would be nice to do something similar to TCP window
> > collapse for readahead, though...

> That is, failure to use readahead may be caused by memory pressure,
> scheduling delays, etc - how do you tell the difference between a
> process that would profit from readahead if the scheduler would let it
> and one that would not?

I don't think we'd need to know the difference at all times.
After all, TCP manages fine without knowing the reason for
packet loss ;)


> > IA64: a worthy successor to i860.
>
> Not the 432?

;)

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 17:29                                 ` Daniel Phillips
                                                     ` (2 preceding siblings ...)
  2001-08-26 19:18                                   ` Lehmann 
@ 2001-08-26 20:26                                   ` Gérard Roudier
  2001-08-26 21:20                                     ` Daniel Phillips
  3 siblings, 1 reply; 124+ messages in thread
From: Gérard Roudier @ 2001-08-26 20:26 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Rik van Riel, Alan Cox, Roger Larsson, linux-kernel



On Sun, 26 Aug 2001, Daniel Phillips wrote:

[...]

> It should not be being ignored.  This needs to be looked into.  In any event,
> the max-readahead proc setting is clearly good and needs to be in Linus's
> tree, otherwise changing the default MAX_READAHEAD requires a recompile.
> Worse, there is no way at all to specify the kernel's max-readahead for scsi
> disks - regardless of the fact that scsi disks do their own readahead, the
> kernel will do its own as well, with no way for the user to turn it off.

For SCSI disks prefetch tuning you may look into the CACHING page.

For example, you can tell the drive to stop prefetching as soon a command
is ready by setting MINIMUM PRE-FETCH to zero.

Unfortunately, not all SCSI disks allow to tune all the configuration
parameters of the caching page.

[...]

  Gérard.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 19:52                             ` Rik van Riel
@ 2001-08-26 20:08                               ` Daniel Phillips
  2001-08-26 22:33                                 ` Russell King
  2001-08-27  0:02                                 ` Rik van Riel
  0 siblings, 2 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 20:08 UTC (permalink / raw)
  To: Rik van Riel; +Cc: pcg, Roger Larsson, linux-kernel

On August 26, 2001 09:52 pm, Rik van Riel wrote:
> On Sun, 26 Aug 2001, Daniel Phillips wrote:
> > On August 26, 2001 08:39 pm, Rik van Riel wrote:
> > > On Sun, 26 Aug 2001, Daniel Phillips wrote:
> > >
> > > > There's an obvious explanation for the high loadavg people are seeing
> > > > when their systems go into thrash mode: when free is exhausted, every
> > > > task that fails to get a block in __alloc_pages will become
> > > > PF_MEMALLOC and start scanning.
> > >
> > > If you ever tested this, you'd know this is not true.
> >
> > Look at this, supplied by Nicolas Pitre in the thread "What version of
> > the kernel fixes these VM issues?":
> 
> His kernel is running completely out of memory, with no
> swap space configured.

No, he's streaming mp3's:

> > > > My test consists in compiling gcc 3.0 while some MP3s are continuously playing
> > > > in the background.  The gcc build goes pretty far along until both the mp3
> > > > player and the gcc build completely jam.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 19:38               ` Rik van Riel
@ 2001-08-26 20:05                 ` Victor Yodaiken
  2001-08-26 20:34                   ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Victor Yodaiken @ 2001-08-26 20:05 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Victor Yodaiken, linux-kernel

On Sun, Aug 26, 2001 at 04:38:55PM -0300, Rik van Riel wrote:
> On Sun, 26 Aug 2001, Victor Yodaiken wrote:
> 
> > And scheduling gets even more complex as we try to account for work
> > done in this thread on behalf of other processes. And, of course, we
> > have all sorts of wacky merge problems
> 
> Actually, readahead is always done by the thread reading
> the data, so this is not an issue.

Daniel was suggesting a readahead thread, if I'm not mistaken.

> 
> > BTW: maybe I'm oversimplifying, but since read-ahead is an optimization
> > trading memory space for time, why doesn't it just turn off when there's
> > a shortage of free memory?
> > 		num_pages = num_requested_pages + (there_is_a_boatload_of_free_space ? readahead : 0)
> 
> When the VM load is high, the last thing you want to do is
> shrink the size of your IO operations, this would only lead
> to more disk seeks and possibly thrashing.

(obviously, I don't know anything about Linux VM, so 
my questions are ignorant)


Doesn't this very much depend on why VM load is high and on the 
kind of I/O load? For example, if your I/O load is already in 
big chunks or if VM stress is being caused by a bunch of big
threads hammering shared data that is in page cache already.
This is what I would guess happens in http servers where some more
seeks might not even show up as a cost - or might be compensated for
by read-ahead on the drive itself. And the OS should still be trying
hard to aggregate swap I/O and I/O for paging. 
At least to me, "thrashing" where the OS is shuffling pages in and
out without work getting done is different from "thrashing" where
user processes run with suboptimal I/O. 
I think the applications people are better off if they know
	If memory fills up,  I/O may slow down because the OS won't
	do readaheads.
than
	The OS will attempt to guess aggregate optimal I/O patterns
	and may get it completely wrong, so that when memory is full
	performance becomes totally unpredictable.


But since I have no numbers this is just a stab in the dark.

> It would be nice to do something similar to TCP window
> collapse for readahead, though...
> 
> This would work by increasing the readahead size every
> time we reach the end of the last readahead window without
> having to re-read data twice and collapsing the readahead
> window if any of the pages we read in have to be read
> twice before we got around to using them.

So  suppose Dave Miller's  computer does
      Process A: request data; sleep;                       
      Process B: fill memory with graphics (high res pictures of sparc 4s. of course)
      Process A wake up, readahead failed (pages seized by graphics); read more; sleep
      Process B fill memory 
      Process A; get data; expand readahead; request readahead
      etc.
That is, failure to use readahead may be caused by memory pressure,
scheduling delays, etc - how do you tell the difference between a
process that would profit from readahead if the scheduler would let
it and one that would not?

> IA64: a worthy successor to i860.

Not the 432?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 19:46                           ` Daniel Phillips
@ 2001-08-26 19:52                             ` Rik van Riel
  2001-08-26 20:08                               ` Daniel Phillips
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 19:52 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Daniel Phillips wrote:
> On August 26, 2001 08:39 pm, Rik van Riel wrote:
> > On Sun, 26 Aug 2001, Daniel Phillips wrote:
> >
> > > There's an obvious explanation for the high loadavg people are seeing
> > > when their systems go into thrash mode: when free is exhausted, every
> > > task that fails to get a block in __alloc_pages will become
> > > PF_MEMALLOC and start scanning.
> >
> > If you ever tested this, you'd know this is not true.
>
> Look at this, supplied by Nicolas Pitre in the thread "What version of
> the kernel fixes these VM issues?":

His kernel is running completely out of memory, with no
swap space configured.

This is very different from a system which is thrashing
because of the IO load.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 18:39                         ` Rik van Riel
@ 2001-08-26 19:46                           ` Daniel Phillips
  2001-08-26 19:52                             ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 19:46 UTC (permalink / raw)
  To: Rik van Riel; +Cc: pcg, Roger Larsson, linux-kernel

On August 26, 2001 08:39 pm, Rik van Riel wrote:
> On Sun, 26 Aug 2001, Daniel Phillips wrote:
> 
> > There's an obvious explanation for the high loadavg people are seeing
> > when their systems go into thrash mode: when free is exhausted, every
> > task that fails to get a block in __alloc_pages will become
> > PF_MEMALLOC and start scanning.
> 
> If you ever tested this, you'd know this is not true.

Look at this, supplied by Nicolas Pitre in the thread "What version of the 
kernel fixes these VM issues?":

> A couple sysrq-P at random intervals shows the CPU looping in the following
> functions:
> 
> PC value	System.map
> --------	----------
> c0040d84	zone_inactive_plenty
> c0041024	try_to_swap_out
> c00216e0	cpu_sa1100_cache_clean_invalidate_range
> c00216d0	cpu_sa1100_cache_clean_invalidate_range
> c0041304	swap_out_mm
> c0041168	swap_out_pmd
> c0044324	__get_swap_page
> c0040d60	zone_inactive_plenty
> c0041128	swap_out_pmd
> c0040fec	try_to_swap_out

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 18:59             ` Victor Yodaiken
@ 2001-08-26 19:38               ` Rik van Riel
  2001-08-26 20:05                 ` Victor Yodaiken
  2001-08-26 20:42               ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 19:38 UTC (permalink / raw)
  To: Victor Yodaiken; +Cc: linux-kernel

On Sun, 26 Aug 2001, Victor Yodaiken wrote:

> And scheduling gets even more complex as we try to account for work
> done in this thread on behalf of other processes. And, of course, we
> have all sorts of wacky merge problems

Actually, readahead is always done by the thread reading
the data, so this is not an issue.

> BTW: maybe I'm oversimplifying, but since read-ahead is an optimization
> trading memory space for time, why doesn't it just turn off when there's
> a shortage of free memory?
> 		num_pages = num_requested_pages + (there_is_a_boatload_of_free_space ? readahead : 0)

When the VM load is high, the last thing you want to do is
shrink the size of your IO operations, this would only lead
to more disk seeks and possibly thrashing.

It would be nice to do something similar to TCP window
collapse for readahead, though...

This would work by increasing the readahead size every
time we reach the end of the last readahead window without
having to re-read data twice and collapsing the readahead
window if any of the pages we read in have to be read
twice before we got around to using them.
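
A minimal sketch of that rule in C, with made-up names and constants
(nothing here is taken from the kernel): grow the window after a round of
readahead that was fully used, collapse it as soon as a re-read is seen.

#include <stdio.h>

#define RA_MIN   4		/* pages */
#define RA_MAX 128		/* pages */

static int ra_window = RA_MIN;

/* Feedback after each readahead round: reread_seen is nonzero if any page
 * from the last window had to be read twice before it was used. */
static void ra_window_feedback(int reread_seen)
{
	if (reread_seen) {
		ra_window /= 2;		/* collapse, TCP-style */
		if (ra_window < RA_MIN)
			ra_window = RA_MIN;
	} else {
		ra_window *= 2;		/* the whole window was useful: grow */
		if (ra_window > RA_MAX)
			ra_window = RA_MAX;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 6; i++) {	/* six useful rounds... */
		ra_window_feedback(0);
		printf("window: %d pages\n", ra_window);
	}
	ra_window_feedback(1);		/* ...then one re-read */
	printf("after a re-read: %d pages\n", ra_window);
	return 0;
}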

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 16:54           ` Daniel Phillips
  2001-08-26 18:59             ` Victor Yodaiken
@ 2001-08-26 19:31             ` Lehmann 
  1 sibling, 0 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-26 19:31 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 06:54:55PM +0200, Daniel Phillips <phillips@bonn-fries.net> wrote:
> But it's a very interesting idea: instead of performing readahead in 
> generic_file_read the user thread would calculate the readahead window

More basic, less interesting but far more useful would be real aio, since
this would get rid of hundreds of threads that need manual killing when the
server crashes during development and keep the accept socket open ;)

(But it seems we will have sth. like that in 2.6, so I am generally happy
;)

> submit.  The user thread benefits by avoiding some stalls due to
> readahead->readpage, as well as avoiding thrashing due to excessive
> readahead.

In an ideal world, I would not need user buffers as well and could just
tell the kernel "copy from file to network", but that requires too much
kernel tinkery I believe (although, when there is generic aio in the
kernel there might some day be a aio_sendfile ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 17:29                                 ` Daniel Phillips
  2001-08-26 17:37                                   ` Craig I. Hagan
  2001-08-26 18:56                                   ` Rik van Riel
@ 2001-08-26 19:18                                   ` Lehmann 
  2001-08-26 21:07                                     ` Daniel Phillips
  2001-08-26 20:26                                   ` Gérard Roudier
  3 siblings, 1 reply; 124+ messages in thread
From: Lehmann  @ 2001-08-26 19:18 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Alan Cox, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 07:29:43PM +0200, Daniel Phillips <phillips@bonn-fries.net> wrote:
> Yes, this probably points at a bug in linus's tree.  This needs more digging.
> You're streaming the mp3's over the net or from your disk?

at home I just do the equivalent of "mpg123 <file>". I often copy around
100-300MB video files, often many of them, and the linus' kernels
seem to have better throughput, but that's not feasible if you have to wait
until the transfer ends (X often becomes unusable interactively).

> > if not I will test it at a later time. Now, a question: how does the
> > per-block-device read-ahead fit into this picture?  Is it being ignored? I 
> > fiddled with it (under 2.4.8pre4) but couldn't see any difference.
> 
> It should not be being ignored.  This needs to be looked into.

so the effective read-ahead is "min(per-block-device, kernel-global)"? if
yes, then this would be ideal for my case, as usage patterns are strictly
separated by disk in the machine, so I could get the best of all worlds.

> tree, otherwise changing the default MAX_READAHEAD requires a recompile.  

I like the proposed ideas of making it autotune ;) I think very large
readaheads make a lot of sense under normal loads with modern harddisks,
but not always.

> The reason for that is still unclear.  I realize you're testing this under 
> live load (you're a brave man)

Well, I don't get paid for the service and no company or individual
depends on it, so it's purely my personal desire to provide as good
service as possible to my users. It's also a great challenge to make it
work at all ;)

Also, this whole thread was very fruitful for that server:

   929 (929) connections
   4084385150 bytes written in the last 1044.23861896992 seconds
   (3911352.3 bytes/s)

which is almost twice as fast as with my old setup/config. Thanks to all
who analyzed it, sent patches and gave hints (thttpd gave about 700kb/s
on the same setup, with only 250 connections, but admittedly only 4k
read-requests).

I believe that, with some tweaking (more memory dedicated to buffers), I
could go to 4.5 and maybe 5mb/s, but certainly not much higher. And it's
no longer thrashing. And linux does the job nicely ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 16:54           ` Daniel Phillips
@ 2001-08-26 18:59             ` Victor Yodaiken
  2001-08-26 19:38               ` Rik van Riel
  2001-08-26 20:42               ` Daniel Phillips
  2001-08-26 19:31             ` Lehmann 
  1 sibling, 2 replies; 124+ messages in thread
From: Victor Yodaiken @ 2001-08-26 18:59 UTC (permalink / raw)
  To: linux-kernel

On Sun, Aug 26, 2001 at 06:54:55PM +0200, Daniel Phillips wrote:
> But it's a very interesting idea: instead of performing readahead in 
> generic_file_read the user thread would calculate the readahead window
> information and pass it off to a kernel thread dedicated to readahead.
> This thread can make an informed, global decision on how much IO to
> submit.  The user thread benefits by avoiding some stalls due to
> readahead->readpage, as well as avoiding thrashing due to excessive
> readahead.

And scheduling gets even more complex as we try to account for work done
in this thread on behalf of other processes. And, of course, we have
all sorts of wacky merge problems
	Process		Kthread
	----------------------------
	read block 1
			schedules to read block 2 readahead
	read block 2 
	not in cache so
	send to ll_rw
	get it.
	exit
			getting through the backlog, don't see block 2 anywhere
			so do the readahead not knowing that it's already been
			read, used, and discarded
	
Sound like it could keep you busy for a while.


BTW: maybe I'm oversimplifying, but since read-ahead is an optimization
trading memory space for time, why doesn't it just turn off when there's
a shortage of free memory?
		num_pages = num_requested_pages + (there_is_a_boatload_of_free_space ? readahead : 0)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 17:29                                 ` Daniel Phillips
  2001-08-26 17:37                                   ` Craig I. Hagan
@ 2001-08-26 18:56                                   ` Rik van Riel
  2001-08-26 19:18                                   ` Lehmann 
  2001-08-26 20:26                                   ` Gérard Roudier
  3 siblings, 0 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 18:56 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Alan Cox, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Daniel Phillips wrote:
> On August 26, 2001 04:49 am, pcg@goof.com ( Marc) (A.) (Lehmann ) wrote:

> > Anyway, I compiled and booted into linux-2.4.8-ac9. I used ac8 on my
> > desktop machines and was not pleased with absolute performance but, unlike
> > the linus' series, I can listen to mp3's while working which was the
> > killer feature for me ;)
>
> Yes, this probably points at a bug in linus's tree.

"It works, it must be a bug!"

> > So the ac9 kernel seems to work much better (than the linus' series),
> > although the number of connections was below the critical limit. I'll
> > check this when I get higher loads again.
>
> The reason for that is still unclear.

I've tried to explain it to you about 10 times now.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 16:55                       ` Daniel Phillips
@ 2001-08-26 18:39                         ` Rik van Riel
  2001-08-26 19:46                           ` Daniel Phillips
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 18:39 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Daniel Phillips wrote:

> There's an obvious explanation for the high loadavg people are seeing
> when their systems go into thrash mode: when free is exhausted, every
> task that fails to get a block in __alloc_pages will become
> PF_MEMALLOC and start scanning.

If you ever tested this, you'd know this is not true.

In almost all cases where the system is thrashing
tasks are waiting for the data they need to be read
in from disk.

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 17:29                                 ` Daniel Phillips
@ 2001-08-26 17:37                                   ` Craig I. Hagan
  2001-08-26 18:56                                   ` Rik van Riel
                                                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 124+ messages in thread
From: Craig I. Hagan @ 2001-08-26 17:37 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: pcg, Rik van Riel, Alan Cox, Roger Larsson, linux-kernel

> live load (you're a brave man) but let's try bringing the -ac max-readahead
> patch across and try it in 2.4.9.

here it is, pardon the fuzz in some spots (where i just left it alone)

diff -ur linux-clean/drivers/md/md.c linux/drivers/md/md.c
--- linux-clean/drivers/md/md.c	Fri May 25 09:48:49 2001
+++ linux/drivers/md/md.c	Fri Jun 15 02:30:48 2001
@@ -3291,7 +3291,7 @@
 	/*
 	 * Tune reconstruction:
 	 */
-	window = MAX_READAHEAD*(PAGE_SIZE/512);
+	window = vm_max_readahead*(PAGE_SIZE/512);
 	printk(KERN_INFO "md: using %dk window, over a total of %d blocks.\n",window/2,max_sectors/2);

 	atomic_set(&mddev->recovery_active, 0);
diff -ur linux-clean/include/linux/blkdev.h linux/include/linux/blkdev.h
--- linux-clean/include/linux/blkdev.h	Fri May 25 18:01:40 2001
+++ linux/include/linux/blkdev.h	Fri Jun 15 02:23:22 2001
@@ -183,10 +183,6 @@

 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

-/* read-ahead in pages.. */
-#define MAX_READAHEAD	31
-#define MIN_READAHEAD	3
-
 #define blkdev_entry_to_request(entry) list_entry((entry), struct request, queue)
 #define blkdev_entry_next_request(entry) blkdev_entry_to_request((entry)->next)
 #define blkdev_entry_prev_request(entry) blkdev_entry_to_request((entry)->prev)
diff -ur linux-clean/include/linux/mm.h linux/include/linux/mm.h
--- linux-clean/include/linux/mm.h	Fri Jun 15 02:20:24 2001
+++ linux/include/linux/mm.h	Fri Jun 15 02:26:12 2001
@@ -105,6 +105,10 @@
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)

+/* read ahead limits */
+extern int vm_min_readahead;
+extern int vm_max_readahead;
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff -ur linux-clean/include/linux/raid/md_k.h linux/include/linux/raid/md_k.h
--- linux-clean/include/linux/raid/md_k.h	Sun May 20 12:11:39 2001
+++ linux/include/linux/raid/md_k.h	Fri Jun 15 02:31:24 2001
@@ -89,7 +89,7 @@
 /*
  * default readahead
  */
-#define MD_READAHEAD	MAX_READAHEAD
+#define MD_READAHEAD	vm_max_readahead

 static inline int disk_faulty(mdp_disk_t * d)
 {
diff -ur linux-clean/include/linux/sysctl.h linux/include/linux/sysctl.h
--- linux-clean/include/linux/sysctl.h	Fri May 25 18:01:27 2001
+++ linux/include/linux/sysctl.h	Fri Jun 15 02:24:33 2001
@@ -134,7 +134,9 @@
 	VM_PAGECACHE=7,		/* struct: Set cache memory thresholds */
 	VM_PAGERDAEMON=8,	/* struct: Control kswapd behaviour */
 	VM_PGT_CACHE=9,		/* struct: Set page table cache parameters */
-	VM_PAGE_CLUSTER=10	/* int: set number of pages to swap together */
+	VM_PAGE_CLUSTER=10,	/* int: set number of pages to swap together */
+        VM_MIN_READAHEAD=12,    /* Min file readahead */
+        VM_MAX_READAHEAD=13     /* Max file readahead */
 };


diff -ur linux-clean/kernel/sysctl.c linux/kernel/sysctl.c
--- linux-clean/kernel/sysctl.c	Thu Apr 12 12:20:31 2001
+++ linux/kernel/sysctl.c	Fri Jun 15 02:28:02 2001
@@ -270,6 +270,10 @@
 	 &pgt_cache_water, 2*sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_PAGE_CLUSTER, "page-cluster",
 	 &page_cluster, sizeof(int), 0644, NULL, &proc_dointvec},
+	{VM_MIN_READAHEAD, "min-readahead",
+	&vm_min_readahead,sizeof(int), 0644, NULL, &proc_dointvec},
+	{VM_MAX_READAHEAD, "max-readahead",
+	&vm_max_readahead,sizeof(int), 0644, NULL, &proc_dointvec},
 	{0}
 };

Only in linux/kernel: sysctl.c~
diff -ur linux-clean/mm/filemap.c linux/mm/filemap.c
--- linux-clean/mm/filemap.c	Thu Aug 16 13:12:07 2001
+++ linux/mm/filemap.c	Fri Aug 24 14:08:20 2001
@@ -45,6 +45,12 @@
 unsigned int page_hash_bits;
 struct page **page_hash_table;

+int vm_max_readahead = 31;
+int vm_min_readahead = 3;
+EXPORT_SYMBOL(vm_max_readahead);
+EXPORT_SYMBOL(vm_min_readahead);
+
+
 spinlock_t __cacheline_aligned pagecache_lock = SPIN_LOCK_UNLOCKED;
 /*
  * NOTE: to avoid deadlocking you must never acquire the pagecache_lock with
@@ -870,7 +876,7 @@
 static inline int get_max_readahead(struct inode * inode)
 {
 	if (!inode->i_dev || !max_readahead[MAJOR(inode->i_dev)])
-		return MAX_READAHEAD;
+		return vm_max_readahead;
 	return max_readahead[MAJOR(inode->i_dev)][MINOR(inode->i_dev)];
 }

@@ -1044,8 +1050,8 @@
 		if (filp->f_ramax < needed)
 			filp->f_ramax = needed;

-		if (reada_ok && filp->f_ramax < MIN_READAHEAD)
-				filp->f_ramax = MIN_READAHEAD;
+		if (reada_ok && filp->f_ramax < vm_min_readahead)
+				filp->f_ramax = vm_min_readahead;
 		if (filp->f_ramax > max_readahead)
 			filp->f_ramax = max_readahead;
 	}
--- linux-clean/drivers/ide/ide-probe.c	Sun Mar 18 09:25:02 2001
+++ linux/drivers/ide/ide-probe.c	Fri Jun 15 03:09:49 2001
@@ -779,7 +779,7 @@
 		/* IDE can do up to 128K per request. */
 		*max_sect++ = 255;
 #endif
-		*max_ra++ = MAX_READAHEAD;
+		*max_ra++ = vm_max_readahead;
 	}

 	for (unit = 0; unit < units; ++unit)
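
Not part of the patch itself, just an illustration: once the sysctl is in,
the new knob can be read and set like any other /proc file. A small
userspace C sketch, doing the equivalent of cat / echo on
/proc/sys/vm/max-readahead (writing requires root):

#include <stdio.h>

int main(int argc, char **argv)
{
	const char *path = "/proc/sys/vm/max-readahead";
	FILE *f = fopen(path, "r");
	int val;

	if (!f || fscanf(f, "%d", &val) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("max-readahead is %d pages\n", val);

	if (argc > 1) {			/* e.g. "ra 0" to disable readahead */
		f = fopen(path, "w");
		if (!f || fprintf(f, "%s\n", argv[1]) < 0) {
			perror(path);
			return 1;
		}
		fclose(f);
	}
	return 0;
}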


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  2:49                               ` Lehmann 
@ 2001-08-26 17:29                                 ` Daniel Phillips
  2001-08-26 17:37                                   ` Craig I. Hagan
                                                     ` (3 more replies)
  0 siblings, 4 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 17:29 UTC (permalink / raw)
  To: pcg; +Cc: Rik van Riel, Alan Cox, Roger Larsson, linux-kernel

On August 26, 2001 04:49 am, pcg@goof.com ( Marc) (A.) (Lehmann ) wrote:
> On Sun, Aug 26, 2001 at 03:38:34AM +0200, Daniel Phillips 
<phillips@bonn-fries.net> wrote:
> > Let's test the idea that readahead is the problem.  If it is, then disabling
> > readahead should make the lowlevel disk throughput match the highlevel 
> > throughput.  Marc, could you please try it with this patch:
> 
> No, I rebooted the machine before your mail and since this is a production
> server.. ;)
> 
> Anyway, I compiled and booted into linux-2.4.8-ac9. I used ac8 on my
> desktop machines and was not pleased with absolute performance but, unlike
> the linus' series, I can listen to mp3's while working which was the
> killer feature for me ;)

Yes, this probably points at a bug in linus's tree.  This needs more digging.
You're streaming the mp3's over the net or from your disk?

> anyway, AFAIU, one can tune readahead dynamically under the ac9 series by
> changing:
> 
> isldoom:/proc/sys/vm# cat max-readahead 
> 31
> 
> If this is equivalent to your patch, then fine.

My patch would be equivalent to:

    echo 0 >/proc/sys/vm/max-readahead 

This was just to see if that makes the highlevel throughput match the 
lowlevel throughput, eliminating one variable from the equation.  In -ac you 
have a much more convenient way of doing that.

> if not I will test it at a later time. Now, a question: how does the
> per-block-device read-ahead fit into this picture?  Is it being ignored? I 
> fiddled with it (under 2.4.8pre4) but couldn't see any difference.

It should not be being ignored.  This needs to be looked into.  In any event, 
the max-readahead proc setting is clearly good and needs to be in Linus's 
tree, otherwise changing the default MAX_READAHEAD requires a recompile.  
Worse, there is no way at all to specify the kernel's max-readahead for scsi 
disks - regardless of the fact that scsi disks do their own readahead, the 
kernel will do its own as well, with no way for the user to turn it off.

> [...]
> Now the interesting part. setting read-ahead to 31 again, I increased the
> number of reader threads from one to 64 and got 3.8MB (@450 connections, I
> had to restart the server).
> 
> So the ac9 kernel seems to work much better (than the linus' series),
> although the number of connections was below the critical limit. I'll
> check this when I get higher loads again.

The reason for that is still unclear.  I realize you're testing this under 
live load (you're a brave man) but let's try bringing the -ac max-readahead 
patch across and try it in 2.4.9.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 16:28                     ` Lehmann 
  2001-08-25 16:34                       ` Rik van Riel
@ 2001-08-26 16:55                       ` Daniel Phillips
  2001-08-26 18:39                         ` Rik van Riel
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 16:55 UTC (permalink / raw)
  To: pcg, Rik van Riel; +Cc: Roger Larsson, linux-kernel

On August 25, 2001 06:28 pm, pcg@goof.com ( Marc) (A.) (Lehmann ) wrote:
> On Sat, Aug 25, 2001 at 12:50:51PM -0300, Rik van Riel 
<riel@conectiva.com.br> wrote:
> > Remember NL.linux.org a few weeks back, where a difference of
> > 10 FTP users more or less was the difference between a system
> > load of 3 and a system load of 250 ?  ;)
> 
> OTOH, servers the use a single process or thread per connection are
> destined to fail under load ;)

There's an obvious explanation for the high loadavg people are seeing when 
their systems go into thrash mode: when free is exhausted, every task that 
fails to get a block in __alloc_pages will become PF_MEMALLOC and start 
scanning.  The remaining tasks that are still running will soon also fail in 
__alloc_pages and jump onto the dogpile.  This is a pretty clear 
demonstration of why the current "self-help" strategy is flawed.

Those tasks that can't get memory should block, leaving one or two threads to 
sort things out under predictable conditions, then restart the blocked 
threads as appropriate.
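
The shape of that "block and let one or two threads sort it out" idea,
illustrated with plain userspace C and pthreads since the point here is the
pattern, not the kernel API (all names are made up): the first thread to
find memory exhausted does the reclaim, everyone else sleeps until it is
done instead of piling onto the scan.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  freed = PTHREAD_COND_INITIALIZER;
static int reclaiming;		/* nonzero while one thread reclaims */
static int free_pages = 0;	/* pretend we start out exhausted */

static int reclaim_some_memory(void)
{
	return 32;		/* stand-in for real page reclaim */
}

static void alloc_page_or_wait(void)
{
	pthread_mutex_lock(&lock);
	while (free_pages == 0) {
		if (!reclaiming) {
			reclaiming = 1;		/* this thread does the work */
			pthread_mutex_unlock(&lock);
			int got = reclaim_some_memory();	/* expensive, done unlocked */
			pthread_mutex_lock(&lock);
			free_pages += got;
			reclaiming = 0;
			pthread_cond_broadcast(&freed);
		} else {
			/* everyone else blocks instead of dogpiling */
			pthread_cond_wait(&freed, &lock);
		}
	}
	free_pages--;
	pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
	alloc_page_or_wait();
	printf("thread %ld got a page\n", (long)arg);
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	long i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	return 0;
}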

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 23:23         ` Lehmann 
       [not found]           ` <200108242344.f7ONi0h21270@mailg.telia.com>
  2001-08-25  3:09           ` Rik van Riel
@ 2001-08-26 16:54           ` Daniel Phillips
  2001-08-26 18:59             ` Victor Yodaiken
  2001-08-26 19:31             ` Lehmann 
  2 siblings, 2 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26 16:54 UTC (permalink / raw)
  To: pcg, Rik van Riel; +Cc: Roger Larsson, linux-kernel

On August 25, 2001 01:23 am, pcg@goof.com ( Marc) (A.) (Lehmann ) wrote:
> I could imagine that read() executes, returns to
> userspace and at the same time the kernel thinks "nothing to do, let's
> readahead". While, in the concurrent case, there is hardly a time when no
> read() is running. But read-ahead does not seem to work that way.

But it's a very interesting idea: instead of performing readahead in 
generic_file_read the user thread would calculate the readahead window
information and pass it off to a kernel thread dedicated to readahead.
This thread can make an informed, global decision on how much IO to
submit.  The user thread benefits by avoiding some stalls due to
readahead->readpage, as well as avoiding thrashing due to excessive
readahead.
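
A rough userspace C sketch of that hand-off, using pthreads and invented
names (only meant to show the shape of the idea, not a kernel
implementation): the reading thread computes a window and queues it, and a
single dedicated thread drains the queue and decides what to actually
submit, so the submission decision can be a global one.

#include <pthread.h>
#include <stdio.h>

struct ra_request { long start, count; };

#define QSIZE 16
static struct ra_request queue[QSIZE];
static int head, tail, queued;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qwait = PTHREAD_COND_INITIALIZER;
static int done;

/* Called from the reader's read() path: cheap, never blocks on IO. */
static void queue_readahead(long start, long count)
{
	pthread_mutex_lock(&qlock);
	if (queued < QSIZE) {		/* drop the hint if we're full */
		queue[tail] = (struct ra_request){ start, count };
		tail = (tail + 1) % QSIZE;
		queued++;
		pthread_cond_signal(&qwait);
	}
	pthread_mutex_unlock(&qlock);
}

/* The dedicated readahead thread: sees all windows, decides globally. */
static void *readahead_thread(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&qlock);
	while (!done || queued) {
		if (!queued) {
			pthread_cond_wait(&qwait, &qlock);
			continue;
		}
		struct ra_request r = queue[head];
		head = (head + 1) % QSIZE;
		queued--;
		pthread_mutex_unlock(&qlock);
		/* the real thing would check memory pressure and submit IO here */
		printf("readahead %ld pages from %ld\n", r.count, r.start);
		pthread_mutex_lock(&qlock);
	}
	pthread_mutex_unlock(&qlock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, readahead_thread, NULL);
	queue_readahead(0, 32);
	queue_readahead(1024, 32);
	pthread_mutex_lock(&qlock);
	done = 1;
	pthread_cond_signal(&qwait);
	pthread_mutex_unlock(&qlock);
	pthread_join(t, NULL);
	return 0;
}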

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 15:06                                       ` Rik van Riel
@ 2001-08-26 15:25                                         ` Lehmann 
  0 siblings, 0 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-26 15:25 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Alan Cox, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 12:06:42PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> > It must be enough, unfortunately ;)
> 
> Then you'll need to change the physical structure of
> your disks to eliminate seek time. ;)

well, you could send me 200GB flashdisks, that would certainly help, but
what else could I do that is hardware-cost-neutral?

(there are currently three disks on two ide channels, so there is one
obvious optimization left, but this is completely independent of the
problem ;)

I really think linux should be able to achieve what the hardware can,
rather than fixing linux vm shortcomings with faster disks.

(playing around with readahead did indeed give me a very noticeable
performance improvement, and ~40 mbits is ok).

> Automatic scaling of readahead window and possibly more agressive
> drop-behind could help your system load.

well, the system "load" is very low (50% idle time ;) here is the top output
of the current (typical) load:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 8390 root      20   0 95312  93M  1076 S    43.9 18.5  16:03 myhttpd
 8661 root      17  -4 32464  31M   944 R <  28.4  6.2   1:07 get
 6279 root      18  14  4856 4856   524 R N  27.3  0.9 122:19 dec
 8396 root      10   0 95312  93M  1076 D     6.5 18.5   1:43 myhttpd
 8395 root      11   0 95312  93M  1076 D     5.3 18.5   1:46 myhttpd
 8394 root       9   0 95312  93M  1076 D     4.4 18.5   1:42 myhttpd
 8682 root      19   0  1012 1012   800 R     4.2  0.1   0:01 top

myhttpd is the http server, doing about 4MB/s now @ 743 connections.
"get" is a process that reads usenet news from many different servers and
dec is a decoder that decodes news. The news spool is on a 20Gb, 5 disk
SCSI array, together with the system itself. The machine is a dual P-II
300.

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 2  3  1      0   3004  11436 171676   0   0   225   170  303    75  30  45  25
# the above line was the total system uptime average, now vmstat 5 output:
 1  4  2      0   3056  11060 165760   0   0  7094   103 6905  1046  31  51  18
 1  4  2      0   3056  11264 165532   0   0  6173   183 6146  1051  30  41  29
 2  4  2      0   3056  10988 167656   0   0  7402   150 6706  1204  31  48  21
 0  6  0      0   3056  11196 167344   0   0  7249   265 6760  1318  30  47  23
 0  4  0      0   3056  11336 166876   0   0  1718   190 4995   582  25  19  55
 2  0  0      0   3056  11536 166988   0   0  1057   264 3264   313  22  12  65
 2  5  1      0   2880  11332 152916   0   0  1776   121 2789   280  32  22  46
 1  5  0      0  16108  11472 153984   0   0  1040   215 3255   248  29  15  56
 1  4  3      0   3056  11624 166800   0   0  4406   179 3329   653  32  23  45
 1  4  0      0   3056  10852 167636   0   0  6970   138 5521  1247  34  39  26
 2  4  0      0   3056  11016 167440   0   0  7238   162 5997  1118  36  39  25
 2  4  1      0   3056  11284 177332   0   0  6247    84 5206  1293  34  36  30
 1  4  2      0   3052  11296 181564   0   0  7800    85 5493  1399  35  41  24

There are 4 reader threads ATM, and this coincides nicely with the 4
blocked tasks.

> so we have an idea of what kind of system load we're facing, and
> the active/inactive memory lines from /proc/meminfo ?

I then did: while sleep 5; do grep "\(^In\|^Act\)" </proc/meminfo;done

Active:         144368 kB
Inact_dirty:     29048 kB
Inact_clean:       192 kB
Inact_target:    19348 kB
Active:         154012 kB
Inact_dirty:     14092 kB
Inact_clean:      5556 kB
Inact_target:    19360 kB
Active:         164908 kB
Inact_dirty:     21212 kB
Inact_clean:      5428 kB
Inact_target:    19104 kB
Active:         169788 kB
Inact_dirty:     20652 kB
Inact_clean:      1224 kB
Inact_target:    18912 kB
Active:         147280 kB
Inact_dirty:     37444 kB
Inact_clean:      5080 kB
Inact_target:    19132 kB
Active:         151400 kB
Inact_dirty:     26604 kB
Inact_clean:     10280 kB
Inact_target:    19328 kB
Active:         157288 kB
Inact_dirty:      9312 kB
Inact_clean:     20988 kB
Inact_target:    19500 kB
Active:         160456 kB
Inact_dirty:     11908 kB
Inact_clean:     12112 kB
Inact_target:    19672 kB

> Indeed, something is going wrong ;)
> 
> Lets find out exactly what so we can iron out this bug
> properly.

When that happens I can test whether massively increasing the number of
reader threads changes performance ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 14:55                                     ` Lehmann 
@ 2001-08-26 15:06                                       ` Rik van Riel
  2001-08-26 15:25                                         ` Lehmann 
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 15:06 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Daniel Phillips, Alan Cox, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Marc A. Lehmann wrote:
> On Sun, Aug 26, 2001 at 10:48:05AM -0300, Rik van Riel <riel@conectiva.com.br> wrote:

> > This paper convinced me that doing just elevator sorting
> > is never enough ;)
>
> It must be enough, unfortunately ;)

Then you'll need to change the physical structure of
your disks to eliminate seek time. ;)

> > One thing you could do, in recent -ac kernels, is make the
> > maximum readahead size smaller by lowering the value in
> > /proc/sys/vm/max-readahead
>
> Yes, this indeed helps slightly (with the ac9 kernel I don't see the
> massive thrashing going on with the linus' ones anyway). Now the only
> thing left would be to be able to do that per access

Automatic scaling of readahead window and possibly more agressive
drop-behind could help your system load.

Could you give me a few (heavy load) lines of 'vmstat 5' output
so we have an idea of what kind of system load we're facing, and
the active/inactive memory lines from /proc/meminfo ?

> It seems that I can get in excess of 4MB/s (sometimes 5MB/s) when using
> large bounce-buffers and disabling read-ahead completely under ac9. So
> something with the kernel read-ahead *is* going wrong, as the default
> read-ahead of 31 (pages? 124k) is very similar to my current
> read-buffer size of 128k. And enabling read-ahead and decreasing the
> user-space buffers gives abysmal performance again.

Indeed, something is going wrong ;)

Lets find out exactly what so we can iron out this bug
properly.

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 13:48                                   ` Rik van Riel
@ 2001-08-26 14:55                                     ` Lehmann 
  2001-08-26 15:06                                       ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Lehmann  @ 2001-08-26 14:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Alan Cox, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 10:48:05AM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> Margo Selzer wrote a nice paper on letting an elevator
> algorithm take care of request sorting. Only at the point
> where several thousand requests were queued and latency
> to get something from disk grew to about 30 seconds did
> a disk system relying on just an elevator get anything
> close to decent throughput.

But this is totally irrelevant. The elevator is the only thing left to do.
(and, as I said, the latency until data transfer starts is already about
16 seconds ;).

I believe I can get 5-6MB/s in the best case out of the system, and if the
elevator optimization coming from, say, 64 threads gives me 5.5 instead
of 5MB/s I would be very happy. Conversely, I could reduce the initial
latency by using smaller buffers if this optimization worked.

> This paper convinced me that doing just elevator sorting
> is never enough ;)

It must be enough, unfortunately ;) Ok, I could add user-limits or
somesuch, but I never believed in such a thing myself.

> One thing you could do, in recent -ac kernels, is make the
> maximum readahead size smaller by lowering the value in
> /proc/sys/vm/max-readahead

Yes, this indeed helps slightly (with the ac9 kernel I don't see the
massive thrashing going on with the linus' ones anyway). Now the only
thing left would be to be able to do that per access (even per disk would
help), as I, of course, rely on read-ahead for all other things going on
on that server ;)

It seems that I can get in excess of 4MB/s (sometimes 5MB/s) when using large
bounce-buffers and disabling read-ahead completely under ac9. So something
with the kernel read-ahead *is* going wrong, as the default read-ahead
of 31 (pages? 124k) is very similar to my current read-buffer size of
128k. And enabling read-ahead and decreasing the user-space buffers gives
abysmal performance again.

> 
> regards,
> 
> Rik
> -- 
> IA64: a worthy successor to i860.
> 
> http://www.surriel.com/		http://distro.conectiva.com/
> 
> Send all your spam to aardvark@nl.linux.org (spam digging piggy)
> 

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26 13:22                                 ` Lehmann 
@ 2001-08-26 13:48                                   ` Rik van Riel
  2001-08-26 14:55                                     ` Lehmann 
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26 13:48 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Daniel Phillips, Alan Cox, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Marc A. Lehmann wrote:
> On Sun, Aug 26, 2001 at 12:32:09AM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> > Reality check time indeed.  If you propose that disabling
> > readahead should improve read performance something fishy
> > is going on ;)
>
> Actually, I also believe that. If you have no memory to store
> read-ahead data then your only chance is what I try to do: massively
> parallelize reads so the elevator can optimize what's possible and do
> no read-ahead whatsoever.

Margo Seltzer wrote a nice paper on letting an elevator
algorithm take care of request sorting. Only at the point
where several thousand requests were queued and latency
to get something from disk grew to about 30 seconds did
a disk system relying on just an elevator get anything
close to decent throughput.

This paper convinced me that doing just elevator sorting
is never enough ;)

> The problem is that read-ahead seems to go completely haywire when
> read()s are issued in many threads at the same time.

In that case, probably the readahead windows are too large
or the cache memory used for these readahead pages is too
small.

One thing you could do, in recent -ac kernels, is make the
maximum readahead size smaller by lowering the value in
/proc/sys/vm/max-readahead
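
For completeness, the same knob can be poked from a program instead of a
shell; a minimal sketch, assuming only the max-readahead procfs file above
and that the value is in pages, as elsewhere in this thread:

/* Sketch: set the global readahead cap on kernels that carry the
 * /proc/sys/vm/max-readahead sysctl.  Equivalent to
 *   echo 31 > /proc/sys/vm/max-readahead
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        const char *path = "/proc/sys/vm/max-readahead";
        int pages = (argc > 1) ? atoi(argv[1]) : 31;   /* value in pages */
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return 1;
        }
        fprintf(f, "%d\n", pages);
        return fclose(f) ? 1 : 0;
}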

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  3:32                               ` Rik van Riel
@ 2001-08-26 13:22                                 ` Lehmann 
  2001-08-26 13:48                                   ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Lehmann  @ 2001-08-26 13:22 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Alan Cox, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 12:32:09AM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> Reality check time indeed.  If you propose that disabling
> readahead should improve read performance something fishy
> is going on ;)

Actually, I also believe that. If you have no memory to store read-ahead
data then your only chance is what I try to do: massively parallelize
reads so the elevator can optimize what's possible and do no read-ahead
whatsoever.

Of course, in general, read-ahead is a must. Using a 256k socket buffer
+ 128k userspace bounce buffer is, effectively, a 384k read-ahead per
connection (the average connection speed is about 8k/s btw, so that's
about one read every 16 seconds. 16 seconds is also about the time between
sending out the response headers and the beginning of the data transfer,
which is ok in my case). The problem is that read-ahead seems to go
completely haywire when read()s are issued in many threads at the same time.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  3:40                             ` Rik van Riel
@ 2001-08-26  5:28                               ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26  5:28 UTC (permalink / raw)
  To: Rik van Riel, John Stoffel
  Cc: Marc A. Lehmann, Alan Cox, Roger Larsson, linux-kernel

On August 26, 2001 05:40 am, Rik van Riel wrote:
> On Sat, 25 Aug 2001, John Stoffel wrote:
> 
> > Ummm... is this really more of an agreement that Daniel's used-once
> > patch is a good idea on a system?  Keep a page around if it's used
> > more than once, but drop it quickly if only used once?
> 
> There's a very big difference, though.  With use-once we'll
> also quickly drop the pages we have not yet used, that is,
> the pages we _are about to use_.

You're really complaining about the treatment of readahead pages, not the 
used-once pages.  We can arrange things so that readahead pages get higher 
priority than used-once pages, then become used-once pages when... they get 
used once.  Simple idea, but not a one-line implementation.
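
A toy model of that arrangement, just to make the ordering concrete (the
names are invented, this is nothing like the actual 2.4 code):

/* Readahead pages sit above used-once pages until their first access
 * turns them into ordinary used-once pages; a second use keeps them. */
#include <stdio.h>

enum page_class { PG_USED_ONCE, PG_READAHEAD, PG_ACTIVE };

struct page {
        enum page_class class;
};

/* called when readahead brings the page in */
static void page_add_readahead(struct page *p)
{
        p->class = PG_READAHEAD;        /* protected: not used yet */
}

/* called on each read of the page */
static void page_touch(struct page *p)
{
        if (p->class == PG_READAHEAD)
                p->class = PG_USED_ONCE;   /* it has now been used once */
        else if (p->class == PG_USED_ONCE)
                p->class = PG_ACTIVE;      /* second use: keep it around */
}

/* reclaim would prefer the lowest class first */
static const char *name[] = { "used-once", "readahead", "active" };

int main(void)
{
        struct page p;
        page_add_readahead(&p);  printf("after readahead: %s\n", name[p.class]);
        page_touch(&p);          printf("after 1st use:   %s\n", name[p.class]);
        page_touch(&p);          printf("after 2nd use:   %s\n", name[p.class]);
        return 0;
}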

> Drop-behind specifically drops the pages we have already
> used, giving better protection to the pages we are about
> to use.
> 
> http://linux-mm.org/wiki/moin.cgi/StreamingIo

How will you know not to drop the pages of that header file that is used 
constantly by the compiler?

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  0:46                           ` John Stoffel
  2001-08-26  1:07                             ` Alan Cox
@ 2001-08-26  3:40                             ` Rik van Riel
  2001-08-26  5:28                               ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26  3:40 UTC (permalink / raw)
  To: John Stoffel
  Cc: Marc A. Lehmann, Alan Cox, Daniel Phillips, Roger Larsson, linux-kernel

On Sat, 25 Aug 2001, John Stoffel wrote:

> Ummm... is this really more of an agreement that Daniel's used-once
> patch is a good idea on a system?  Keep a page around if it's used
> more than once, but drop it quickly if only used once?

There's a very big difference, though.  With use-once we'll
also quickly drop the pages we have not yet used, that is,
the pages we _are about to use_.

Drop-behind specifically drops the pages we have already
used, giving better protection to the pages we are about
to use.

http://linux-mm.org/wiki/moin.cgi/StreamingIo

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  1:38                             ` Daniel Phillips
  2001-08-26  2:49                               ` Lehmann 
@ 2001-08-26  3:32                               ` Rik van Riel
  2001-08-26 13:22                                 ` Lehmann 
  1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26  3:32 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Marc A. Lehmann, Alan Cox, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Daniel Phillips wrote:

> Reality check time.

> Let's test the idea that readahead is the problem.  If it is, then
> disabling readahead should make the lowlevel disk throughput match the
> highlevel throughput.

Reality check time indeed.  If you propose that disabling
readahead should improve read performance something fishy
is going on ;)

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  1:07                             ` Alan Cox
@ 2001-08-26  3:30                               ` Rik van Riel
  0 siblings, 0 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-26  3:30 UTC (permalink / raw)
  To: Alan Cox
  Cc: John Stoffel, Marc A. Lehmann, Daniel Phillips, Roger Larsson,
	linux-kernel

On Sun, 26 Aug 2001, Alan Cox wrote:

> > Ummm... is this really more of an agreement that Daniel's used-once
> > patch is a good idea on a system?  Keep a page around if it's used
> > more than once, but drop it quickly if only used once?  But you seem to be
>
> Is there a reason aging alone cannot do most of the work instead?

Yes. We *REALLY* want MRU replacement for streaming IO, that is,
we want to replace the pages we just used with far more preference
than the pages we newly read in and have not used at all yet, but
are about to use.

Doing this by aging up the pages we're about to read would give
readahead pages far too much power to push out normal working set
pages; just aging down the pages behind us (instead of forcefully
deactivating them) looks to me like it wouldn't give them enough
bias over the not-yet-used newly read pages we are about to use.

OTOH, this is something worth experimenting with ;)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  2:02                               ` Rik van Riel
@ 2001-08-26  2:57                                 ` Lehmann 
  0 siblings, 0 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-26  2:57 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Alan Cox, Daniel Phillips, Roger Larsson, linux-kernel

On Sat, Aug 25, 2001 at 11:02:25PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> This is because the readahead windows are too large so the
> kernel ends up evicting data before it's needed and has to
> re-read the data.
> 
> Also see http://linux-mm.org/wiki/moin.cgi/StreamingIo
> in the Linux-MM wiki.

O_STREAMING is an interesting idea (I talked with Stefan Traby about
the problem and we came to a similar conclusion - fadvise instead of
madvise, and per-file characteristics as opposed to block devices or global
settings).
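
For the record, a rough user-space sketch of that per-file-hint idea, using
the posix_fadvise() interface that later covered exactly this use case; the
file name and buffer size here are just placeholders:

/* Hint that a descriptor is read strictly sequentially and explicitly
 * drop what has already been consumed (per-file drop-behind). */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
        char buf[128 * 1024];
        ssize_t n;
        off_t done = 0;
        int fd = open("/data/bigfile", O_RDONLY);

        if (fd < 0) { perror("open"); return 1; }

        /* per-file readahead hint instead of a global or per-disk one */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        while ((n = read(fd, buf, sizeof buf)) > 0) {
                /* ... push buf to the socket here ... */

                /* drop the pages we have already used */
                posix_fadvise(fd, done, n, POSIX_FADV_DONTNEED);
                done += n;
        }
        close(fd);
        return 0;
}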

Anyway, so far it really looks as if this is the case - there is some
limit (around 700 conns) where the available memory doesn't suffice for
read-ahead.

> This problem should be relatively easy to fix for 2.4.

Nice ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  1:38                             ` Daniel Phillips
@ 2001-08-26  2:49                               ` Lehmann 
  2001-08-26 17:29                                 ` Daniel Phillips
  2001-08-26  3:32                               ` Rik van Riel
  1 sibling, 1 reply; 124+ messages in thread
From: Lehmann  @ 2001-08-26  2:49 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Alan Cox, Roger Larsson, linux-kernel

On Sun, Aug 26, 2001 at 03:38:34AM +0200, Daniel Phillips <phillips@bonn-fries.net> wrote:
> Let's test the idea that readahead is the problem.  If it is, then disabling 
> readahead should make the lowlevel disk throughput match the highlevel 
> throughput.  Marc, could you please try it with this patch:

No, I rebooted the machine before your mail, and since this is a production
server... ;)

Anyway, I compiled and booted into linux-2.4.8-ac9. I used ac8 on my
desktop machines and was not pleased with absolute performance but, unlike
the Linus series, I can listen to MP3s while working, which was the
killer feature for me ;)

Anyway, AFAIU, one can tune readahead dynamically under the ac9 series by
changing:

isldoom:/proc/sys/vm# cat max-readahead 
31

If this is equivalent to your patch, then fine; if not, I will test it at
a later time. Now, a question: how does the per-block-device read-ahead
fit into this picture?  Is it being ignored? I fiddled with it (under
2.4.8pre4) but couldn't see any difference.

Anyway, after booting and waiting a minute, I had 574 active connections
and 3.6MB/s (btw, file size is usually hundreds of megabytes, so most of
the work is actually pure read/write. In these tests, I had a userspace
"bounce buffer" of 128k per socket and, unlike earlier tests, 256kb
tcp-wmem).

Now:

isldoom:/proc/sys/vm# echo 511 >max-readahead  

this gave me - after some warming up - 3MB/s at 636 connections (which
doesn't mean very much - could be pure chance).

isldoom:/proc/sys/vm# echo 0 >max-readahead 

now 1.6MB/s at 645 connections. read-ahead seems to be very useful (at
least on ac kernels and under medium load).

free showed me 160 MB free, which, at these connection counts, might be
enough for read-ahead to work effectively (remember my other tests were at
higher load, but I cannot choose this freely ;)

Now the interesting part: setting read-ahead to 31 again, I increased the
number of reader threads from one to 64 and got 3.8MB/s (@450 connections, I
had to restart the server).

So the ac9 kernel seems to work much better (than the Linus series),
although the number of connections was below the critical limit. I'll
check this when I get higher loads again.

At least I now have reasonable behaviour - more parallel reads => slightly
better performance.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 23:34                             ` Lehmann 
@ 2001-08-26  2:02                               ` Rik van Riel
  2001-08-26  2:57                                 ` Lehmann 
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-26  2:02 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Alan Cox, Daniel Phillips, Roger Larsson, linux-kernel

On Sun, 26 Aug 2001, Marc A. Lehmann wrote:
> On Sat, Aug 25, 2001 at 10:33:36PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > exactly this is a point: my disk can do 5mb/s with almost random seeks,
> > > and linux indeed reads 5mb/s from it. but the userspace process doing
> > > read() only ever sees 2mb/s because the kernel throws away all the nice
> > > pages.
> >
> > Which means the VM in the relevant kernel is probably crap or your working
> > set exceeds ram.
>
> The relevant kernel is linux (all 2.4 versions I tested), and no,
> working set exceeding ram should never result in such excessive
> thrashing. So yes, the VM in the kernel is crap (in this particular
> case, namely high-volume fileserving) ;)

This is because the readahead windows are too large so the
kernel ends up evicting data before it's needed and has to
re-read the data.

Also see http://linux-mm.org/wiki/moin.cgi/StreamingIo
in the Linux-MM wiki.

This problem should be relatively easy to fix for 2.4.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 20:52                           ` Rik van Riel
@ 2001-08-26  1:38                             ` Daniel Phillips
  2001-08-26  2:49                               ` Lehmann 
  2001-08-26  3:32                               ` Rik van Riel
  0 siblings, 2 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-26  1:38 UTC (permalink / raw)
  To: Rik van Riel, Marc A. Lehmann; +Cc: Alan Cox, Roger Larsson, linux-kernel

On August 25, 2001 10:52 pm, Rik van Riel wrote:
> On Sat, 25 Aug 2001, Marc A. Lehmann wrote:
> > On Sat, Aug 25, 2001 at 08:15:44PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > > How much disk and bandwidth can you afford? With vsftpd it's certainly over
> > > 1000 parallel downloads on a decent PII box
> >
> > exactly this is a point: my disk can do 5mb/s with almost random
> > seeks, and linux indeed reads 5mb/s from it. but the userspace process
> > doing read() only ever sees 2mb/s because the kernel throws away all
> > the nice pages.
> 
> The trick here is for the kernel to throw away the pages
> the processes have already used and keep in memory the
> data we have not yet used.

Reality check time.  Quoting Marc from the beginning of the thread:

> I tested the following under linux-2.4.8-ac8, linux-2.4.8pre4 and
> 2.4.5pre4, all had similar behaviour.

2.4.5pre4 has drop-behind and so does -ac8.  Still, if the readahead window 
is too large then drop-behind isn't going to help a lot.

Let's test the idea that readahead is the problem.  If it is, then disabling 
readahead should make the lowlevel disk throughput match the highlevel 
throughput.  Marc, could you please try it with this patch:

--- ../2.4.9.clean/mm/filemap.c	Thu Aug 16 14:12:07 2001
+++ ./mm/filemap.c	Sun Aug 26 02:24:50 2001
@@ -886,6 +886,7 @@
 
 	raend = filp->f_raend;
 	max_ahead = 0;
+return;
 
 /*
  * The current page is locked.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-26  0:46                           ` John Stoffel
@ 2001-08-26  1:07                             ` Alan Cox
  2001-08-26  3:30                               ` Rik van Riel
  2001-08-26  3:40                             ` Rik van Riel
  1 sibling, 1 reply; 124+ messages in thread
From: Alan Cox @ 2001-08-26  1:07 UTC (permalink / raw)
  To: John Stoffel
  Cc: Rik van Riel, Marc A. Lehmann, Alan Cox, Daniel Phillips,
	Roger Larsson, linux-kernel

> Ummm... is this really more of an agreement that Daniel's used-once
> patch is a good idea on a system?  Keep a page around if it's used
> more than once, but drop it quickly if only used once?  But you seem to be

Is there a reason aging alone cannot do most of the work instead? When you
readahead a page, you look to age a page that is a bit over the readahead
window further back in the file, if it is in memory and has no mapped users
- i.e. it's just cache.

That seems to cost us extra lookups in the page cache hashes but pushes things
in the right direction.

Alan


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 19:35                         ` Lehmann 
  2001-08-25 20:52                           ` Rik van Riel
  2001-08-25 21:33                           ` Alan Cox
@ 2001-08-26  0:46                           ` John Stoffel
  2001-08-26  1:07                             ` Alan Cox
  2001-08-26  3:40                             ` Rik van Riel
  2 siblings, 2 replies; 124+ messages in thread
From: John Stoffel @ 2001-08-26  0:46 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Marc A. Lehmann, Alan Cox, Daniel Phillips, Roger Larsson, linux-kernel


Rik> On Sat, 25 Aug 2001, Marc A. Lehmann wrote:
>> On Sat, Aug 25, 2001 at 08:15:44PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>> > How much disk and bandwidth can you afford? With vsftpd it's certainly over
>> > 1000 parallel downloads on a decent PII box

>> exactly this is a point: my disk can do 5mb/s with almost random
>> seeks, and linux indeed reads 5mb/s from it. but the userspace
>> process doing read() only ever sees 2mb/s because the kernel throws
>> away all the nice pages.

Rik> The trick here is for the kernel to throw away the pages the
Rik> processes have already used and keep in memory the data we have
Rik> not yet used.

Ummm... is this really more of an agreement that Daniel's used-once
patch is a good idea on a system?  Keep a page around if it's used
more than once, but drop it quickly if only used once?  But you seem to be
saying that in this one case (serving lots of files to multiple
clients) you should be reading in as fast as possible, but then
dropping those read-in pages right away.

If so, shouldn't this be more of an application-level optimization,
and not so much the VM's problem?  The VM needs to
be general and fair over a large spread of loads; I can't imagine that
we'll get it right for every single possible load out there.

Maybe what would help here is having some type of VM simulator written
where we could plug in the current Linux VM and instrument it and
watch what it does under a variety of loads and make it react in a
smooth fashion.  Right now it looks like people are just tweaking
stuff left and right (though since I don't know the core code at all,
it's just my impression) without having a good theoretical
understanding of what's really happening.

I've been following this VM discussion for a while now, and I'm not
sure we're really closer to fixing the corner case where the load gets
out of hand.  But I do think Daniel is on the right track in terms of
trying to wake up and free as many pages as were allocated in a
quantum.  Trying to wake up once a wall-clock second doesn't seem as
realistic.  Under very light loads, you're just spending time doing
work that isn't needed.  Under heavy loads, you may be reacting way
too slowly.

As we get closer to being out of free pages, we should work harder,
but also do less work as the percentage of free pages goes up and the
system VM load goes down.  

Enough of my rambling, and please realize I appreciate all the work
and words that have been flowing around this topic; it's getting
better all the time and we've got enough viewpoints going to keep
everyone honest and working toward the same goal.

John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 21:33                           ` Alan Cox
@ 2001-08-25 23:34                             ` Lehmann 
  2001-08-26  2:02                               ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Lehmann  @ 2001-08-25 23:34 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rik van Riel, Roger Larsson, linux-kernel

On Sat, Aug 25, 2001 at 10:33:36PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > exactly this is a point: my disk can do 5mb/s with almost random seeks,
> > and linux indeed reads 5mb/s from it. but the userspace process doing
> > read() only ever sees 2mb/s because the kernel throws away all the nice
> > pages.
> 
> Which means the VM in the relevant kernel is probably crap or your working
> set exceeds ram.

The relevant kernel is linux (all 2.4 versions I tested), and no, working
set exceeding ram should never result in such excessive thrashing. So yes,
the VM in the kernel is crap (in this particular case, namely high-volume
fileserving) ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 19:35                         ` Lehmann 
  2001-08-25 20:52                           ` Rik van Riel
@ 2001-08-25 21:33                           ` Alan Cox
  2001-08-25 23:34                             ` Lehmann 
  2001-08-26  0:46                           ` John Stoffel
  2 siblings, 1 reply; 124+ messages in thread
From: Alan Cox @ 2001-08-25 21:33 UTC (permalink / raw)
  To: pcg( Marc)@goof(A.).(Lehmann )com
  Cc: Alan Cox, Daniel Phillips, Rik van Riel, Roger Larsson, linux-kernel

> exactly this is a point: my disk can do 5mb/s with almost random seeks,
> and linux indeed reads 5mb/s from it. but the userspace process doing
> read() only ever sees 2mb/s because the kernel throws away all the nice
> pages.

Which means the VM in the relevant kernel is probably crap or your working
set exceeds ram.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 19:35                         ` Lehmann 
@ 2001-08-25 20:52                           ` Rik van Riel
  2001-08-26  1:38                             ` Daniel Phillips
  2001-08-25 21:33                           ` Alan Cox
  2001-08-26  0:46                           ` John Stoffel
  2 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-25 20:52 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Alan Cox, Daniel Phillips, Roger Larsson, linux-kernel

On Sat, 25 Aug 2001, Marc A. Lehmann wrote:
> On Sat, Aug 25, 2001 at 08:15:44PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> > How much disk and bandwidth can you afford? With vsftpd it's certainly over
> > 1000 parallel downloads on a decent PII box
>
> exactly this is a point: my disk can do 5mb/s with almost random
> seeks, and linux indeed reads 5mb/s from it. but the userspace process
> doing read() only ever sees 2mb/s because the kernel throws away all
> the nice pages.

The trick here is for the kernel to throw away the pages
the processes have already used and keep in memory the
data we have not yet used.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 19:15                       ` Alan Cox
@ 2001-08-25 19:35                         ` Lehmann 
  2001-08-25 20:52                           ` Rik van Riel
                                             ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-25 19:35 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rik van Riel, Roger Larsson, linux-kernel

On Sat, Aug 25, 2001 at 08:15:44PM +0100, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
> How much disk and bandwidth can you afford? With vsftpd it's certainly over
> 1000 parallel downloads on a decent PII box

exactly this is a point: my disk can do 5mb/s with almost random seeks,
and linux indeed reads 5mb/s from it. but the userspace process doing
read() only ever sees 2mb/s because the kernel throws away all the nice
pages.

I doubt vsftpd would help at all.

(one can easily get 2000 or more parallel downloads, if you are willing to
go with 1kb/s).

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 16:43                     ` Daniel Phillips
@ 2001-08-25 19:15                       ` Alan Cox
  2001-08-25 19:35                         ` Lehmann 
  0 siblings, 1 reply; 124+ messages in thread
From: Alan Cox @ 2001-08-25 19:15 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

> Well, let's look into that then.  Surely you hit the wall at some point, no
> matter which replacement policy you use.  How many simultaneous downloads can 
> you handle with 2.4.7 vs 2.4.8?

How much disk and bandwidth can you afford? With vsftpd it's certainly over
1000 parallel downloads on a decent PII box
> 


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 17:56                   ` Roger Larsson
@ 2001-08-25 19:13                     ` Gérard Roudier
  0 siblings, 0 replies; 124+ messages in thread
From: Gérard Roudier @ 2001-08-25 19:13 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel



On Sat, 25 Aug 2001, Roger Larsson wrote:

> On Saturday, 25 August 2001 13:49, Gérard Roudier wrote:
> > On Sat, 25 Aug 2001, Roger Larsson wrote:
>
> > > Where did the seek time go? And rotational latency?
> >
> > They just go away, for the following reasons:
> >
> > - For random IOs, IOs are rather seek time bounded.
> >   Tagged commands may help here.
>
> > - For sequential IOs, the drive is supposed to prefetch data.
> >
>
> Ok, the disk does buffering itself (but it has less memory than the host).
> But will it do read-ahead when there is a request to do something else waiting?
> Will it continue to read when it should have moved?
> (Note: let's assume that the disk is never out of requests)
> When it moves, the arm has to move and the platter will rotate to the right
> spot - let's see. You can start reading directly when you are at the right
> track. It can be stuff needed later... (ok, latency can be ignored)
>
> > > This is mine (an IBM Deskstar 75GXP)
> > > Sustained data rate (MB/sec)  37
> > > Seek time  (read typical)
> > >       Average (ms)             8.5
> > >       Track-to-track (ms)      1.2
> > >       Full-track (ms)         15.0
> > > Data buffer                   2 MB
> > > Latency (calculated 7200 RPM) 4.2 ms
> > >
> > > So sustained data rate is 37 MB/s
> > >
> > > > hdparm -t gives around 35 MB/s
> > >
> > > best I got during testing on real files is 32 MB/s
>
> Note: the 32 MB/s is for reading two big files sequentially
> (first one, then the other).
> If I instead diff the two files, the throughput drops:
> 2.4.0 gave around 15 MB/s
> 2.4.7 gave only 11 MB/s
> 2.4.8-pre3 give around 25 MB/s (READA higher than normal)
> 2.4.9-pre2 gives 11 MB/s
> 2.4.9-pre4 gives 12 MB/s
> My patch against 2.4.8-pre1 gives 28 MB/s

I didn't write that your calculation was incorrect. :)

But diffing large files on the same hard disk is not that usual a
pattern. The application (diff) knows about it. So, it should be up to it to
use large buffering per file, instead of relying on guessing from the
kernel.

Just entering 'man diff' shows that no buffer size option seems to be
available with GNU diff. :)

Early UNIXes did one IO per FS block. This should have been true for BSD <
4, I think. At that time, FS allocation was based on a free block list and
FS fragmented very quickly. Disk throughput probably did not exceed
100KB/second. Despite all that, programmers were certainly happy to diff
source files on those systems. But times have changed... :-)

  Gérard.



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 11:49                 ` Gérard Roudier
@ 2001-08-25 17:56                   ` Roger Larsson
  2001-08-25 19:13                     ` Gérard Roudier
  0 siblings, 1 reply; 124+ messages in thread
From: Roger Larsson @ 2001-08-25 17:56 UTC (permalink / raw)
  To: Gérard Roudier; +Cc: linux-kernel

On Saturday, 25 August 2001 13:49, Gérard Roudier wrote:
> On Sat, 25 Aug 2001, Roger Larsson wrote:

> > Where did the seek time go? And rotational latency?
>
> They just go away, for the following reasons:
>
> - For random IOs, IOs are rather seek time bounded.
>   Tagged commands may help here.

> - For sequential IOs, the drive is supposed to prefetch data.
>

Ok, the disk does buffering itself (but it has less memory than the host).
But will it do read-ahead when there is a request to do something else waiting?
Will it continue to read when it should have moved?
(Note: let's assume that the disk is never out of requests)
When it moves, the arm has to move and the platter will rotate to the right
spot - let's see. You can start reading directly when you are at the right
track. It can be stuff needed later... (ok, latency can be ignored)

> > This is mine (an IBM Deskstar 75GXP)
> > Sustained data rate (MB/sec)  37
> > Seek time  (read typical)
> >       Average (ms)             8.5
> >       Track-to-track (ms)      1.2
> >       Full-track (ms)         15.0
> > Data buffer                   2 MB
> > Latency (calculated 7200 RPM) 4.2 ms
> >
> > So sustained data rate is 37 MB/s
> >
> > > hdparm -t gives around 35 MB/s
> >
> > best I got during testing on real files is 32 MB/s

Note: the 32 MB/s is for reading two big files sequentially
(first one, then the other).
If I instead diff the two files, the throughput drops:
2.4.0 gave around 15 MB/s
2.4.7 gave only 11 MB/s
2.4.8-pre3 give around 25 MB/s (READA higher than normal)
2.4.9-pre2 gives 11 MB/s
2.4.9-pre4 gives 12 MB/s
My patch against 2.4.8-pre1 gives 28 MB/s

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 15:50                   ` Rik van Riel
  2001-08-25 16:28                     ` Lehmann 
@ 2001-08-25 16:43                     ` Daniel Phillips
  2001-08-25 19:15                       ` Alan Cox
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-25 16:43 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On August 25, 2001 05:50 pm, Rik van Riel wrote:
> On Sat, 25 Aug 2001, Daniel Phillips wrote:
> 
> > > True, it's just an issue of performance and heavily used
> > > servers falling over under load, nothing as serious as
> > > data corruption or system instability.
> >
> > If your server is falling over under load, this is not the reason.
> 
> I bet your opinion will be changed the moment you see a system
> get close to falling over exactly because of this.
> 
> Remember NL.linux.org a few weeks back, where a difference of
> 10 FTP users more or less was the difference between a system
> load of 3 and a system load of 250 ?  ;)

Well, let's look into that then.  Surely you hit the wall at some point, no
matter which replacement policy you use.  How many simultaneous downloads can 
you handle with 2.4.7 vs 2.4.8?

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 16:34                       ` Rik van Riel
@ 2001-08-25 16:41                         ` Lehmann 
  0 siblings, 0 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-25 16:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Roger Larsson, linux-kernel

On Sat, Aug 25, 2001 at 01:34:36PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> That wouldn't have made a big difference in this case, except that
> one process doing readahead window thrashing with its own readahead
> data would have fallen over even worse ...

I don't know that case, but in my case, one process thrashing its
read-ahead window is twice as fast as 16 processes doing so ;)

I wonder whether these problems can be fixed at all, with all these
different situations where the VM behaves totally differently in (seemingly)
similar cases.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 16:28                     ` Lehmann 
@ 2001-08-25 16:34                       ` Rik van Riel
  2001-08-25 16:41                         ` Lehmann 
  2001-08-26 16:55                       ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-25 16:34 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Daniel Phillips, Roger Larsson, linux-kernel

On Sat, 25 Aug 2001, Marc A. Lehmann wrote:
> On Sat, Aug 25, 2001 at 12:50:51PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> > Remember NL.linux.org a few weeks back, where a difference of
> > 10 FTP users more or less was the difference between a system
> > load of 3 and a system load of 250 ?  ;)
>
> OTOH, servers that use a single process or thread per connection are
> destined to fail under load ;)

That wouldn't have made a big difference in this case, except that
one process doing readahead window thrashing with its own readahead
data would have fallen over even worse ...

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 15:50                   ` Rik van Riel
@ 2001-08-25 16:28                     ` Lehmann 
  2001-08-25 16:34                       ` Rik van Riel
  2001-08-26 16:55                       ` Daniel Phillips
  2001-08-25 16:43                     ` Daniel Phillips
  1 sibling, 2 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-25 16:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Roger Larsson, linux-kernel

On Sat, Aug 25, 2001 at 12:50:51PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> Remember NL.linux.org a few weeks back, where a difference of
> 10 FTP users more or less was the difference between a system
> load of 3 and a system load of 250 ?  ;)

OTOH, servers that use a single process or thread per connection are
destined to fail under load ;)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25 15:49                 ` Daniel Phillips
@ 2001-08-25 15:50                   ` Rik van Riel
  2001-08-25 16:28                     ` Lehmann 
  2001-08-25 16:43                     ` Daniel Phillips
  0 siblings, 2 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-25 15:50 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Sat, 25 Aug 2001, Daniel Phillips wrote:

> > True, it's just an issue of performance and heavily used
> > servers falling over under load, nothing as serious as
> > data corruption or system instability.
>
> If your server is falling over under load, this is not the reason.

I bet your opinion will be changed the moment you see a system
get close to falling over exactly because of this.

Remember NL.linux.org a few weeks back, where a difference of
10 FTP users more or less was the difference between a system
load of 3 and a system load of 250 ?  ;)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25  1:34               ` Rik van Riel
@ 2001-08-25 15:49                 ` Daniel Phillips
  2001-08-25 15:50                   ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-25 15:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On August 25, 2001 03:34 am, Rik van Riel wrote:
> On Sat, 25 Aug 2001, Daniel Phillips wrote:
> > My point is, even with the case you supplied the expected behaviour of
> > the existing algorithm is acceptable.  There is no burning fire to put
> > out, not here anyway.
> 
> True, it's just an issue of performance and heavily used
> servers falling over under load, nothing as serious as
> data corruption or system instability.

If your server is falling over under load, this is not the reason.

--
Daniel

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25  8:02             ` Gérard Roudier
  2001-08-25  9:26               ` Roger Larsson
@ 2001-08-25 13:17               ` Rik van Riel
  1 sibling, 0 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-25 13:17 UTC (permalink / raw)
  To: Gérard Roudier; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Sat, 25 Aug 2001, Gérard Roudier wrote:

> With such values, given a U160 SCSI BUS, using 64K IO chunks will
> result in about less than 25% of bandwidth used for the SCSI protocol

Unless you donate one of those sets to me, I'm not going
to see decent IO with smaller IO sizes. ;)

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25  9:26               ` Roger Larsson
@ 2001-08-25 11:49                 ` Gérard Roudier
  2001-08-25 17:56                   ` Roger Larsson
  0 siblings, 1 reply; 124+ messages in thread
From: Gérard Roudier @ 2001-08-25 11:49 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Rik van Riel, Marc A. Lehmann, linux-kernel, oesi



On Sat, 25 Aug 2001, Roger Larsson wrote:

> On Saturday, 25 August 2001 10:02, Gérard Roudier wrote:
> > On Fri, 24 Aug 2001, Rik van Riel wrote:
> > > On Fri, 24 Aug 2001, Gérard Roudier wrote:
> > > > The larger the read-ahead chunks, the more likely thrashing will
> > > > occur. In my opinion, using more than 128 K IO chunks will not
> > > > improve performance with modern hard disks connected to a
> > > > reasonably fast controller, but will increase memory pressure
> > > > and probably thrashing.
> > >
> > > Your opinion seems to differ from actual measurements
> > > made by Roger Larsson and other people.
> >
> > The part of my posting that talked about modern hard disks sustaining more
> > than 8000 IOs per second and controllers sustaining 15000 IOs per second
> > is a _measurement_.
> >
> > With such values, given a U160 SCSI BUS, using 64K IO chunks will result
> > in about less than 25% of bandwidth used for the SCSI protocol and 75% for
> > useful data at full load (about 2000 IO/s - 120 MB/s). This is a
> > _calculation_. With 128K IO chunks, less than 15% of the SCSI BUS will be
> > used for the SCSI protocol and more than 85% for useful data. Still a
> > _calculation_.
> >
> > This lets me claim - an opinion based on fairly simple calculations - that if
> > using more than 128 K IO chunks gives significantly better throughput, then
> > some serious IO scheduling problem should exist in kernel IO subsystems.
>
> Where did the seek time go? And rotational latency?

They just go away, for the following reasons:

- For random IOs, IOs are rather seek time bounded.
  Tagged commands may help here.
- For sequential IOs, the drive is supposed to prefetch data.

Everything is a compromise. If for N K of buffering you get 90% of the
throughput, then the compromise looks good to me. 2*N K of buffering is
likely to give no more than 5% gain for *twice* the cost in buffering.

> [I hope my calculations are correct]
>
>
> > This is mine (an IBM Deskstar 75GXP)
> Sustained data rate (MB/sec)	37
> Seek time  (read typical)
> 	Average (ms)		 8.5
> 	Track-to-track (ms)	 1.2
> 	Full-track (ms)		15.0
> Data buffer			2 MB
> Latency (calculated 7200 RPM)	4.2 ms
>
> So sustained data rate is 37 MB/s
> > hdparm -t gives around 35 MB/s
> > best I got during testing on real files is 32 MB/s

Looks like a less than 10% slow-down.
Did you take into account the bmap() thing?
Are you sure all the read data are on the outer tracks which are
the fastest ones?

> A small calculation:
> Track-to-track time is 1.2 ms + time to rotate 4.2 ms = 5.4 ms
> In this time I can read 37 MB/s * 5.4 ms = 200 kB or
> more than 48 pages (instead of moving the head to the closest
> track I could read 48 pages...)
>
> Average is 8.5 ms + 4.2 ms => 114 pages
>
> Reading a maximum of 114 pages at a time gives on the average
> half the maximum sustained throughput for this disk.

And reading 1140 pages at a time will give 90% of the maximum disk
throughput, for random IOs. :-)
Btw, I thought that IDE was very limited in actual IO size.

> But the buffer holds 512 pages... => I tried with that. [overkill]

The drive buffer is supposed to help overlap actual IOs with the medium.
-> Read prefetching, write behind caching.

> * Data from a Seagate Cheetah X15 ST336732LC (Better than most
> disks out there - way better than mine)
> Formatted Int Transfer Rate (min)	51 MBytes/sec
> Formatted Int Transfer Rate (max)	69 MBytes/sec
> Average Seek Time, Read		3.7 msec typical
> Track-to-Track Seek, Read	0.5 msec typical
> Average Latency			2 msec
> Default Buffer (cache) Size	8,192 Kbytes
> Spindle Speed			15K RPM
>
> Same calculation:
> Track-to-track: (2 + 0.5) ms * 51 MB = 127 kB or 31 pages (42 pages max)
> Average: (2 + 3.7 ms) * 51 MB/s = 290 kB or almost 71 pages (96 pages max)
>
> Reading a maximum of 71 pages gives on the average half the
> maximum sustained throughput. [smart buffering in the drive might help
> when reading from several streams]

It helps even for a single stream by prefetching data. If your
files are mostly contiguous, you will not see the various disk latencies.

On the other hand, the kernel asynchronous read-ahead code will queue 1 IO
in advance that will be overlapped with the processing of previous data by
the application.

Btw, if the application just discards the data, this is for sure not
useful. :)

  Gérard.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25  8:02             ` Gérard Roudier
@ 2001-08-25  9:26               ` Roger Larsson
  2001-08-25 11:49                 ` Gérard Roudier
  2001-08-25 13:17               ` Rik van Riel
  1 sibling, 1 reply; 124+ messages in thread
From: Roger Larsson @ 2001-08-25  9:26 UTC (permalink / raw)
  To: Gérard Roudier, Rik van Riel; +Cc: Marc A. Lehmann, linux-kernel, oesi

On Saturday, 25 August 2001 10:02, Gérard Roudier wrote:
> On Fri, 24 Aug 2001, Rik van Riel wrote:
> > On Fri, 24 Aug 2001, Gérard Roudier wrote:
> > > The larger the read-ahead chunks, the more likely thrashing will
> > > occur. In my opinion, using more than 128 K IO chunks will not
> > > improve performance with modern hard disks connected to a
> > > reasonably fast controller, but will increase memory pressure
> > > and probably thrashing.
> >
> > Your opinion seems to differ from actual measurements
> > made by Roger Larsson and other people.
>
> The part of my posting that talked about modern hard disks sustaining more
> than 8000 IOs per second and controllers sustaining 15000 IOs per second
> is a _measurement_.
>
> With such values, given a U160 SCSI BUS, using 64K IO chunks will result
> in about less than 25% of bandwidth used for the SCSI protocol and 75% for
> useful data at full load (about 2000 IO/s - 120 MB/s). This is a
> _calculation_. With 128K IO chunks, less than 15% of the SCSI BUS will be
> used for the SCSI protocol and more than 85% for useful data. Still a
> _calculation_.
>
> This lets me claim - an opinion based on fairly simple calculations - that if
> using more than 128 K IO chunks gives significantly better throughput, then
> some serious IO scheduling problem should exist in kernel IO subsystems.

Where did the seek time go? And rotational latency?

[I hope my calculations are correct]


This is mine (an IBM Deskstar 75GXP)
Sustained data rate (MB/sec)	37
Seek time  (read typical)
	Average (ms)		 8.5
	Track-to-track (ms)	 1.2
	Full-track (ms)		15.0
Data buffer			2 MB
Latency (calculated 7200 RPM)	4.2 ms

So sustained data rate is 37 MB/s
> hdparm -t gives around 35 MB/s
best I got during testing on real files is 32 MB/s

A small calculation:
Track-to-track time is 1.2 ms + time to rotate 4.2 ms = 5.4 ms
In this time I can read 37 MB/s * 5.4 ms = 200 kB or 
more than 48 pages (instead of moving the head to the closest
track I could read 48 pages...)

Average is 8.5 ms + 4.2 ms => 114 pages 

Reading a maximum of 114 pages at a time gives on the average
half the maximum sustained throughput for this disk.
But the buffer holds 512 pages... => I tried with that. [overkill]

* Data from a Seagate Cheetah X15 ST336732LC (Better than most
disks out there - way better than mine)
Formatted Int Transfer Rate (min)	51 MBytes/sec
Formatted Int Transfer Rate (max)	69 MBytes/sec
Average Seek Time, Read		3.7 msec typical
Track-to-Track Seek, Read	0.5 msec typical
Average Latency			2 msec
Default Buffer (cache) Size	8,192 Kbytes
Spindle Speed			15K RPM

Same calculation:
Track-to-track: (2 + 0.5) ms * 51 MB/s = 127 kB or 31 pages (42 pages max)
Average: (2 + 3.7 ms) * 51 MB/s = 290 kB or almost 71 pages (96 pages max)

Reading a maximum of 71 pages gives on the average half the
maximum sustained throughput. [smart buffering in the drive might help
when reading from several streams]
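
The same arithmetic, wrapped in a tiny program so other drives can be
plugged in; the only extra assumption is the usual model
throughput = rate*R/(R + rate*dead_time), which is where the "half" above
and the 9x-for-90% figure come from:

/* How many 4K pages must be read per seek to reach a given fraction
 * of the sustained rate.  Drive numbers are the two from this mail. */
#include <stdio.h>

static void show(const char *drive, double mb_s, double seek_ms, double lat_ms)
{
        double dead_s = (seek_ms + lat_ms) / 1000.0;     /* time lost per seek */
        double pages  = mb_s * 1e6 * dead_s / 4096.0;    /* data lost per seek */

        /* reading exactly that much per seek spends half the time seeking
         * (50% of sustained rate); N times as much gives N/(N+1). */
        printf("%-22s %6.1f pages/seek for 50%%, %7.1f for 90%%\n",
               drive, pages, 9.0 * pages);
}

int main(void)
{
        show("IBM Deskstar 75GXP", 37.0, 8.5, 4.2);   /* prints ~114 / ~1030 */
        show("Seagate Cheetah X15", 51.0, 3.7, 2.0);  /* prints  ~71 /  ~640 */
        return 0;
}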


/RogerL

-- 
Roger Larsson
Skellefteå
Sweden

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25  3:09           ` Rik van Riel
@ 2001-08-25  9:13             ` Gérard Roudier
  0 siblings, 0 replies; 124+ messages in thread
From: Gérard Roudier @ 2001-08-25  9:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Marc A. Lehmann, Daniel Phillips, Roger Larsson, linux-kernel



On Sat, 25 Aug 2001, Rik van Riel wrote:

> On Sat, 25 Aug 2001, Marc A. Lehmann wrote:
> > On Fri, Aug 24, 2001 at 05:19:07PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> > > Actually, no.  FIFO would be ok if you had ONE readahead
> > > stream going on, but when you have multiple readahead
> >
> > Do we all agree that read-ahead is actually the problem? ATM, I serve
> > ~800 files, read()ing them in turn. When I increase the number of
> > threads I have more reads at the same time in the kernel, but the
> > absolute number of read() requests decreases.
>
> 	[snip evidence beyond all doubt]

I am not so sure. :)

Btw, the new VM and elevator have taken very long to stabilize. Such
an erratic development process does not make software appear trustworthy to
me. I would suspect some flaws in these parts, too.

> Earlier today some talking between VM developers resulted
> in us agreeing on trying to fix this problem by implementing
> dynamic window scaling for readahead, using heuristics not
> all that much different from TCP window scaling.

It is probably time to rewrite this old code. Good luck with that.

> This should make the system able to withstand a higher load
> than currently, while also allowing fast data streams to
> work more efficiently than currently.

That's the wish. Just crossing fingers for things to work a lot better than
the development of your new VM and elevator improvements. :-)

Regards,
  Gérard.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 23:12           ` Rik van Riel
@ 2001-08-25  8:02             ` Gérard Roudier
  2001-08-25  9:26               ` Roger Larsson
  2001-08-25 13:17               ` Rik van Riel
  0 siblings, 2 replies; 124+ messages in thread
From: Gérard Roudier @ 2001-08-25  8:02 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi



On Fri, 24 Aug 2001, Rik van Riel wrote:

> On Fri, 24 Aug 2001, Gérard Roudier wrote:
>
> > The larger the read-ahead chunks, the more likely thrashing will
> > occur. In my opinion, using more than 128 K IO chunks will not
> > improve performance with modern hard disks connected to a
> > reasonably fast controller, but will increase memory pressure
> > and probably thrashing.
>
> Your opinion seems to differ from actual measurements
> made by Roger Larsson and other people.

The part of my posting that talked about modern hard disks sustaining more
than 8000 IOs per second and controllers sustaining 15000 IOs per second
is a _measurement_.

With such values, given a U160 SCSI BUS, using 64K IO chunks will result
in about less than 25% of bandwidth used for the SCSI protocol and 75% for
useful data at full load (about 2000 IO/s - 120 MB/s). This is a
_calculation_. With 128K IO chunks, less than 15% of the SCSI BUS will be
used for the SCSI protocol and more than 85% for useful data. Still a
_calculation_.

This lets me claim - an opinion based on fairly simple calculations - that if
using more than 128 K IO chunks gives significantly better throughput, then
some serious IO scheduling problem should exist in kernel IO subsystems.
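
A quick back-of-the-envelope check of those percentages; the ~125 us
per-command figure below is an assumption (simply 1/8000 s, the disk's
command rate), not something measured on a bus analyzer, but it reproduces
the 25%/15% numbers above:

/* Compare per-command protocol overhead with wire time on a 160 MB/s bus. */
#include <stdio.h>

int main(void)
{
        const double bus_mb_s = 160.0;      /* U160 */
        const double cmd_us   = 125.0;      /* assumed per-IO protocol overhead */
        const int chunks_kb[] = { 64, 128 };

        for (int i = 0; i < 2; i++) {
                double data_us = chunks_kb[i] * 1024.0 / (bus_mb_s * 1e6) * 1e6;
                double ohead   = cmd_us / (cmd_us + data_us) * 100.0;
                printf("%3d KB chunks: ~%4.1f%% of the bus spent on protocol\n",
                       chunks_kb[i], ohead);   /* ~23% and ~13% */
        }
        return 0;
}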

> But yes, increasing the readahead window also increases
> the chance of readahead window thrashing. Luckily we can
> detect fairly easily if this is happening and use that
> to automatically shrink the readahead window...

Using too much buffering when it is not needed may lead to a great penalty
everywhere, not only in kernel memory management. All caching entities in
the system hardware and devices will uselessly get pressure and will slow
down as a result.

Band-aiding scheduling flaws by using huge buffering is a fairly stupid
approach, in my opinion.

Regards,
  Gérard.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 23:23         ` Lehmann 
       [not found]           ` <200108242344.f7ONi0h21270@mailg.telia.com>
@ 2001-08-25  3:09           ` Rik van Riel
  2001-08-25  9:13             ` Gérard Roudier
  2001-08-26 16:54           ` Daniel Phillips
  2 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-25  3:09 UTC (permalink / raw)
  To: Marc A. Lehmann; +Cc: Daniel Phillips, Roger Larsson, linux-kernel

On Sat, 25 Aug 2001, Marc A. Lehmann wrote:
> On Fri, Aug 24, 2001 at 05:19:07PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> > Actually, no.  FIFO would be ok if you had ONE readahead
> > stream going on, but when you have multiple readahead
>
> Do we all agree that read-ahead is actually the problem? ATM, I serve
> ~800 files, read()ing them in turn. When I increase the number of
> threads I have more reads at the same time in the kernel, but the
> absolute number of read() requests decreases.

	[snip evidence beyond all doubt]

Earlier today some talking between VM developers resulted
in us agreeing on trying to fix this problem by implementing
dynamic window scaling for readahead, using heuristics not
all that much different from TCP window scaling.

This should make the system able to withstand a higher load
than currently, while also allowing fast data streams to
work more efficiently than currently.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-25  0:41             ` Daniel Phillips
@ 2001-08-25  1:34               ` Rik van Riel
  2001-08-25 15:49                 ` Daniel Phillips
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-25  1:34 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Sat, 25 Aug 2001, Daniel Phillips wrote:

> > The queue looks like this, with new pages being added to the
> > front and old pages being dropped off the right side:
> > 	AAaaBBbbCCccDDdd
> >
> > With the current use-once thing, we will end up dropping ALL
> > pages from file D, even the ones we are about to use (DDdd).
>
> I call that gracefull.  Look, you only lost 2 pages out of 16, and
> when you have to re-read them it will be a clustered read.  It's just
> not that big a deal.

The difference is that with the use-once idea file D would
lose all 4 pages, while with the drop-behind idea every file
would lose one page it had already used.

> > With drop-behind we'll drop four pages we have already used,
> > without affecting the pages we are about to use (dcba).
>
> Well, yes, but you will also drop that header file your compiler wants
> to read over and over.  How do you tell the difference?  There are
> lots of nice things you can do if your algorithm can assume
> omniscience.

Agreed, this thing should be fixed. I'm sure we'll come
up with a way to get around this problem.

> My point is, even with the case you supplied the expected behaviour of
> the existing algorithm is acceptable.  There is no burning fire to put
> out, not here anyway.

True, it's just an issue of performance and heavily used
servers falling over under load, nothing as serious as
data corruption or system instability.

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 23:10           ` Rik van Riel
@ 2001-08-25  0:42             ` Daniel Phillips
  0 siblings, 0 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-25  0:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On August 25, 2001 01:10 am, Rik van Riel wrote:
> On Sat, 25 Aug 2001, Daniel Phillips wrote:
> > On August 24, 2001 09:02 pm, Rik van Riel wrote:
> 
> > > I guess in the long run we should have automatic collapse
> > > of the readahead window when we find that readahead window
> > > thrashing is going on,
> >
> > Yes, and the most effective way to detect that the readahead
> > window is too high is by keeping a history of recently evicted
> > pages.
> 
> I think it could be even easier. We simply count for each
> file how many pages we read-ahead and how many pages we
> read.
> 
> If the number of pages being read-ahead is really a lot
> higher than the pages being read, we know pages get evicted
> before we read them ==> we shrink the readahead window.
> 
> This simpler scheme should also be able to correctly set
> the readahead window for slower data streams to smaller
> than the readahead window size for faster reading data
> streams (which _do_ get to use more of their data before
> it is evicted again).

Yes, this is a good method.

--
Daniel


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 23:03           ` Rik van Riel
@ 2001-08-25  0:41             ` Daniel Phillips
  2001-08-25  1:34               ` Rik van Riel
  0 siblings, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-25  0:41 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On August 25, 2001 01:03 am, Rik van Riel wrote:
> On Fri, 24 Aug 2001, Daniel Phillips wrote:
> 
> > > Actually, no.  FIFO would be ok if you had ONE readahead
> > > stream going on, but when you have multiple readahead
> > > streams going on you want to evict the data each of the
> > > streams has already used, and not all the readahead data
> > > which happened to be read in first.
> >
> > We will be fine up until the point that the set of all readahead
> > fills the entire cache, then we will start dropping *some* of
> > the readahead.  This will degrade gracefully: if the set of
> > readahead is twice as large as cache then half the readahead
> > will be dropped.  We will drop the readahead in coherent chunks
> > so that it can be re-read in one disk seek.  This is not such
> > bad behaviour.
> 
> The problem is that it WON'T degrade gracefully. Suppose we
> have 5 readahead streams, A B C D and E, and we can store
> 4 readahead windows in RAM without problems. A page which has
> not yet been read is marked with a capital letter, a page
> which has already been read is marked with a small letter.
> 
> The queue looks like this, with new pages being added to the
> front and old pages being dropped off the right side:
> 	AAaaBBbbCCccDDdd
> 
> With the current use-once thing, we will end up dropping ALL
> pages from file D, even the ones we are about to use (DDdd).

I call that graceful.  Look, you only lost 2 pages out of 16, and when you 
have to re-read them it will be a clustered read.  It's just not that big a 
deal.

> With drop-behind we'll drop four pages we have already used,
> without affecting the pages we are about to use (dcba).

Well, yes, but you will also drop that header file your compiler wants to 
read over and over.  How do you tell the difference?  There are lots of nice 
things you can do if your algorithm can assume omniscience.

> > That said, I think I might be able to come up with something
> > that uses specific knowledge about readahead to squeeze a little
> > better performance out of your example case without breaking
> > loads that are already working pretty well.  It will require
> > another lru list - this is not something we want to do right
> > now, don't you agree?
> 
> Ummm, if you're still busy trying to come up with the idea,
> how do you already know your future idea will require an extra
> LRU list? ;)

Because it's still in the conceptual stage.  Point taken about the readahead, 
it wants to have a higher priority than used-once pages.  If marked in some 
way, the readahead pages could start on the active ring then be moved 
immediately to the inactive queue when first used, or after being fully aged 
if unused.  Write pages on the other hand want to start on the inactive list. 
With our current page cache factoring this is a bit of a pain to implement.

My point is, even with the case you supplied the expected behaviour of the 
existing algorithm is acceptable.  There is no burning fire to put out, not 
here anyway.

--
Daniel



* Re: [resent PATCH] Re: very slow parallel read performance
       [not found]           ` <200108242344.f7ONi0h21270@mailg.telia.com>
@ 2001-08-25  0:28             ` Lehmann 
  0 siblings, 0 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-25  0:28 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel

On Sat, Aug 25, 2001 at 01:39:35AM +0200, Roger Larsson <roger.larsson@skelleftea.mail.telia.com> wrote:
> Try to add only one - but one that surely reads from another file...

Unless two clients request the same file (possible but very rare), no
two threads will read from the same file (and even then, with 800 clients
this is a rare occasion).

> In the example it looked like only one was waiting for data.
> Then you added 15 more => 16 waiting for data...

Exactly, and when I add one more => 2 waiting for data. I admit I have no
idea what you are after ;) My problem is that I want much more speed out
of the elevator. Read-ahead cannot help in the general case, as the number
of clients will not decrease but increase, so at some point I will simply
have to live without any read-ahead to get any performance at all.

> But you should see the read ahead effect when going from 1 to 2 concurrently
> reading. You can actually see the effect when doing a diff with two files :-)

 1  1  1      0   3068  16780 210080   0   0  5492     0 4045  1809  43  37  20
 2  0  3      0   3060  16776 210096   0   0  6944     0 4151  1650  44  42  14
 2  1  1      0   3056  16796 210076   0   0  4072  2572 3875  1542  25  35  40
 1  1  1      0   3056  16796 210072   0   0  6460     0 3943  1801  36  37  27
 1  1  1      0   3064  16796 210072   0   0  6724     0 4134  1807  48  36  16
 2  1  1      0   3056  16796 210060   0   0  5868     0 4226  1894  55  34  12
 2  1  0      0   3056  16796 210056   0   0  7020     0 4229  1743  46  38  15
 1  1  0      0   3056  16872 209980   0   0  5220  2124 4260  1462  41  43  16
 2  1  2      0   3064  16868 209984   0   0  4608     0 3935  1647  44  48   8
 2  2  0      0   3060  16868 210000   0   0  5340     0 4178  1874  42  43  15
 2  1  2      0   3068  16840 210000   0   0  5724     0 4239  1803  59  40   1
# added one more thread
 3  1  4      0   3560  16808 210140   0   0  6616     0 4179  1550  50  36  14
 4  1  4      0   3056  16848 210576   0   0  5868  1384 4018  1653  41  37  22
 1  2  1      0   3056  16848 210708   0   0  4348   952 4048  1429  45  43  13
 3  2  0      0   3056  16848 210712   0   0  5968     0 4130  1740  44  42  14
 1  2  0      0   3056  16848 210700   0   0  5376     0 4251  1988  40  46  14
 1  2  2      0   3064  16836 210712   0   0  6704     0 4476  1701  41  40  18
 2  4  1      0   3056  16828 210724   0   0  5024   316 3919  1647  44  30  26
 1  2  1      0   3056  16828 210716   0   0  4124  1832 4048  1559  39  42  19

(As you can see, another process was running here.) One more thread does
not change much (server throughput goes down to about 19 MBit/s, but that
could be attributed to pure chance).

     0.000571 read(6, "\300\317\347\n", 4) = 4 <0.000272>
     0.000649 lseek(1368, 23993104, SEEK_SET) = 23993104 <0.000039>
     0.000354 read(1368, "\f\vv\247\25(\27\27\211B@\374\274{\360\n\22\201\361WvF"..., 65536) = 65536 <0.023478>
     0.024785 write(9, "\300\317\347\n", 4) = 4 <0.000282>
     0.000702 read(6, "\200\\\2\n", 4)  = 4 <0.000279>
     0.000697 lseek(1485, 33253196, SEEK_SET) = 33253196 <0.000040>
     0.007086 read(1485, "\276IY\315\213qM\16#\202 D\345\24\210>\205\231I(H\304 "..., 65536) = 65536 <0.051500>
     0.052311 write(9, "\200\\\2\n", 4) = 4 <0.000093>

Anyway, my server would be happy without any read-ahead (the socket +
userspace buffers already act as read-ahead), but I certainly need the
head-movement optimization from the elevator. And I don't see how I can
get this without issuing many reads in parallel.

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24  7:35 ` [resent PATCH] " Roger Larsson
  2001-08-24 17:43   ` Rik van Riel
  2001-08-24 19:42   ` Lehmann 
@ 2001-08-25  0:05   ` Craig I. Hagan
  2 siblings, 0 replies; 124+ messages in thread
From: Craig I. Hagan @ 2001-08-25  0:05 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Marc A. Lehmann, linux-kernel, oesi

> This is a patch I did it does also enable the profiling, the only needed
> line is the
> -#define MAX_READAHEAD  31
> +#define MAX_READAHEAD  511
> I have not tried to push it further up since this resulted in virtually
> equal total throughput for read two files than for read one.

I've used this on machines that I'm working with, as I'm doing mostly large
streaming reads, which seem to really benefit from a larger value, especially
with fibre-attached storage, which is poor at small IOs (<128k) :(.

That said, if you look at the -ac series, you'll note that both of those values
are exposed as /proc/sys/vm/min-readahead and /proc/sys/vm/max-readahead.

Here is a specific patch for 2.4.9.

Personally, I think that making this a dynamic tunable is the correct solution,
as it allows people to adjust things to match their system and workload.

-- craig


diff -ur linux-clean/drivers/md/md.c linux/drivers/md/md.c
--- linux-clean/drivers/md/md.c	Fri May 25 09:48:49 2001
+++ linux/drivers/md/md.c	Fri Jun 15 02:30:48 2001
@@ -3291,7 +3291,7 @@
 	/*
 	 * Tune reconstruction:
 	 */
-	window = MAX_READAHEAD*(PAGE_SIZE/512);
+	window = vm_max_readahead*(PAGE_SIZE/512);
 	printk(KERN_INFO "md: using %dk window, over a total of %d blocks.\n",window/2,max_sectors/2);

 	atomic_set(&mddev->recovery_active, 0);
diff -ur linux-clean/include/linux/blkdev.h linux/include/linux/blkdev.h
--- linux-clean/include/linux/blkdev.h	Fri May 25 18:01:40 2001
+++ linux/include/linux/blkdev.h	Fri Jun 15 02:23:22 2001
@@ -183,10 +183,6 @@

 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)

-/* read-ahead in pages.. */
-#define MAX_READAHEAD	31
-#define MIN_READAHEAD	3
-
 #define blkdev_entry_to_request(entry) list_entry((entry), struct request, queue)
 #define blkdev_entry_next_request(entry) blkdev_entry_to_request((entry)->next)
 #define blkdev_entry_prev_request(entry) blkdev_entry_to_request((entry)->prev)
diff -ur linux-clean/include/linux/mm.h linux/include/linux/mm.h
--- linux-clean/include/linux/mm.h	Fri Jun 15 02:20:24 2001
+++ linux/include/linux/mm.h	Fri Jun 15 02:26:12 2001
@@ -105,6 +105,10 @@
 #define VM_SequentialReadHint(v)	((v)->vm_flags & VM_SEQ_READ)
 #define VM_RandomReadHint(v)		((v)->vm_flags & VM_RAND_READ)

+/* read ahead limits */
+extern int vm_min_readahead;
+extern int vm_max_readahead;
+
 /*
  * mapping from the currently active vm_flags protection bits (the
  * low four bits) to a page protection mask..
diff -ur linux-clean/include/linux/raid/md_k.h linux/include/linux/raid/md_k.h
--- linux-clean/include/linux/raid/md_k.h	Sun May 20 12:11:39 2001
+++ linux/include/linux/raid/md_k.h	Fri Jun 15 02:31:24 2001
@@ -89,7 +89,7 @@
 /*
  * default readahead
  */
-#define MD_READAHEAD	MAX_READAHEAD
+#define MD_READAHEAD	vm_max_readahead

 static inline int disk_faulty(mdp_disk_t * d)
 {
diff -ur linux-clean/include/linux/sysctl.h linux/include/linux/sysctl.h
--- linux-clean/include/linux/sysctl.h	Fri May 25 18:01:27 2001
+++ linux/include/linux/sysctl.h	Fri Jun 15 02:24:33 2001
@@ -134,7 +134,9 @@
 	VM_PAGECACHE=7,		/* struct: Set cache memory thresholds */
 	VM_PAGERDAEMON=8,	/* struct: Control kswapd behaviour */
 	VM_PGT_CACHE=9,		/* struct: Set page table cache parameters */
-	VM_PAGE_CLUSTER=10	/* int: set number of pages to swap together */
+	VM_PAGE_CLUSTER=10,	/* int: set number of pages to swap together */
+        VM_MIN_READAHEAD=12,    /* Min file readahead */
+        VM_MAX_READAHEAD=13     /* Max file readahead */
 };


diff -ur linux-clean/kernel/sysctl.c linux/kernel/sysctl.c
--- linux-clean/kernel/sysctl.c	Thu Apr 12 12:20:31 2001
+++ linux/kernel/sysctl.c	Fri Jun 15 02:28:02 2001
@@ -270,6 +270,10 @@
 	 &pgt_cache_water, 2*sizeof(int), 0644, NULL, &proc_dointvec},
 	{VM_PAGE_CLUSTER, "page-cluster",
 	 &page_cluster, sizeof(int), 0644, NULL, &proc_dointvec},
+	{VM_MIN_READAHEAD, "min-readahead",
+	&vm_min_readahead,sizeof(int), 0644, NULL, &proc_dointvec},
+	{VM_MAX_READAHEAD, "max-readahead",
+	&vm_max_readahead,sizeof(int), 0644, NULL, &proc_dointvec},
 	{0}
 };

Only in linux/kernel: sysctl.c~
diff -ur linux-clean/mm/filemap.c linux/mm/filemap.c
--- linux-clean/mm/filemap.c	Thu Aug 16 13:12:07 2001
+++ linux/mm/filemap.c	Fri Aug 24 14:08:20 2001
@@ -45,6 +45,12 @@
 unsigned int page_hash_bits;
 struct page **page_hash_table;

+int vm_max_readahead = 31;
+int vm_min_readahead = 3;
+EXPORT_SYMBOL(vm_max_readahead);
+EXPORT_SYMBOL(vm_min_readahead);
+
+
 spinlock_t __cacheline_aligned pagecache_lock = SPIN_LOCK_UNLOCKED;
 /*
  * NOTE: to avoid deadlocking you must never acquire the pagecache_lock with
@@ -870,7 +876,7 @@
 static inline int get_max_readahead(struct inode * inode)
 {
 	if (!inode->i_dev || !max_readahead[MAJOR(inode->i_dev)])
-		return MAX_READAHEAD;
+		return vm_max_readahead;
 	return max_readahead[MAJOR(inode->i_dev)][MINOR(inode->i_dev)];
 }

@@ -1044,8 +1050,8 @@
 		if (filp->f_ramax < needed)
 			filp->f_ramax = needed;

-		if (reada_ok && filp->f_ramax < MIN_READAHEAD)
-				filp->f_ramax = MIN_READAHEAD;
+		if (reada_ok && filp->f_ramax < vm_min_readahead)
+				filp->f_ramax = vm_min_readahead;
 		if (filp->f_ramax > max_readahead)
 			filp->f_ramax = max_readahead;
 	}
--- linux-clean/drivers/ide/ide-probe.c	Sun Mar 18 09:25:02 2001
+++ linux/drivers/ide/ide-probe.c	Fri Jun 15 03:09:49 2001
@@ -779,7 +779,7 @@
 		/* IDE can do up to 128K per request. */
 		*max_sect++ = 255;
 #endif
-		*max_ra++ = MAX_READAHEAD;
+		*max_ra++ = vm_max_readahead;
 	}

 	for (unit = 0; unit < units; ++unit)
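
As a usage illustration for the tunables the patch introduces (a sketch
only, assuming a kernel carrying this patch or an -ac kernel; the value is
a page count and writing it requires root):

#include <stdio.h>
#include <stdlib.h>

static const char *path = "/proc/sys/vm/max-readahead";

int main(int argc, char **argv)
{
	FILE *f = fopen(path, argc > 1 ? "w" : "r");
	int pages;

	if (!f) {
		perror(path);
		return 1;
	}
	if (argc > 1) {
		/* set: pass the new page count, e.g. 511 for big streaming reads */
		fprintf(f, "%d\n", atoi(argv[1]));
	} else if (fscanf(f, "%d", &pages) == 1) {
		/* get: print the current window limit */
		printf("max-readahead = %d pages (%d KB with 4 KB pages)\n",
		       pages, pages * 4);
	}
	fclose(f);
	return 0;
}

The same could of course be done with a plain echo into the proc file; the
point is only that the limit can be changed at run time instead of by
recompiling blkdev.h.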




* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 20:19       ` Rik van Riel
  2001-08-24 21:11         ` Daniel Phillips
@ 2001-08-24 23:23         ` Lehmann 
       [not found]           ` <200108242344.f7ONi0h21270@mailg.telia.com>
                             ` (2 more replies)
  1 sibling, 3 replies; 124+ messages in thread
From: Lehmann  @ 2001-08-24 23:23 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, Roger Larsson, linux-kernel

On Fri, Aug 24, 2001 at 05:19:07PM -0300, Rik van Riel <riel@conectiva.com.br> wrote:
> Actually, no.  FIFO would be ok if you had ONE readahead
> stream going on, but when you have multiple readahead

Do we all agree that read-ahead is actually the problem? ATM, I serve ~800
files, read()ing them in turn. When I increase the number of threads I
have more reads at the same time in the kernel, but the absolute number of
read() requests decreases.

So if read-ahead is the problem, then there must be some interdependency
between read-aheads for serial requests and read-aheads for concurrent
requests.

For example, I could imagine that read() executes, returns to
userspace, and at the same time the kernel thinks "nothing to do, let's
read ahead", while in the concurrent case there is hardly a time when no
read() is running. But read-ahead does not seem to work that way.

I usually have around 200MB of free memory, which leaves ~200k/handle
(enough for read-ahead). Let's see what happens with vmstat. This is after I
kill -STOP httpd:

 0  2  0      0   4140  15448 185972   0   0     0   264 1328    88   1   3  96
 0  2  0      0   4136  15448 185976   0   0     0     0  793    62   0   3  97
 0  2  0      0   4136  15448 185976   0   0     0     0  538    62   0   1  98
 0  2  0      0   4128  15448 185984   0   0     0     0  382    62   1   1  98
 0  2  0      0   4128  15448 185984   0   0     0     0  312    66   0   1  99
 1  1  0      0   3056  15448 185088   0   0    48   132 2465   321  24  28  48
 2  1  1      0   3056  15448 186392   0   0  4704     0 3017   925  11  21  68
 1  1  2      0   3804  15440 185516   0   0  4708     0 3909  1196  28  38  34
 1  1  0      0   3056  15440 186212   0   0  5392     0 4004  1579  21  32  47
 1  1  1      0   3064  15436 186220   0   0  6668     0 3676  1273  19  42  39
 0  1  1      0   3056  15468 186168   0   0  4488  1424 3889  1342  16  34  50
 0  1  2      0   3056  15468 186116   0   0  3372     0 3854  1525  20  34  46
 1  1  1      0   3060  15468 186084   0   0  4096     0 4088  1641  21  37  41
 0  1  2      0   3056  15468 186044   0   0  6744     0 3679  1415  22  33  45
 1  1  1      0   3072  15468 186040   0   0  4700     0 3713  1429  19  32  50

At some point I killall -CONT myhttpd (it seems more than 2mb/s is
being read, though, although I only serve about 2.3mb/s). Now let's see
what happens when I dynamically add 15 reader threads:

  procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  1  1      0   3056  14636 193104   0   0  5248     0 3531  1428  11  35  53
 0  1  2      0   3056  14604 193072   0   0  5936     0 3730  1416  10  33  57
 0  1  0      0   3056  14604 193068   0   0  5628     0 4025  1504  16  34  50
# added 15 threads
 0 16  0      0   3060  14580 192952   0   0  5412     0 3394  1010  12  31  57
 1 16  1      0   3060  14580 192944   0   0  2580   268 2253   636   8  23  69
 0 16  0      0   3056  14580 192944   0   0   380   852 1570   488   5  13  82
 0 16  0      0   3064  14580 192880   0   0  2652     0 1470   583   6  14  80
 0 16  0      0   3056  14560 192888   0   0  2944     0 2068   693   5  23  72
 0 16  0      0   3056  14560 192888   0   0   544     0 1732   550   5  15  80
 1 16  0      0   3056  14560 192884   0   0  2892     0 1513   741   4  14  82
 0 16  1      0   3056  14560 192884   0   0  2236   552 1894   742   5  17  77
 1 16  0      0   3056  14560 192880   0   0  1100   128 1699   604   4  15  81

Wow, blocks in decreases a lot. Now let's remove the extra threads again:

 0 16  1      0   3056  14552 193720   0   0  4944   660 1463   721   3  20  77
 0 16  0      0   3056  14552 193648   0   0  1756   136 1451   726   4  18  79
 0 16  0      0   3056  14552 193588   0   0   440     0 1221   565   3  13  84
 0 16  0      0   3056  14552 193584   0   0  3308     0 1278   632   4   9  88
 1 16  0      0   3056  14536 193608   0   0  3040     0 2469  1168   7  21  72
 0 16  0      0   3056  14536 193608   0   0  2320     0 1844   730   5  15  80
 1 16  1      0   3056  14536 193612   0   0  1660   596 1557   559  12  24  64
# here the server starts to reap threads. this is a slow and inefficient process
# that takes very long and blocks the server.
 1 16  0      0   3056  14536 193612   0   0  2188   164  831   440  25  30  45
 1 16  0      0   3056  14536 193612   0   0  2324     0  506   329  23  30  47
 1 16  0      0   3056  14536 193612   0   0  1452     0  460   401  24  30  46
# many similar lines snipped
 1 16  0      0   3056  14516 193692   0   0  3932     0  510   621  20  38  42
 1 16  0      0   3056  14516 193692   0   0  1744     0  338   369  23  31  46
 1 16  0      0   3056  14568 193872   0   0  1292   476  383   392  20  32  48
 2  0  2      0   3056  14616 175104   0   0  5748     0 3670  1342  24  37  39
 2  1  1      0   3560  14616 174708   0   0  5028     0 3539   989  22  43  35
 0  1  0      0  93948  14660 175764   0   0  1604     0 3341   667  10  22  68
 1  0  0      0  92244  14844 176212   0   0   524     0 3240   424  12  12  76
 0  1  0      0  90532  15212 176404   0   0   200  1524 3308   426  16  14  71
 0  1  1      0  84600  15212 179096   0   0  2712     0 2710   669  26  11  63
 0  1  2      0  77768  15212 183132   0   0  4012     0 3041   889  19  18  63
 0  1  0      0  68724  15212 189440   0   0  6284     0 3110   998  21  21  58
 1  1  0      0  58892  15212 195984   0   0  6528     0 3149   975  28  25  47
 2  1  0      0  50636  15248 201316   0   0  5316  1368 3321   968  20  28  52
 1  1  0      0  38520  15248 210004   0   0  8664     0 3250   910  28  26  46
 0  1  1      0  28100  15248 218520   0   0  8508     0 3186   777  20  28  52
 1  1  1      0  18848  15248 227028   0   0  8500     0 3090   704  15  26  59
 0  1  0      0  10752  15248 233752   0   0  6732     0 3223   774  20  27  53

Back to 2.1mb/s served, but reading much more from disk.

This certainly looks like overzealous read-ahead, but I should have the
memory available for read-ahead. So is it "just" the use-once optimization
that throws away read-ahead pages? If so, then why do I see exactly
the same performance under 2.4.5pre4, which didn't have (AFAIK) the
use-once optimization?

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 20:37         ` Gérard Roudier
@ 2001-08-24 23:12           ` Rik van Riel
  2001-08-25  8:02             ` Gérard Roudier
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-24 23:12 UTC (permalink / raw)
  To: Gérard Roudier; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Fri, 24 Aug 2001, Gérard Roudier wrote:

> The larger the read-ahead chunks, the more likely trashing will
> occur. In my opinion, using more than 128 K IO chunks will not
> improve performances with modern hard disks connected to a
> reasonnably fast controller, but will increase memory pressure
> and probably thrashing.

Your opinion seems to differ from actual measurements
made by Roger Larsson and other people.

But yes, increasing the readahead window also increases
the chance of readahead window thrashing. Luckily we can
detect fairly easily if this is happening and use that
to automatically shrink the readahead window...

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 22:29         ` Daniel Phillips
@ 2001-08-24 23:10           ` Rik van Riel
  2001-08-25  0:42             ` Daniel Phillips
  2001-08-27  7:08           ` Helge Hafting
  1 sibling, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-24 23:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Sat, 25 Aug 2001, Daniel Phillips wrote:
> On August 24, 2001 09:02 pm, Rik van Riel wrote:

> > I guess in the long run we should have automatic collapse
> > of the readahead window when we find that readahead window
> > thrashing is going on,
>
> Yes, and the most effective way to detect that the readahead
> window is too high is by keeping a history of recently evicted
> pages.

I think it could be even easier. We simply count for each
file how many pages we read-ahead and how many pages we
read.

If the number of pages being read-ahead is really a lot
higher than the pages being read, we know pages get evicted
before we read them ==> we shrink the readahead window.

This simpler scheme should also be able to correctly set
the readahead window for slower data streams to smaller
than the readahead window size for faster reading data
streams (which _do_ get to use more of their data before
it is evicted again).
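
As an illustration of this counting scheme (a sketch only, with arbitrary
placeholder constants; not actual kernel code), the bookkeeping could look
roughly like this, shrinking the window when most readahead pages were
apparently evicted before being read and letting it grow back otherwise:

#include <stdio.h>

struct ra_state {
	unsigned long ra_pages;		/* pages submitted by readahead */
	unsigned long read_pages;	/* pages actually read by the app */
	unsigned int window;		/* current readahead window, in pages */
};

#define RA_MIN		3
#define RA_MAX		31
#define RA_WASTE	2	/* "really a lot higher": assume 2x here */

static void ra_adjust(struct ra_state *ra)
{
	if (ra->ra_pages > RA_WASTE * ra->read_pages) {
		/* most readahead never got used: assume it was evicted */
		ra->window = ra->window / 2 >= RA_MIN ? ra->window / 2 : RA_MIN;
	} else if (ra->window * 2 <= RA_MAX) {
		/* readahead is being consumed: let the window grow back */
		ra->window *= 2;
	}
	ra->ra_pages = ra->read_pages = 0;	/* start a new sample period */
}

int main(void)
{
	struct ra_state ra = { 120, 20, 31 };	/* far more readahead than reads */

	ra_adjust(&ra);
	printf("window shrunk to %u pages\n", ra.window);	/* prints 15 */
	return 0;
}

A slow reader keeps tripping the first branch and settles at a small window,
while a fast reader that consumes its readahead keeps the large one, which
is exactly the per-stream behaviour described above.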

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 21:11         ` Daniel Phillips
@ 2001-08-24 23:03           ` Rik van Riel
  2001-08-25  0:41             ` Daniel Phillips
  0 siblings, 1 reply; 124+ messages in thread
From: Rik van Riel @ 2001-08-24 23:03 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Fri, 24 Aug 2001, Daniel Phillips wrote:

> > Actually, no.  FIFO would be ok if you had ONE readahead
> > stream going on, but when you have multiple readahead
> > streams going on you want to evict the data each of the
> > streams has already used, and not all the readahead data
> > which happened to be read in first.
>
> We will be fine up until the point that the set of all readahead
> fills the entire cache, then we will start dropping *some* of
> the readahead.  This will degrade gracefully: if the set of
> readahead is twice as large as cache then half the readahead
> will be dropped.  We will drop the readahead in coherent chunks
> so that it can be re-read in one disk seek.  This is not such
> bad behaviour.

The problem is that it WON'T degrade gracefully. Suppose we
have 5 readahead streams, A B C D and E, and we can store
4 readahead windows in RAM without problems. A page which has
not yet been read is marked with a capital letter, a page
which has already been read is marked with a small letter.

The queue looks like this, with new pages being added to the
front and old pages being dropped off the right side:
	AAaaBBbbCCccDDdd

With the current use-once thing, we will end up dropping ALL
pages from file D, even the ones we are about to use (DDdd).

With drop-behind we'll drop four pages we have already used,
without affecting the pages we are about to use (dcba).
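
To make the two policies concrete, here is a toy model of exactly this
16-page queue (an illustration only; which used page each stream gives up
follows the "dcba" above, not the real kernel lists):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char queue[] = "AAaaBBbbCCccDDdd";	/* front ... back */
	const int need = 4;				/* stream E needs 4 slots */
	int n = (int)strlen(queue);
	char dropped[8] = "";
	int i;

	/* use-once behaves like plain FIFO here: evict from the back */
	strncat(dropped, queue + n - need, need);
	printf("use-once drops:    %s\n", dropped);	/* DDdd: unread DD lost */

	/* drop-behind: each stream gives up a page it has already read */
	dropped[0] = '\0';
	for (i = n - 1; i >= 0; i--)
		if (islower((unsigned char)queue[i]) && !strchr(dropped, queue[i]))
			strncat(dropped, &queue[i], 1);
	printf("drop-behind drops: %s\n", dropped);	/* dcba: only used pages */
	return 0;
}

With FIFO eviction the unread DD pages have to be fetched from disk a second
time; with drop-behind only pages that were already consumed are lost.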


> That said, I think I might be able to come up with something
> that uses specific knowledge about readahead to squeeze a little
> better performance out of your example case without breaking
> loads that are already working pretty well.  It will require
> another lru list - this is not something we want to do right
> now, don't you agree?

Ummm, if you're still busy trying to come up with the idea,
how do you already know your future idea will require an extra
LRU list? ;)

cheers,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 19:02       ` Rik van Riel
  2001-08-24 20:37         ` Gérard Roudier
@ 2001-08-24 22:29         ` Daniel Phillips
  2001-08-24 23:10           ` Rik van Riel
  2001-08-27  7:08           ` Helge Hafting
  1 sibling, 2 replies; 124+ messages in thread
From: Daniel Phillips @ 2001-08-24 22:29 UTC (permalink / raw)
  To: Rik van Riel, Roger Larsson; +Cc: Marc A. Lehmann, linux-kernel, oesi

On August 24, 2001 09:02 pm, Rik van Riel wrote:
> On Fri, 24 Aug 2001, Roger Larsson wrote:
> 
> > Not having the patch gives you another effect - disk arm is
> > moving from track to track in a furiously tempo...
> 
> Fully agreed, but remember that when you reach the point
> where the readahead windows are pushing each other out
> you'll be off even worse.
> 
> I guess in the long run we should have automatic collapse
> of the readahead window when we find that readahead window
> thrashing is going on, in the short term I think it is
> enough to have the maximum readahead size tunable in /proc,
> like what is happening in the -ac kernels.

Yes, and the most effective way to detect that the readahead window is too 
high is by keeping a history of recently evicted pages.  When we find 
ourselves re-reading pages that were evicted before ever being used we know 
exactly what the problem is.
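
Sketched very roughly (illustrative userspace code, not a kernel patch),
such a history could be little more than a small ring of page indices that
were evicted while still unread, consulted when a read misses the cache:

#include <stdio.h>
#include <string.h>

#define HISTORY 64

struct evict_history {
	unsigned long index[HISTORY];	/* page indices evicted before use */
	unsigned int head;
};

static void remember_evicted_unread(struct evict_history *h, unsigned long idx)
{
	h->index[h->head++ % HISTORY] = idx;	/* overwrite the oldest entry */
}

/* nonzero if we are re-reading a page that was thrown away unused */
static int was_evicted_unread(const struct evict_history *h, unsigned long idx)
{
	unsigned int i;

	for (i = 0; i < HISTORY; i++)
		if (h->index[i] == idx)
			return 1;
	return 0;
}

int main(void)
{
	struct evict_history h;

	memset(&h, 0, sizeof(h));	/* real code would mark empty slots */
	remember_evicted_unread(&h, 1234);	/* readahead page dropped unused */
	if (was_evicted_unread(&h, 1234))	/* ...and now wanted again */
		printf("readahead window too large: shrink it\n");
	return 0;
}

A hit in this history is precisely the "re-reading pages that were evicted
before ever being used" signal, and shrinking the window on such hits would
address it directly.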

--
Daniel


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 19:42   ` Lehmann 
@ 2001-08-24 21:42     ` Gérard Roudier
  0 siblings, 0 replies; 124+ messages in thread
From: Gérard Roudier @ 2001-08-24 21:42 UTC (permalink / raw)
  To: Marc; +Cc: Roger Larsson, linux-kernel



On Fri, 24 Aug 2001, Marc wrote:

> On Fri, Aug 24, 2001 at 09:35:08AM +0200, Roger Larsson <roger.larsson@skelleftea.mail.telia.com> wrote:
> > And I found out that read ahead was too short for modern disks.
>
> That could well be, the problem in my case is that, with up to 1000
> clients, I fear that there might not be enough memory for effective
> read-ahead (and I think read-ahead would be counter-productive even).

It depends on the amount of memory.
If asynchronous read-ahead is working for 1000 sequential IO streams on
1000 different files with MAX_READAHEAD=31, your system needs about:

(32+32) * 1 PAGE * 1000 = 256 MB

just for read-ahead data.

> > line is the
> > -#define MAX_READAHEAD  31
> > +#define MAX_READAHEAD  511
>
> I plan to try this, however, read-ahead should IMHO be zero anyway, since
> there simply is ot enough memory, and the kernel should not do much
> read-ahead when many other requests are outstanding.

I do not recommend trying this value, even if the read-ahead code may
be smart enough to detect thrashing and fall back to a more reasonable
average value.

> The real problem., however, is that read performance sinks so much when many
> readers run in parallel.
>
> What I need is many parallel reads because this helps the elevator scan the
> disk once and not jump around widely)
>
> (I have 512MB memory around 64k socket send buffer and use an additional
> 96k buffer currently for reading from disk, so effectively i do my own
> read-ahead anyway. IU just need to optimize the head movements).

The asynchronous read-ahead code tries to eliminate IO latency by starting
the next IO in advance. This is probably not useful for the situation you
describe. Optimizing the head movements is indeed the smartest thing to
try given the IO pattern you describe.

In my opinion, your system is probably thrashing a lot:

Buffers: (256K + 64K + 96K) * 1000 = 416 MB. Code and various data
(notably kernel socket data): probably far more than 100 MB.
Total: greater than 512 MB.
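
Spelled out (with 4 KB pages and the buffer sizes quoted in this thread;
the small differences from the figures above are just MB rounding):

#include <stdio.h>

int main(void)
{
	const unsigned long page = 4 * 1024;
	const unsigned long clients = 1000;

	/* async readahead: current window + next window, 32 pages each */
	unsigned long ra_per_client = (32 + 32) * page;		/* 256 KB */

	/* per client: 64 KB socket send buffer + 96 KB userspace buffer */
	unsigned long buf_per_client = 64 * 1024 + 96 * 1024;

	printf("readahead alone:     %lu MB\n",
	       ra_per_client * clients >> 20);			/* 250 */
	printf("readahead + buffers: %lu MB\n",
	       (ra_per_client + buf_per_client) * clients >> 20);	/* 406 */
	return 0;
}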

Maybe you should either:

- Use a smaller number of clients.

or

- Increase memory size up to 1 GB, for example.

or

- Use smaller buffers, for example:
    MAX_READAHEAD=15
    32K file buffer
    32K socket buffer

--

Regards,
  Gérard.




* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 20:19       ` Rik van Riel
@ 2001-08-24 21:11         ` Daniel Phillips
  2001-08-24 23:03           ` Rik van Riel
  2001-08-24 23:23         ` Lehmann 
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-24 21:11 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On August 24, 2001 10:19 pm, Rik van Riel wrote:
> On Fri, 24 Aug 2001, Daniel Phillips wrote:
> > On August 24, 2001 07:43 pm, Rik van Riel wrote:
> 
> > > 1) under memory pressure, the inactive_dirty list is
> > >    only as large as 1 second of pageout IO, meaning
> > 		      ^^^^^^^^
> > This is the problem.  In the absense of competition and truly
> > active pages, the inactive queue should just grow until it is
> > much larger than the active ring.  Then the replacement policy
> > will naturally become fifo, which is exactly what you want in
> > your example.
> 
> Actually, no.  FIFO would be ok if you had ONE readahead
> stream going on, but when you have multiple readahead
> streams going on you want to evict the data each of the
> streams has already used, and not all the readahead data
> which happened to be read in first.

We will be fine up until the point that the set of all readahead fills the 
entire cache, then we will start dropping *some* of the readahead.  This will 
degrade gracefully: if the set of readahead is twice as large as cache then 
half the readahead will be dropped.  We will drop the readahead in coherent 
chunks so that it can be re-read in one disk seek.  This is not such bad 
behaviour.

All this assuming you don't enforce the 1 second size limit on the inactive 
queue, of course.

We probably could squeeze a little better performance out of this case by 
magically knowing that no input page will ever be reused, as you suggest.  
We risk getting such an improvement at the expensive of other, more typical 
loads.

That said, I think I might be able to come up with something that uses 
specific knowledge about readahead to squeeze a little better performance out 
of your example case without breaking loads that are already working pretty 
well.  It will require another lru list - this is not something we want to do 
right now, don't you agree?

--
Daniel


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 19:02       ` Rik van Riel
@ 2001-08-24 20:37         ` Gérard Roudier
  2001-08-24 23:12           ` Rik van Riel
  2001-08-24 22:29         ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Gérard Roudier @ 2001-08-24 20:37 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi



On Fri, 24 Aug 2001, Rik van Riel wrote:

> On Fri, 24 Aug 2001, Roger Larsson wrote:
>
> > Not having the patch gives you another effect - disk arm is
> > moving from track to track in a furiously tempo...
>
> Fully agreed, but remember that when you reach the point
> where the readahead windows are pushing each other out
> you'll be off even worse.
>
> I guess in the long run we should have automatic collapse
> of the readahead window when we find that readahead window
> thrashing is going on, in the short term I think it is
> enough to have the maximum readahead size tunable in /proc,
> like what is happening in the -ac kernels.

There are some minimal heuristics in the readahead code that try to
reduce the windows if things do not work as expected. I do not remember
exactly how they work; maybe I should re-read the code. What the code
wants is to perform *asynchronous* read-ahead in order to eliminate IO
latency for sequential IO streaming patterns.

For hard disks connected to a fast HBA, large read-ahead is not necessary.
For example, a Cheetah drive can handle more than 8000 short IOs per
second and a SYMBIOS chip using the SYM53C8XX driver can handle more than
15000 short IOs per second.

The larger the read-ahead chunks, the more likely thrashing will occur. In
my opinion, using more than 128 K IO chunks will not improve performance
with modern hard disks connected to a reasonably fast controller, but
will increase memory pressure and probably thrashing.

It is a different story for some external RAID boxes that may perform
better using huge IO chunks. Such boxes have poor firmware, in my opinion.

Some tunability of the read-ahead algorithm will be useful for sure, but
not that much, in my opinion. I am under the impression that, using some
MAX_READAHEAD value in the range 64K-128K and given the current
heuristics in the code, a subsystem using hard disks connected to a decent
controller will perform close to the best for most IO patterns. This leads
me to think that there is no miracle to expect from tunability of the
read-ahead. (Except if stupid external RAID boxes are involved,
obviously.)

Regards,
  Gérard.



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 20:18     ` Daniel Phillips
@ 2001-08-24 20:19       ` Rik van Riel
  2001-08-24 21:11         ` Daniel Phillips
  2001-08-24 23:23         ` Lehmann 
  0 siblings, 2 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-24 20:19 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Roger Larsson, Marc A. Lehmann, linux-kernel, oesi

On Fri, 24 Aug 2001, Daniel Phillips wrote:
> On August 24, 2001 07:43 pm, Rik van Riel wrote:

> > 1) under memory pressure, the inactive_dirty list is
> >    only as large as 1 second of pageout IO, meaning
> 		      ^^^^^^^^
> This is the problem.  In the absense of competition and truly
> active pages, the inactive queue should just grow until it is
> much larger than the active ring.  Then the replacement policy
> will naturally become fifo, which is exactly what you want in
> your example.

Actually, no.  FIFO would be ok if you had ONE readahead
stream going on, but when you have multiple readahead
streams going on you want to evict the data each of the
streams has already used, and not all the readahead data
which happened to be read in first.

> Anyway, this is a theoretical problem, we haven't seen it in the
> wild yet, or a test load that demonstrates it.

I've seen it in the wild, have given you a test load and
have shown you the arithmetic explaining what's going on.

How long will you continue ignoring things which aren't
convenient to your idea of the world ?

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 17:43   ` Rik van Riel
  2001-08-24 18:28     ` Roger Larsson
@ 2001-08-24 20:18     ` Daniel Phillips
  2001-08-24 20:19       ` Rik van Riel
  1 sibling, 1 reply; 124+ messages in thread
From: Daniel Phillips @ 2001-08-24 20:18 UTC (permalink / raw)
  To: Rik van Riel, Roger Larsson; +Cc: Marc A. Lehmann, linux-kernel, oesi

On August 24, 2001 07:43 pm, Rik van Riel wrote:
> On Fri, 24 Aug 2001, Roger Larsson wrote:
> 
> > I earlier questioned this too...
> > And I found out that read ahead was too short for modern disks.
> > This is a patch I did it does also enable the profiling, the only needed
> > line is the
> > -#define MAX_READAHEAD  31
> > +#define MAX_READAHEAD  511
> > I have not tried to push it further up since this resulted in virtually
> > equal total throughput for read two files than for read one.
> 
> Note that this can have HORRIBLE effects if the total
> size of all the readahead windows combined doesn't fit
> in your memory.
> 
> If you have 100 IO streams going on and you have space
> for 50 of them, you'll find yourself with 100 threads
> continuously pushing each other's read-ahead data out
> of RAM.
> 
> 100 threads may sound much, but 100 clients really isn't
> that special for an ftp server...
> 
> This effect is made a lot worse with the use-once
> strategy used in recent Linus kernels because:
> 
> 1) under memory pressure, the inactive_dirty list is
>    only as large as 1 second of pageout IO, meaning
		      ^^^^^^^^
This is the problem.  In the absence of competition and truly active pages, 
the inactive queue should just grow until it is much larger than the active 
ring.  Then the replacement policy will naturally become FIFO, which is 
exactly what you want in your example.

Anyway, this is a theoretical problem; we haven't seen it in the wild yet, 
nor a test load that demonstrates it.

>    the sum of the readahead windows is smaller than
>    with a kernel which doesn't do the use-once thing
>    (eg. Alan's kernel)
> 
> 2) the drop-behind strategy makes it much more likely
>    that we'll replace the data we already used, instead
>    of the read-ahead data we haven't used yet ... this
>    means the data we are about to use has a better chance
>    to be in memory

--
Daniel


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24  7:35 ` [resent PATCH] " Roger Larsson
  2001-08-24 17:43   ` Rik van Riel
@ 2001-08-24 19:42   ` Lehmann 
  2001-08-24 21:42     ` Gérard Roudier
  2001-08-25  0:05   ` Craig I. Hagan
  2 siblings, 1 reply; 124+ messages in thread
From: Lehmann  @ 2001-08-24 19:42 UTC (permalink / raw)
  To: Roger Larsson; +Cc: linux-kernel

On Fri, Aug 24, 2001 at 09:35:08AM +0200, Roger Larsson <roger.larsson@skelleftea.mail.telia.com> wrote:
> And I found out that read ahead was too short for modern disks.

That could well be; the problem in my case is that, with up to 1000
clients, I fear that there might not be enough memory for effective
read-ahead (and I think read-ahead would even be counter-productive).

> line is the  
> -#define MAX_READAHEAD  31
> +#define MAX_READAHEAD  511

I plan to try this; however, read-ahead should IMHO be zero anyway, since
there simply is not enough memory, and the kernel should not do much
read-ahead when many other requests are outstanding.

The real problem, however, is that read performance sinks so much when many
readers run in parallel.

What I need is many parallel reads, because this helps the elevator scan the
disk once and not jump around widely.

(I have 512MB memory, around 64k socket send buffer, and use an additional
96k buffer currently for reading from disk, so effectively I do my own
read-ahead anyway. I just need to optimize the head movements.)

-- 
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg@goof.com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 18:28     ` Roger Larsson
@ 2001-08-24 19:02       ` Rik van Riel
  2001-08-24 20:37         ` Gérard Roudier
  2001-08-24 22:29         ` Daniel Phillips
  0 siblings, 2 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-24 19:02 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Marc A. Lehmann, linux-kernel, oesi

On Fri, 24 Aug 2001, Roger Larsson wrote:

> Not having the patch gives you another effect - disk arm is
> moving from track to track in a furiously tempo...

Fully agreed, but remember that when you reach the point
where the readahead windows are pushing each other out
you'll be off even worse.

I guess in the long run we should have automatic collapse
of the readahead window when we find that readahead window
thrashing is going on, in the short term I think it is
enough to have the maximum readahead size tunable in /proc,
like what is happening in the -ac kernels.

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24 17:43   ` Rik van Riel
@ 2001-08-24 18:28     ` Roger Larsson
  2001-08-24 19:02       ` Rik van Riel
  2001-08-24 20:18     ` Daniel Phillips
  1 sibling, 1 reply; 124+ messages in thread
From: Roger Larsson @ 2001-08-24 18:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Marc A. Lehmann, linux-kernel, oesi

On Friday, 24 August 2001 19:43, Rik van Riel wrote:
> On Fri, 24 Aug 2001, Roger Larsson wrote:
> > I earlier questioned this too...
> > And I found out that read ahead was too short for modern disks.
> > This is a patch I did it does also enable the profiling, the only needed
> > line is the
> > -#define MAX_READAHEAD  31
> > +#define MAX_READAHEAD  511
> > I have not tried to push it further up since this resulted in virtually
> > equal total throughput for read two files than for read one.
>
> Note that this can have HORRIBLE effects if the total
> size of all the readahead windows combined doesn't fit
> in your memory.
>
> If you have 100 IO streams going on and you have space
> for 50 of them, you'll find yourself with 100 threads
> continuously pushing each other's read-ahead data out
> of RAM.

Not having the patch gives you another effect - the disk arm is
moving from track to track at a furious tempo...
The time it takes to move is longer than the time it is allowed
to read - this is not effective! That would limit throughput
to half of what is possible. If fewer than 511 - 31 pages are thrown
away, you probably gain anyway...

One optimization to do would be to stop readahead at file
fragments.

But READA pages are special since they might never be read!
Streaming puts interesting kinds of pressure on the VM...

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden


* Re: [resent PATCH] Re: very slow parallel read performance
  2001-08-24  7:35 ` [resent PATCH] " Roger Larsson
@ 2001-08-24 17:43   ` Rik van Riel
  2001-08-24 18:28     ` Roger Larsson
  2001-08-24 20:18     ` Daniel Phillips
  2001-08-24 19:42   ` Lehmann 
  2001-08-25  0:05   ` Craig I. Hagan
  2 siblings, 2 replies; 124+ messages in thread
From: Rik van Riel @ 2001-08-24 17:43 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Marc A. Lehmann, linux-kernel, oesi

On Fri, 24 Aug 2001, Roger Larsson wrote:

> I earlier questioned this too...
> And I found out that read ahead was too short for modern disks.
> This is a patch I did it does also enable the profiling, the only needed
> line is the
> -#define MAX_READAHEAD  31
> +#define MAX_READAHEAD  511
> I have not tried to push it further up since this resulted in virtually
> equal total throughput for read two files than for read one.

Note that this can have HORRIBLE effects if the total
size of all the readahead windows combined doesn't fit
in your memory.

If you have 100 IO streams going on and you have space
for 50 of them, you'll find yourself with 100 threads
continuously pushing each other's read-ahead data out
of RAM.

100 threads may sound much, but 100 clients really isn't
that special for an ftp server...

This effect is made a lot worse with the use-once
strategy used in recent Linus kernels because:

1) under memory pressure, the inactive_dirty list is
   only as large as 1 second of pageout IO, meaning
   the sum of the readahead windows is smaller than
   with a kernel which doesn't do the use-once thing
   (eg. Alan's kernel)

2) the drop-behind strategy makes it much more likely
   that we'll replace the data we already used, instead
   of the read-ahead data we haven't used yet ... this
   means the data we are about to use has a better chance
   to be in memory

regards,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* [resent PATCH] Re: very slow parallel read performance
  2001-08-23 21:35 Lehmann 
@ 2001-08-24  7:35 ` Roger Larsson
  2001-08-24 17:43   ` Rik van Riel
                     ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Roger Larsson @ 2001-08-24  7:35 UTC (permalink / raw)
  To:  Marc A. Lehmann , linux-kernel; +Cc: oesi

On Thursday, 23 August 2001 23:35, pcg@goof.com (Marc A. Lehmann) wrote:
> I tested the following under linux-2.4.8-ac8, linux-2.4.8pre4 and
> 2.4.5pre4, all had similar behaviour.
>
> I have written a webserver that serves many large files, and thus, the
> disks are the bottleneck. To get around the problem of blocking reads
> (this killed thttpd's performance totally, for example) I can start one or
> more reader threads. And strace of them under load looks like this:
>

I earlier questioned this too...
And I found out that read-ahead was too short for modern disks.
This is a patch I did; it also enables the profiling. The only needed
line is the
-#define MAX_READAHEAD  31
+#define MAX_READAHEAD  511
I have not tried to push it further up, since this resulted in virtually
equal total throughput for reading two files as for reading one.


The limit can be altered per disk, can't it?
I have read about 127 being the current max limit...

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden


*******************************************
Patch prepared by: roger.larsson@norran.net

--- linux/mm/filemap.c.orig	Fri Jul 27 21:31:41 2001
+++ linux/mm/filemap.c	Sat Jul 28 03:01:05 2001
@@ -744,10 +744,8 @@
 	return NULL;
 }
 
-#if 0
 #define PROFILE_READAHEAD
 #define DEBUG_READAHEAD
-#endif
 
 /*
  * Read-ahead profiling information
@@ -791,13 +789,13 @@
 		}
 
 		printk("Readahead average:  max=%ld, len=%ld, win=%ld, async=%ld%%\n",
-			total_ramax/total_reada,
-			total_ralen/total_reada,
-			total_rawin/total_reada,
-			(total_async*100)/total_reada);
+		       total_ramax/total_reada,
+		       total_ralen/total_reada,
+		       total_rawin/total_reada,
+		       (total_async*100)/total_reada);
 #ifdef DEBUG_READAHEAD
-		printk("Readahead snapshot: max=%ld, len=%ld, win=%ld, raend=%Ld\n",
-			filp->f_ramax, filp->f_ralen, filp->f_rawin, filp->f_raend);
+		printk("Readahead snapshot: max=%ld, len=%ld, win=%ld, raend=%ld\n",
+		       filp->f_ramax, filp->f_ralen, filp->f_rawin, filp->f_raend);
 #endif
 
 		total_reada	= 0;
--- linux/include/linux/blkdev.h.orig	Fri Jul 27 21:36:37 2001
+++ linux/include/linux/blkdev.h	Sat Jul 28 02:51:10 2001
@@ -184,7 +184,7 @@
 #define PageAlignSize(size) (((size) + PAGE_SIZE -1) & PAGE_MASK)
 
 /* read-ahead in pages.. */
-#define MAX_READAHEAD	31
+#define MAX_READAHEAD	511
 #define MIN_READAHEAD	3
 
 #define blkdev_entry_to_request(entry) list_entry((entry), struct request, queue)


Thread overview: 124+ messages
2001-08-28  1:08 [resent PATCH] Re: very slow parallel read performance Dieter Nützel
2001-08-28  0:05 ` Marcelo Tosatti
2001-08-28  1:54   ` Daniel Phillips
2001-08-28  5:01 ` Mike Galbraith
2001-08-28  8:46   ` [reiserfs-list] " Hans Reiser
2001-08-28 19:17     ` Mike Galbraith
2001-08-28 18:18 ` Andrew Morton
2001-08-28 18:45   ` Hans Reiser
  -- strict thread matches above, loose matches on Subject: below --
2001-08-28 15:28 Dieter Nützel
2001-08-27  2:03 Rick Hohensee
2001-08-27  2:52 ` Keith Owens
2001-08-28 17:52 ` Kai Henningsen
2001-08-28 21:54   ` Matthew M
2001-08-23 21:35 Lehmann 
2001-08-24  7:35 ` [resent PATCH] " Roger Larsson
2001-08-24 17:43   ` Rik van Riel
2001-08-24 18:28     ` Roger Larsson
2001-08-24 19:02       ` Rik van Riel
2001-08-24 20:37         ` Gérard Roudier
2001-08-24 23:12           ` Rik van Riel
2001-08-25  8:02             ` Gérard Roudier
2001-08-25  9:26               ` Roger Larsson
2001-08-25 11:49                 ` Gérard Roudier
2001-08-25 17:56                   ` Roger Larsson
2001-08-25 19:13                     ` Gérard Roudier
2001-08-25 13:17               ` Rik van Riel
2001-08-24 22:29         ` Daniel Phillips
2001-08-24 23:10           ` Rik van Riel
2001-08-25  0:42             ` Daniel Phillips
2001-08-27  7:08           ` Helge Hafting
2001-08-27 14:31             ` Daniel Phillips
2001-08-27 14:42               ` Alex Bligh - linux-kernel
2001-08-27 15:14                 ` Rik van Riel
2001-08-27 16:04                   ` Daniel Phillips
     [not found]                   ` <Pine.LNX.4.33L.0108271213370.5646-100000@imladris.rielhome.conectiva>
2001-08-27 19:34                     ` Alex Bligh - linux-kernel
2001-08-27 20:03                       ` Oliver Neukum
2001-08-27 20:19                         ` Alex Bligh - linux-kernel
2001-08-27 21:38                           ` Oliver.Neukum
2001-08-27 22:26                             ` Alex Bligh - linux-kernel
2001-08-27 21:29                       ` Daniel Phillips
2001-08-27 16:02                 ` Daniel Phillips
2001-08-27 19:36                   ` Alex Bligh - linux-kernel
2001-08-27 20:24                     ` Daniel Phillips
2001-08-27 16:55               ` David Lang
2001-08-27 18:54                 ` Daniel Phillips
2001-08-27 18:37               ` Oliver Neukum
2001-08-27 19:04                 ` Daniel Phillips
2001-08-27 19:43                   ` Oliver Neukum
2001-08-27 20:37                     ` Daniel Phillips
2001-08-27 22:10                       ` Oliver.Neukum
2001-08-27 21:44                     ` Linus Torvalds
2001-08-27 22:30                       ` Daniel Phillips
2001-08-27 23:00                       ` Marcelo Tosatti
2001-08-28  3:10                         ` Linus Torvalds
2001-08-27 19:55                 ` Richard Gooch
2001-08-27 20:09                   ` Oliver Neukum
2001-08-27 21:06                   ` Daniel Phillips
2001-08-24 20:18     ` Daniel Phillips
2001-08-24 20:19       ` Rik van Riel
2001-08-24 21:11         ` Daniel Phillips
2001-08-24 23:03           ` Rik van Riel
2001-08-25  0:41             ` Daniel Phillips
2001-08-25  1:34               ` Rik van Riel
2001-08-25 15:49                 ` Daniel Phillips
2001-08-25 15:50                   ` Rik van Riel
2001-08-25 16:28                     ` Lehmann 
2001-08-25 16:34                       ` Rik van Riel
2001-08-25 16:41                         ` Lehmann 
2001-08-26 16:55                       ` Daniel Phillips
2001-08-26 18:39                         ` Rik van Riel
2001-08-26 19:46                           ` Daniel Phillips
2001-08-26 19:52                             ` Rik van Riel
2001-08-26 20:08                               ` Daniel Phillips
2001-08-26 22:33                                 ` Russell King
2001-08-26 23:24                                   ` Daniel Phillips
2001-08-26 23:24                                     ` Russell King
2001-08-27  0:07                                     ` Rik van Riel
2001-08-27  0:02                                 ` Rik van Riel
2001-08-27  0:42                                   ` Daniel Phillips
2001-08-25 16:43                     ` Daniel Phillips
2001-08-25 19:15                       ` Alan Cox
2001-08-25 19:35                         ` Lehmann 
2001-08-25 20:52                           ` Rik van Riel
2001-08-26  1:38                             ` Daniel Phillips
2001-08-26  2:49                               ` Lehmann 
2001-08-26 17:29                                 ` Daniel Phillips
2001-08-26 17:37                                   ` Craig I. Hagan
2001-08-26 18:56                                   ` Rik van Riel
2001-08-26 19:18                                   ` Lehmann 
2001-08-26 21:07                                     ` Daniel Phillips
2001-08-26 22:12                                       ` Rik van Riel
2001-08-26 23:24                                       ` Lehmann 
2001-08-26 20:26                                   ` Gérard Roudier
2001-08-26 21:20                                     ` Daniel Phillips
2001-08-26  3:32                               ` Rik van Riel
2001-08-26 13:22                                 ` Lehmann 
2001-08-26 13:48                                   ` Rik van Riel
2001-08-26 14:55                                     ` Lehmann 
2001-08-26 15:06                                       ` Rik van Riel
2001-08-26 15:25                                         ` Lehmann 
2001-08-25 21:33                           ` Alan Cox
2001-08-25 23:34                             ` Lehmann 
2001-08-26  2:02                               ` Rik van Riel
2001-08-26  2:57                                 ` Lehmann 
2001-08-26  0:46                           ` John Stoffel
2001-08-26  1:07                             ` Alan Cox
2001-08-26  3:30                               ` Rik van Riel
2001-08-26  3:40                             ` Rik van Riel
2001-08-26  5:28                               ` Daniel Phillips
2001-08-24 23:23         ` Lehmann 
     [not found]           ` <200108242344.f7ONi0h21270@mailg.telia.com>
2001-08-25  0:28             ` Lehmann 
2001-08-25  3:09           ` Rik van Riel
2001-08-25  9:13             ` Gérard Roudier
2001-08-26 16:54           ` Daniel Phillips
2001-08-26 18:59             ` Victor Yodaiken
2001-08-26 19:38               ` Rik van Riel
2001-08-26 20:05                 ` Victor Yodaiken
2001-08-26 20:34                   ` Rik van Riel
2001-08-26 20:45                     ` Victor Yodaiken
2001-08-26 21:00                       ` Alan Cox
2001-08-26 20:42               ` Daniel Phillips
2001-08-26 19:31             ` Lehmann 
2001-08-24 19:42   ` Lehmann 
2001-08-24 21:42     ` Gérard Roudier
2001-08-25  0:05   ` Craig I. Hagan
