linux-kernel.vger.kernel.org archive mirror
* 2.4.6-pre2, pre3 VM Behavior
@ 2001-06-13 19:31 Tom Sightler
  2001-06-13 20:21 ` Rik van Riel
  0 siblings, 1 reply; 52+ messages in thread
From: Tom Sightler @ 2001-06-13 19:31 UTC (permalink / raw)
  To: Linux-Kernel

Hi All,

I have been using the 2.4.x kernels since the 2.4.0-test days on my Dell 5000e
laptop with 320MB of RAM and have experienced first hand many of the problems
other users have reported with the VM system in 2.4.  Most of these problems
have been only minor annoyances, and I have continued testing kernels as the
2.4 series has progressed, mostly without noticing much change.

With 2.4.6-pre2 and -pre3 I can say that I have seen a marked improvement on my
machine, especially in interactive response, for my day-to-day workstation use.
However, I do have one observation that seems rather strange, or at least wrong.

I, on occasion, have the need to transfer relatively large files (750MB-1GB)
from our larger Linux servers to my machine.  I usually use ftp to transfer
these files and this is where I notice the following:

1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
close to wire speed).  At this point Linux has yet to write the first byte to
disk.  OK, this might be an exaggeration, but very little disk activity has
occurred on my laptop.

2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
maybe I should do that" then the hard drive light comes on solid for several
seconds.  During this time the ftp transfer drops to about 1/5 of the original
speed.

3.  After the initial burst of data is written things seem much more reasonable,
and data streams to the disk almost continually while the rest of the transfer
completes at near full speed again.

Basically, it seems the kernel buffers all of the incoming file up to nearly
all available memory before it begins to panic and starts flushing the file to
disk.  It seems it should start to lazy-write somewhat earlier.  Perhaps some
of this is tuneable from userland and I just don't know how.

This was much less noticeable on a server with a much faster SCSI hard disk
subsystem, as it took significantly less time to flush the information to the
disk once it finally started, but laptop hard drives are traditionally poor
performers, and at 15MB/s it takes 10-15 seconds before things stabilize, just
from transferring a file.

Anyway, things are still much better; with older kernels the machine would
almost seem locked up during those 10-15 seconds, but now my apps stay fairly
responsive (I can still type in AbiWord, browse in Mozilla, etc.).

Later,
Tom


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-13 19:31 2.4.6-pre2, pre3 VM Behavior Tom Sightler
@ 2001-06-13 20:21 ` Rik van Riel
  2001-06-14  1:49   ` Tom Sightler
  2001-06-14  8:30   ` Mike Galbraith
  0 siblings, 2 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-13 20:21 UTC (permalink / raw)
  To: Tom Sightler; +Cc: Linux-Kernel

On Wed, 13 Jun 2001, Tom Sightler wrote:

> 1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> close to wire speed).  At this point Linux has yet to write the first byte to
> disk.  OK, this might be an exaggeration, but very little disk activity has
> occurred on my laptop.
>
> 2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> maybe I should do that" then the hard drive light comes on solid for several
> seconds.  During this time the ftp transfer drops to about 1/5 of the original
> speed.
>
> 3.  After the initial burst of data is written things seem much more reasonable,
> and data streams to the disk almost continually while the rest of the transfer
> completes at near full speed again.
>
> Basically, it seems the kernel buffers all of the incoming file up to nearly
> all available memory before it begins to panic and starts flushing the file to disk.
>  It seems it should start to lazy-write somewhat earlier.
> Perhaps some of this is tuneable from userland and I just don't
> know how.

Actually, it already does the lazy write earlier.

The page reclaim code scans up to 1/4th of the inactive_dirty
pages on the first loop, where it does NOT write things to
disk.

On the second loop, we start asynchronous writeout of data
to disk and scan up to 1/2 of the inactive_dirty pages,
trying to find clean pages to free.

Only when there simply are no clean pages do we resort to
synchronous IO, and the system will wait for pages to be
cleaned.
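
As a standalone sketch of that shape of decision (a toy model only, not
the actual page_launder() code; the struct, the pass numbering and the
fractions are purely illustrative):

struct page_stub { int dirty; };

/* Pass 1 looks at 1/4 of the inactive_dirty list and frees only clean
 * pages.  Pass 2 looks at 1/2 of the list and starts asynchronous
 * writeout for dirty pages.  Synchronous IO is the last resort, used
 * only when no clean pages turn up at all. */
static int scan_inactive_dirty(struct page_stub *pages, int nr, int pass)
{
	int freed = 0;
	int scan = (pass == 1) ? nr / 4 : nr / 2;

	for (int i = 0; i < scan; i++) {
		if (!pages[i].dirty) {
			freed++;	/* clean page: reclaim it now */
		} else if (pass == 2) {
			/* the real code would queue async writeout here
			 * and move on without waiting for the IO */
		}
	}
	return freed;
}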

After the initial burst, the system should stabilise,
starting the writeout of pages before we run low on
memory. How to handle the initial burst is something
I haven't figured out yet ... ;)

> Anyway, things are still much better, with older kernels things
> would almost seem locked up during those 10-15 seconds but now
> my apps stay fairly responsive (I can still type in AbiWord,
> browse in Mozilla, etc).

This is due to this smarter handling of the flushing of
dirty pages and due to a more subtle bug where the system
ended up doing synchronous IO on too many pages, whereas
now it only does synchronous IO on _1_ page per scan ;)

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-13 20:21 ` Rik van Riel
@ 2001-06-14  1:49   ` Tom Sightler
  2001-06-14  3:16     ` Rik van Riel
  2001-06-14  8:30   ` Mike Galbraith
  1 sibling, 1 reply; 52+ messages in thread
From: Tom Sightler @ 2001-06-14  1:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Tom Sightler, Linux-Kernel

Quoting Rik van Riel <riel@conectiva.com.br>:

> After the initial burst, the system should stabilise,
> starting the writeout of pages before we run low on
> memory. How to handle the initial burst is something
> I haven't figured out yet ... ;)

Well, at least I know that this is expected with the VM, although I do still
think this is bad behavior.  If my disk is idle why would I wait until I have
greater than 100MB of data to write before I finally start actually moving some
data to disk?
 
> This is due to this smarter handling of the flushing of
> dirty pages and due to a more subtle bug where the system
> ended up doing synchronous IO on too many pages, whereas
> now it only does synchronous IO on _1_ page per scan ;)

And this is definitely a noticeable fix, thanks for your continued work.  I know
it's hard to get everything balanced out right, and I only wrote this email to
describe some behavior I was seeing and make sure it was expected in the current
VM.  You've let me know that it is, and it's really minor compared to problems
some of the earlier kernels had.

Later,
Tom



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  1:49   ` Tom Sightler
@ 2001-06-14  3:16     ` Rik van Riel
  2001-06-14  7:59       ` Laramie Leavitt
                         ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-14  3:16 UTC (permalink / raw)
  To: Tom Sightler; +Cc: Linux-Kernel

On Wed, 13 Jun 2001, Tom Sightler wrote:
> Quoting Rik van Riel <riel@conectiva.com.br>:
> 
> > After the initial burst, the system should stabilise,
> > starting the writeout of pages before we run low on
> > memory. How to handle the initial burst is something
> > I haven't figured out yet ... ;)
> 
> Well, at least I know that this is expected with the VM, although I do
> still think this is bad behavior.  If my disk is idle why would I wait
> until I have greater than 100MB of data to write before I finally
> start actually moving some data to disk?

The file _could_ be a temporary file, which gets removed
before we'd get around to writing it to disk. Sure, the
chances of this happening with a single file are close to
zero, but having 100MB from 200 different temp files on a
shell server isn't unreasonable to expect.

> > This is due to this smarter handling of the flushing of
> > dirty pages and due to a more subtle bug where the system
> > ended up doing synchronous IO on too many pages, whereas
> > now it only does synchronous IO on _1_ page per scan ;)
> 
> And this is definitely a noticeable fix, thanks for your continued
> work.  I know it's hard to get everything balanced out right, and I
> only wrote this email to describe some behavior I was seeing and make
> sure it was expected in the current VM.  You've let me know that it
> is, and it's really minor compared to problems some of the earlier
> kernels had.

I'll be sure to keep this problem in mind. I really want
to fix it, I just haven't figured out how yet  ;)

Maybe we should just see if anything in the first few MB
of inactive pages was freeable, limiting the first scan to
something like 1 or maybe even 5 MB maximum (freepages.min?
freepages.high?) and flushing as soon as we find more unfreeable
pages than that ?
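
Reading that idea one way, as a rough standalone sketch (the 5 MB
window, the page size and the trigger below are all made-up numbers,
not a concrete proposal):

#define EARLY_SCAN_PAGES	(5 * 1024 * 1024 / 4096)  /* ~5 MB of 4K pages */

/* Look only at the first few MB worth of inactive pages; if too many
 * of them turn out to be unfreeable (dirty), start flushing early
 * instead of scanning deeper. */
static int want_early_flush(const int *page_is_dirty, int nr_inactive)
{
	int scan = nr_inactive < EARLY_SCAN_PAGES ? nr_inactive : EARLY_SCAN_PAGES;
	int unfreeable = 0;

	for (int i = 0; i < scan; i++)
		if (page_is_dirty[i])
			unfreeable++;

	return unfreeable > scan / 2;	/* arbitrary trigger for the sketch */
}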

Maybe another solution with bdflush tuning ?

I'll send a patch as soon as I figure this one out...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* RE: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  3:16     ` Rik van Riel
@ 2001-06-14  7:59       ` Laramie Leavitt
  2001-06-14  9:24         ` Helge Hafting
  2001-06-14  8:47       ` Daniel Phillips
  2001-06-14 15:10       ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
  2 siblings, 1 reply; 52+ messages in thread
From: Laramie Leavitt @ 2001-06-14  7:59 UTC (permalink / raw)
  To: Linux-Kernel

On Behalf Of Rik van Riel
> On Wed, 13 Jun 2001, Tom Sightler wrote:
> > Quoting Rik van Riel <riel@conectiva.com.br>:
> > 
> > > After the initial burst, the system should stabilise,
> > > starting the writeout of pages before we run low on
> > > memory. How to handle the initial burst is something
> > > I haven't figured out yet ... ;)
> > 
> > Well, at least I know that this is expected with the VM, although I do
> > still think this is bad behavior.  If my disk is idle why would I wait
> > until I have greater than 100MB of data to write before I finally
> > start actually moving some data to disk?
> 
> The file _could_ be a temporary file, which gets removed
> before we'd get around to writing it to disk. Sure, the
> chances of this happening with a single file are close to
> zero, but having 100MB from 200 different temp files on a
> shell server isn't unreasonable to expect.
> 
> > > This is due to this smarter handling of the flushing of
> > > dirty pages and due to a more subtle bug where the system
> > > ended up doing synchronous IO on too many pages, whereas
> > > now it only does synchronous IO on _1_ page per scan ;)
> > 
> > And this is definitely a noticeable fix, thanks for your continued
> > work.  I know it's hard to get everything balanced out right, and I
> > only wrote this email to describe some behavior I was seeing and make
> > sure it was expected in the current VM.  You've let me know that it
> > is, and it's really minor compared to problems some of the earlier
> > kernels had.
> 
> I'll be sure to keep this problem in mind. I really want
> to fix it, I just haven't figured out how yet  ;)
> 
> Maybe we should just see if anything in the first few MB
> of inactive pages was freeable, limiting the first scan to
> something like 1 or maybe even 5 MB maximum (freepages.min?
> freepages.high?) and flushing as soon as we find more unfreeable
> pages than that ?
> 

Would it be possible to maintain a dirty-rate count 
for the dirty buffers?

For example, it is possible to figure an approximate
disk subsystem speed from most of the given information.
If it is possible to know the rate at which new buffers
are being dirtied then we could compare that to the available
memory and the disk speed to calculate some maintainable
rate at which buffers need to be expired.  The rates would
have to maintain some historical data to account for
bursty data...

It may be possible to use a very similar mechanism to do
both.  I.e. not actually calculate the rate from the hardware,
but use a similar counter for the expiry rate of buffers.

I don't know how difficult the accounting would be,
but it seems possible to make it automatically self-tuning.

This is a little different than just keeping a list
of dirty buffers and free buffers because you have
the rate information which tells you how long you
have until all the buffers expire.
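
Something along these lines, as a self-contained sketch (the names, the
1/8 smoothing factor and the headroom test are all invented here, just
to show the kind of accounting I mean):

struct rate_est {
	long dirtied_per_sec;	/* smoothed: buffers dirtied per second */
	long cleaned_per_sec;	/* smoothed: buffers written out per second */
};

/* Feed in one second's worth of raw counts; keeping 7/8 history means
 * short bursts don't swing the estimate around too much. */
static void rate_sample(struct rate_est *r, long dirtied, long cleaned)
{
	r->dirtied_per_sec += (dirtied - r->dirtied_per_sec) / 8;
	r->cleaned_per_sec += (cleaned - r->cleaned_per_sec) / 8;
}

/* If, at the current net dirtying rate, free memory would run out
 * within headroom_secs, buffers need to start expiring faster now. */
static int need_faster_expiry(const struct rate_est *r,
			      long free_pages, long headroom_secs)
{
	long net = r->dirtied_per_sec - r->cleaned_per_sec;

	return net > 0 && free_pages < net * headroom_secs;
}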

Laramie.


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-13 20:21 ` Rik van Riel
  2001-06-14  1:49   ` Tom Sightler
@ 2001-06-14  8:30   ` Mike Galbraith
  1 sibling, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-14  8:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Tom Sightler, Linux-Kernel

On Wed, 13 Jun 2001, Rik van Riel wrote:

> On Wed, 13 Jun 2001, Tom Sightler wrote:
>
> > 1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> > close to wire speed).  At this point Linux has yet to write the first byte to
> > disk.  OK, this might be an exaggeration, but very little disk activity has
> > occurred on my laptop.
> >
> > 2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> > maybe I should do that" then the hard drive light comes on solid for several
> > seconds.  During this time the ftp transfer drops to about 1/5 of the original
> > speed.
> >
> > 3.  After the initial burst of data is written things seem much more reasonable,
> > and data streams to the disk almost continually while the rest of the transfer
> > completes at near full speed again.
> >
> > Basically, it seems the kernel buffers all of the incoming file up to nearly
> > all available memory before it begins to panic and starts flushing the file to disk.
> >  It seems it should start to lazy-write somewhat earlier.
> > Perhaps some of this is tuneable from userland and I just don't
> > know how.
>
> Actually, it already does the lazy write earlier.
>
> The page reclaim code scans up to 1/4th of the inactive_dirty
> pages on the first loop, where it does NOT write things to
> disk.

I've done some experiments with a _clean_ shortage.  Requiring that a
portion of inactive pages be pre-cleaned improves response as you start
reclaiming.  Even though you may have enough inactive pages total, you
know that laundering is needed before things get heavy.  This gets the
dirty pages moving a little sooner.  As you're reclaiming pages, writes
trickle out whether your dirty list is short or long.  (and if I'd been
able to make that idea work a little better, you'd have seen my mess;)

	-Mike



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  3:16     ` Rik van Riel
  2001-06-14  7:59       ` Laramie Leavitt
@ 2001-06-14  8:47       ` Daniel Phillips
  2001-06-14 20:23         ` Roger Larsson
  2001-06-14 20:39         ` John Stoffel
  2001-06-14 15:10       ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
  2 siblings, 2 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-14  8:47 UTC (permalink / raw)
  To: Rik van Riel, Tom Sightler; +Cc: Linux-Kernel

On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> On Wed, 13 Jun 2001, Tom Sightler wrote:
> > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > After the initial burst, the system should stabilise,
> > > starting the writeout of pages before we run low on
> > > memory. How to handle the initial burst is something
> > > I haven't figured out yet ... ;)
> >
> > Well, at least I know that this is expected with the VM, although I do
> > still think this is bad behavior.  If my disk is idle why would I wait
> > until I have greater than 100MB of data to write before I finally
> > start actually moving some data to disk?
>
> The file _could_ be a temporary file, which gets removed
> before we'd get around to writing it to disk. Sure, the
> chances of this happening with a single file are close to
> zero, but having 100MB from 200 different temp files on a
> shell server isn't unreasonable to expect.

This still doesn't make sense if the disk bandwidth isn't being used.

> Maybe we should just see if anything in the first few MB
> of inactive pages was freeable, limiting the first scan to
> something like 1 or maybe even 5 MB maximum (freepages.min?
> freepages.high?) and flushing as soon as we find more unfreeable
> pages than that ?

There are two cases, file-backed and swap-backed pages.

For file-backed pages what we want is pretty simple: when 1) disk bandwidth
is less than xx% used and 2) memory pressure is moderate, just submit whatever's
dirty.  As pressure increases and bandwidth gets loaded up (including read
traffic), leave things on the inactive list longer to allow more chances for
combining and better clustering decisions.
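
As a bare sketch of that decision (the 50% and 75% cutoffs below are
placeholders for the unspecified xx% and "moderate", and the real thing
would need actual bandwidth accounting behind it):

enum flush_policy { FLUSH_NOW, FLUSH_DELAY };

static enum flush_policy file_backed_policy(int disk_busy_pct,
					    int mem_pressure_pct)
{
	/* bandwidth to spare and only moderate pressure:
	 * just submit whatever is dirty */
	if (disk_busy_pct < 50 && mem_pressure_pct < 75)
		return FLUSH_NOW;

	/* otherwise keep pages on the inactive list longer, so writes
	 * can be combined and clustered before they go out */
	return FLUSH_DELAY;
}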

There is no such obvious answer for swap-backed pages; the main difference is 
what should happen under low-to-moderate pressure.  On a server we probably 
want to pre-write as many inactive/dirty pages to swap as possible in order 
to respond better to surges, even when pressure is low.  We don't want this 
behaviour on a laptop, otherwise the disk would never spin down.  There's a 
configuration parameter in there somewhere.

--
Daniel


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  7:59       ` Laramie Leavitt
@ 2001-06-14  9:24         ` Helge Hafting
  2001-06-14 17:38           ` Mark Hahn
  0 siblings, 1 reply; 52+ messages in thread
From: Helge Hafting @ 2001-06-14  9:24 UTC (permalink / raw)
  To: lar, linux-kernel

Laramie Leavitt wrote:

> Would it be possible to maintain a dirty-rate count
> for the dirty buffers?
> 
> For example, it is possible to figure an approximate
> disk subsystem speed from most of the given information.

Disk speed is difficult.  I may enable and disable swap on any number of
very different disks and files.  And making it per-device won't help that
much.  The device may have other partitions with varying access patterns,
and sometimes different devices interfere with each other, such as two
IDE drives on the same cable, or several SCSI drives using up SCSI (or
PCI!) bandwidth for file access.

You may be able to get some useful approximations, but you
will probably not be able to get good numbers in all cases.

Helge Hafting


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  3:16     ` Rik van Riel
  2001-06-14  7:59       ` Laramie Leavitt
  2001-06-14  8:47       ` Daniel Phillips
@ 2001-06-14 15:10       ` John Stoffel
  2001-06-14 18:25         ` Daniel Phillips
  2 siblings, 1 reply; 52+ messages in thread
From: John Stoffel @ 2001-06-14 15:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Tom Sightler, Linux-Kernel


>> The file _could_ be a temporary file, which gets removed before
>> we'd get around to writing it to disk. Sure, the chances of this
>> happening with a single file are close to zero, but having 100MB
>> from 200 different temp files on a shell server isn't unreasonable
>> to expect.

Daniel> This still doesn't make sense if the disk bandwidth isn't
Daniel> being used.

And can't you tell that a certain percentage of buffers are owned by a
single file/process?  It would seem that a simple metric of

       if ##% of the buffer/cache is used by 1 process/file, start
       writing the file out to disk, even if there is no pressure.

might do the trick to handle this case.
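
In sketch form (pct_threshold stands in for the unspecified ##%, and the
two page counts are assumed to be tracked somewhere):

/* Does one file own at least pct_threshold percent of the cache?
 * If so, start background writeout for it even without pressure. */
static int single_file_hog(unsigned long file_dirty_pages,
			   unsigned long total_cache_pages,
			   unsigned long pct_threshold)
{
	if (!total_cache_pages)
		return 0;
	return file_dirty_pages * 100 >= total_cache_pages * pct_threshold;
}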

>> Maybe we should just see if anything in the first few MB of
>> inactive pages was freeable, limiting the first scan to something
>> like 1 or maybe even 5 MB maximum (freepages.min?  freepages.high?)
>> and flushing as soon as we find more unfreeable pages than that ?

Daniel> For file-backed pages what we want is pretty simple: when 1)
Daniel> disk bandwidth is less than xx% used 2) memory pressure is
Daniel> moderate, just submit whatever's dirty.  As pressure increases
Daniel> and bandwidth gets loaded up (including read traffic) leave
Daniel> things on the inactive list longer to allow more chances for
Daniel> combining and better clustering decisions.

Would it also be good to say that pressure should increase as the
buffer.free percentage goes down?  It won't stop you from filling the
buffer, but it should at least start pushing out pages to disk
earlier.  

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  9:24         ` Helge Hafting
@ 2001-06-14 17:38           ` Mark Hahn
  2001-06-15  8:27             ` Helge Hafting
  0 siblings, 1 reply; 52+ messages in thread
From: Mark Hahn @ 2001-06-14 17:38 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

> > Would it be possible to maintain a dirty-rate count
> > for the dirty buffers?
> > 
> > For example, it is possible to figure an approximate
> > disk subsystem speed from most of the given information.
> 
> Disk speed is difficult.  I may enable and disable swap on any number of
...
> You may be able to get some useful approximations, but you
> will probably not be able to get good numbers in all cases.

a useful approximation would be simply an idle flag.
for instance, if the disk is idle, then cleaning a few 
inactive-dirty pages would make perfect sense, even in 
the absence of memory pressure.
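
as a sketch (the flag and the batch size here are invented, of course;
the flag would have to be maintained by the block layer):

static int disk_is_idle;	/* imagine the block layer kept this up to date */

/* no memory pressure needed: when the disk has nothing better to do,
 * push out a small batch of inactive-dirty pages */
static int idle_trickle(int nr_inactive_dirty)
{
	if (!disk_is_idle || nr_inactive_dirty == 0)
		return 0;
	return nr_inactive_dirty < 32 ? nr_inactive_dirty : 32;	/* pages to clean */
}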



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 15:10       ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
@ 2001-06-14 18:25         ` Daniel Phillips
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-14 18:25 UTC (permalink / raw)
  To: John Stoffel; +Cc: Rik van Riel, Tom Sightler, Linux-Kernel

On Thursday 14 June 2001 17:10, John Stoffel wrote:
> >> The file _could_ be a temporary file, which gets removed before
> >> we'd get around to writing it to disk. Sure, the chances of this
> >> happening with a single file are close to zero, but having 100MB
> >> from 200 different temp files on a shell server isn't unreasonable
> >> to expect.
>
> Daniel> This still doesn't make sense if the disk bandwidth isn't
> Daniel> being used.
>
> And can't you tell that a certain percentage of buffers are owned by a
> single file/process?  It would seem that a simple metric of
>
>        if ##% of the buffer/cache is used by 1 process/file, start
>        writing the file out to disk, even if there is no pressure.
>
> might do the trick to handle this case.

Buffers and file pages are owned by the VFS, not by processes per se, which
makes accounting harder.  In this case you don't care: it's a file, so in the
absence of memory pressure and with disk bandwidth available it's better to 
get the data onto disk sooner rather than later.  (This glosses over the 
question of mmap's, by the way.)  It's pretty hard to see why there is any 
benefit at all in delaying, but it's clear there's a benefit in terms of data 
safety and a further benefit in terms of doing what the user expects.

> >> Maybe we should just see if anything in the first few MB of
> >> inactive pages was freeable, limiting the first scan to something
> >> like 1 or maybe even 5 MB maximum (freepages.min?  freepages.high?)
> >> and flushing as soon as we find more unfreeable pages than that ?
>
> Daniel> For file-backed pages what we want is pretty simple: when 1)
> Daniel> disk bandwidth is less than xx% used 2) memory pressure is
> Daniel> moderate, just submit whatever's dirty.  As pressure increases
> Daniel> and bandwidth gets loaded up (including read traffic) leave
> Daniel> things on the inactive list longer to allow more chances for
> Daniel> combining and better clustering decisions.
>
> Would it also be good to say that pressure should increase as the
> buffer.free percentage goes down?

Maybe - right now getblk waits until it runs completely out of buffers of a 
given size before trying to allocate more, which means that sometimes an io 
will be delayed by the time it takes to complete a page_launder cycle.  Two 
reasons why it may not be worth doing anything about this: 1) we will move 
most of the buffer users into the page cache in due course 2) the frequency 
of this kind of io delay is *probably* pretty low.

> It won't stop you from filling the
> buffer, but it should at least start pushing out pages to disk
> earlier.

--
Daniel


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  8:47       ` Daniel Phillips
@ 2001-06-14 20:23         ` Roger Larsson
  2001-06-15  6:04           ` Mike Galbraith
  2001-06-14 20:39         ` John Stoffel
  1 sibling, 1 reply; 52+ messages in thread
From: Roger Larsson @ 2001-06-14 20:23 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linux-Kernel

On Thursday 14 June 2001 10:47, Daniel Phillips wrote:
> On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> > On Wed, 13 Jun 2001, Tom Sightler wrote:
> > > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > > After the initial burst, the system should stabilise,
> > > > starting the writeout of pages before we run low on
> > > > memory. How to handle the initial burst is something
> > > > I haven't figured out yet ... ;)
> > >
> > > Well, at least I know that this is expected with the VM, although I do
> > > still think this is bad behavior.  If my disk is idle why would I wait
> > > until I have greater than 100MB of data to write before I finally
> > > start actually moving some data to disk?
> >
> > The file _could_ be a temporary file, which gets removed
> > before we'd get around to writing it to disk. Sure, the
> > chances of this happening with a single file are close to
> > zero, but having 100MB from 200 different temp files on a
> > shell server isn't unreasonable to expect.
>
> This still doesn't make sense if the disk bandwidth isn't being used.
>

It does if you are running on a laptop. Then you do not want the pages
to go out all the time. The disk has gone to sleep, needs to start up to
write a few pages, stays idle for a while, goes to sleep, a few more pages, ...

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  8:47       ` Daniel Phillips
  2001-06-14 20:23         ` Roger Larsson
@ 2001-06-14 20:39         ` John Stoffel
  2001-06-14 20:51           ` Rik van Riel
                             ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: John Stoffel @ 2001-06-14 20:39 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Daniel Phillips, Linux-Kernel


Roger> It does if you are running on a laptop. Then you do not want
Roger> the pages to go out all the time. The disk has gone to sleep, needs
Roger> to start up to write a few pages, stays idle for a while, goes to
Roger> sleep, a few more pages, ...

That could be handled by a metric which says if the disk is spun down,
wait until there is more memory pressure before writing.  But if the
disk is spinning, we don't care, you should start writing out buffers
at some low rate to keep the pressure from rising too rapidly.  

The idea of buffers is more to keep from overloading the disk
subsystem with IO, not to stop IO from happening at all.  And to keep
it from going straight from no IO to full-out IO that stalls the system.
It should be a nice curve: as VM pressure goes up, the buffer flushing
IO rate goes up as well.

Overall, I think Rik, Jonathan and the rest of the hard-core VM crew
have been doing a great job with the 2.4.5+ work; it seems like it's
getting better and better all the time, and I really appreciate it.

We're now more into some corner cases and tuning issues.  Hopefully.

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 20:39         ` John Stoffel
@ 2001-06-14 20:51           ` Rik van Riel
  2001-06-14 21:33           ` John Stoffel
  2001-06-15 15:23           ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
  2 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-14 20:51 UTC (permalink / raw)
  To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel

On Thu, 14 Jun 2001, John Stoffel wrote:

> That could be handled by a metric which says if the disk is spun down,
> wait until there is more memory pressure before writing.  But if the
> disk is spinning, we don't care, you should start writing out buffers
> at some low rate to keep the pressure from rising too rapidly.
>
> The idea of buffers is more to keep from overloading the disk
> subsystem with IO, not to stop IO from happening at all.  And to keep
> it from going from no IO to full out stalling the system IO.  It
> should be a nice line as VM pressure goes up, buffer flushing IO rate
> goes up as well.

There's another issue.  If dirty data is written out in
small bunches, that means we have to write out the dirty
data more often.

This in turn means extra disk seeks, which can horribly
interfere with disk reads.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 20:39         ` John Stoffel
  2001-06-14 20:51           ` Rik van Riel
@ 2001-06-14 21:33           ` John Stoffel
  2001-06-14 22:23             ` Rik van Riel
  2001-06-15 15:23           ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
  2 siblings, 1 reply; 52+ messages in thread
From: John Stoffel @ 2001-06-14 21:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: John Stoffel, Roger Larsson, Daniel Phillips, Linux-Kernel


Rik> There's another issue.  If dirty data is written out in small
Rik> bunches, that means we have to write out the dirty data more
Rik> often.

What do you consider a small bunch?  32k?  1Mb?  1% of buffer space?
I don't see how delaying writes until the buffer is almost full really
helps us.  As the buffer fills, the pressure to do writes should
increase, so that we tend, over time, to empty the buffer.  

A buffer is just that, not persistent storage.  

And in the case given, we were not seeing slow degradation; we saw
that the user ran into a wall (or an inflection point in the response
time vs. load graph), which was pretty sharp.  We need to handle that
more gracefully.

Rik> This in turn means extra disk seeks, which can horribly interfere
Rik> with disk reads.

True, but are we optimizing for reads or for writes here?  Shouldn't
they really be equally weighted for priority?  And wouldn't the
elevator help handle this to a degree?

Some areas to think about, at least for me.  And maybe it should be
read and write pressure, not rate?  

 - low write rate, and a low read rate.
   - Do seeks dominate our IO latency/throughput?

 - low read rate, higher write rate (ie buffers filling faster than
   they are being written to disk)
   - Do we care as much about reads in this case?  
   - If the write is just a small, high intensity burst, we don't want
     to go ape on writing out buffers to disk, but we do want to raise the
     rate we do so in the background, no?

- low write rate, high read rate.
  - seems like we want to keep writing the buffers, but at a lower
    rate. 

Just some thoughts...

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 21:33           ` John Stoffel
@ 2001-06-14 22:23             ` Rik van Riel
  0 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-14 22:23 UTC (permalink / raw)
  To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel

On Thu, 14 Jun 2001, John Stoffel wrote:

> Rik> There's another issue.  If dirty data is written out in small
> Rik> bunches, that means we have to write out the dirty data more
> Rik> often.
>
> What do you consider a small bunch?  32k?  1Mb?  1% of buffer space?
> I don't see how delaying writes until the buffer is almost full really
> helps us.  As the buffer fills, the pressure to do writes should
> increase, so that we tend, over time, to empty the buffer.
>
> A buffer is just that, not persistent storage.
>
> And in the case given, we were not seeing slow degradation, we saw
> that the user ran into a wall (or inflection point in the response
> time vs load graph), which was pretty sharp.  We need to handle that
> more gracefully.

No doubt on the fact that we need to handle it gracefully,
but as long as we don't have any answers to any of the
tricky questions you're asking above it'll be kind of hard
to come up with a patch ;))

> Rik> This in turn means extra disk seeks, which can horribly interfere
> Rik> with disk reads.
>
> True, but are we optimizing for reads or for writes here?  Shouldn't
> they really be equally weighted for priority?  And wouldn't the
> Elevator help handle this to a degree?

We definitely need to optimise for reads.

Every time we do a read, we KNOW there's a process waiting
on the data to come in from the disk.

Most of the time we do writes, they'll be asynchronous
delayed IO which is done in the background. The program
which wrote the data has moved on to other things long
since.

> Some areas to think about, at least for me.  And maybe it should be
> read and write pressure, not rate?
>
>  - low write rate, and a low read rate.
>    - Do seeks dominate our IO latency/throughput?

Seeks always dominate IO latency ;)

If you have a program which needs to get data from some file
on disk, it is beneficial for that program if the disk head
is near the data it wants.

Moving the disk head all the way to the other side of the
disk once a second will not slow the program down too much,
but moving the disk head away 30 times a second "because
there is little disk load" might just slow the program
down by a factor of 2 ...

I.e. if the head is in the same track or in the track next
door, we only have rotational latency to account for (say 3ms);
if we're on the other side of the disk we also have to count
in seek time (say 7ms). Giving the program 30 * 7 = 210 ms
extra IO wait time per second just isn't good ;)

> - low write rate, high read rate.
>   - seems like we want to keep writing the buffers, but at a lower
>     rate.

Not at a lower rate, just in larger blocks.  Disk transfer
rate is so ridiculously high nowadays that seek time seems
the only sensible thing to optimise for.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 20:23         ` Roger Larsson
@ 2001-06-15  6:04           ` Mike Galbraith
  0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-15  6:04 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Daniel Phillips, Linux-Kernel

On Thu, 14 Jun 2001, Roger Larsson wrote:

> On Thursday 14 June 2001 10:47, Daniel Phillips wrote:
> > On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> > > On Wed, 13 Jun 2001, Tom Sightler wrote:
> > > > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > > > After the initial burst, the system should stabilise,
> > > > > starting the writeout of pages before we run low on
> > > > > memory. How to handle the initial burst is something
> > > > > I haven't figured out yet ... ;)
> > > >
> > > > Well, at least I know that this is expected with the VM, although I do
> > > > still think this is bad behavior.  If my disk is idle why would I wait
> > > > until I have greater than 100MB of data to write before I finally
> > > > start actually moving some data to disk?
> > >
> > > The file _could_ be a temporary file, which gets removed
> > > before we'd get around to writing it to disk. Sure, the
> > > chances of this happening with a single file are close to
> > > zero, but having 100MB from 200 different temp files on a
> > > shell server isn't unreasonable to expect.
> >
> > This still doesn't make sense if the disk bandwidth isn't being used.
> >
>
> It does if you are running on a laptop. Then you do not want the pages
> to go out all the time. The disk has gone to sleep, needs to start up to
> write a few pages, stays idle for a while, goes to sleep, a few more pages, ...

True, you'd never want data trickling to disk on a laptop on battery.
If you have to write, you'd want to write everything dirty at once to
extend the period between disk powerups to the max.

With file IO, there is a high probability that the disk is running
while you are generating files, temp or not (because you generally do
read/write, not ____/write), so that doesn't negate the argument.

Delayed write is definitely nice for temp files or whatever.. until
your dirty data no longer comfortably fits in RAM.  At that point, the
write delay just became lost time and wasted disk bandwidth, whether
it's a laptop or not.  The problem is: how do you know the dirty data
is going to become too large for comfort?

One thing which seems to me likely to improve behavior is to have two
goals.  One is the trigger point for starting flushing, the second is
a goal below the start point so we define a quantity which needs to be
flushed to prevent us from having to flush again so soon.  Stopping as
soon as the flush trigger is reached means that we'll reach that limit
instantly if the box is doing any writing.. bad news for the laptop and
not beneficial to desktop or server boxen.  Another benefit of having
two goals is that it's easy to see if you're making progress or losing
ground so you can modify behavior accordingly.
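
A minimal sketch of that two-goal idea, with placeholder percentages
(the real numbers would need tuning, and the dirty/total accounting is
assumed to exist somewhere):

#define FLUSH_START_PCT	30	/* begin a flush episode at 30% dirty */
#define FLUSH_STOP_PCT	15	/* ...and keep flushing down to 15% */

static int flushing;		/* inside a flush episode right now? */

static int keep_flushing(unsigned long dirty_pages, unsigned long total_pages)
{
	unsigned long pct = total_pages ? dirty_pages * 100 / total_pages : 0;

	if (!flushing && pct >= FLUSH_START_PCT)
		flushing = 1;		/* crossed the trigger point */
	else if (flushing && pct <= FLUSH_STOP_PCT)
		flushing = 0;		/* reached the lower goal */

	return flushing;
}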

Rik mentioned that the system definitely needs to be optimized for read,
and just thinking about it without possessing much OS theory, that rings
of golden truth.  Now, what does having much dirt lying around do to
asynchronous readahead?.. it turns it synchronous prematurely and negates
the read optimization.

	-Mike

(That's why I mentioned having tried a clean shortage, to ensure more
headroom for readahead to keep it asynchronous longer.  Not working the
disk hard 'enough' [define that;] must harm performance by turning both
read _and_ write into strictly synchronous operations prematurely)




* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 17:38           ` Mark Hahn
@ 2001-06-15  8:27             ` Helge Hafting
  0 siblings, 0 replies; 52+ messages in thread
From: Helge Hafting @ 2001-06-15  8:27 UTC (permalink / raw)
  To: Mark Hahn, linux-kernel

Mark Hahn wrote:

> > Disk speed is difficult.  I may enable and disable swap on any number of
> ...
> > You may be able to get some useful approximations, but you
> > will probably not be able to get good numbers in all cases.
> 
> a useful approximation would be simply an idle flag.
> for instance, if the disk is idle, then cleaning a few
> inactive-dirty pages would make perfect sense, even in
> the absence of memory pressure.

You can't say "the disk".  One disk is common, but so are
setups with several.  You can say "clean pages if
all disks are idle".  You may then lose some opportunities
if one disk is idle while an unrelated one is busy.

Saying "clean a page if the disk it goes to is idle" may
look like the perfect solution, but it is surprisingly
hard.  It doesn't work with two IDE drives on the same
cable - accessing one will delay the other, which might be busy.
The same can happen with SCSI if the bandwidth of the SCSI bus
(or the ISA/PCI/whatever bus) it is connected to is maxed out.

And then there are loop & md devices.  My computer has several
md devices using different partitions on the same two disks,
as well as a few ordinary partitions.  Code to deal correctly
with that in all cases when one disk is busy and the other idle 
is hard.  Probably so complex that it'll be rejected on the
KISS principle alone.

A per-disk "low-priority queue" in addition to the ordinary
elevator might work well even in the presence of md
devices, as the md devices just pass stuff on to the real
disks.  Basically, let the disk driver pick stuff from the low-
priority queue only when the elevator is completely idle.
But this gives another problem - you get the complication
of moving stuff from low to normal priority at times, such
as when the process does fsync() or the pressure increases.
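
As a very rough sketch of that queue arrangement (all the types are
invented, and sorting/merging is ignored completely):

#include <stddef.h>

struct req_stub { struct req_stub *next; };

struct disk_queues {
	struct req_stub *normal;	/* ordinary elevator queue */
	struct req_stub *lowprio;	/* opportunistic writeout */
};

/* the driver services the low-priority queue only when the normal
 * elevator queue is completely empty */
static struct req_stub *next_request(struct disk_queues *q)
{
	struct req_stub *r;

	if (q->normal) {		/* real work always wins */
		r = q->normal;
		q->normal = r->next;
		return r;
	}
	if (q->lowprio) {		/* disk otherwise idle */
		r = q->lowprio;
		q->lowprio = r->next;
		return r;
	}
	return NULL;			/* nothing to do */
}

/* fsync() or rising pressure: fold the low-priority work into the
 * normal queue (ordering left to the elevator in this sketch) */
static void promote_all(struct disk_queues *q)
{
	while (q->lowprio) {
		struct req_stub *r = q->lowprio;
		q->lowprio = r->next;
		r->next = q->normal;
		q->normal = r;
	}
}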

Helge Hafting


* spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-14 20:39         ` John Stoffel
  2001-06-14 20:51           ` Rik van Riel
  2001-06-14 21:33           ` John Stoffel
@ 2001-06-15 15:23           ` Pavel Machek
  2001-06-16 20:50             ` Daniel Phillips
  2001-06-18 20:21             ` spindown Simon Huggins
  2 siblings, 2 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-15 15:23 UTC (permalink / raw)
  To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel

Hi!

> Roger> It does if you are running on a laptop. Then you do not want
> Roger> the pages to go out all the time. The disk has gone to sleep, needs
> Roger> to start up to write a few pages, stays idle for a while, goes to
> Roger> sleep, a few more pages, ...
> 
> That could be handled by a metric which says if the disk is spun down,
> wait until there is more memory pressure before writing.  But if the
> disk is spinning, we don't care, you should start writing out buffers
> at some low rate to keep the pressure from rising too rapidly.  

Notice that write is not free (in terms of power) even if the disk is spinning.
Seeks (etc.) also take some power. And think about flash cards. It certainly
is cheaper than spinning the disk up, but still not free.

Also note that the kernel does not [currently] know that the disk has spun down.
								Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.



* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-15 15:23           ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
@ 2001-06-16 20:50             ` Daniel Phillips
  2001-06-16 21:06               ` Rik van Riel
  2001-06-18 20:21             ` spindown Simon Huggins
  1 sibling, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-16 20:50 UTC (permalink / raw)
  To: Pavel Machek, John Stoffel; +Cc: Roger Larsson, Linux-Kernel

On Friday 15 June 2001 17:23, Pavel Machek wrote:
> Hi!
>
> > Roger> It does if you are running on a laptop. Then you do not want
> > Roger> the pages to go out all the time. The disk has gone to sleep, needs
> > Roger> to start up to write a few pages, stays idle for a while, goes to
> > Roger> sleep, a few more pages, ...
> >
> > That could be handled by a metric which says if the disk is spun down,
> > wait until there is more memory pressure before writing.  But if the
> > disk is spinning, we don't care, you should start writing out buffers
> > at some low rate to keep the pressure from rising too rapidly.
>
> Notice that write is not free (in terms of power) even if the disk is spinning.
> Seeks (etc.) also take some power. And think about flash cards. It certainly
> is cheaper than spinning the disk up, but still not free.
>
> Also note that the kernel does not [currently] know that the disk has spun down.

There's an easy answer that should work well on both servers and laptops, 
that goes something like this: when memory pressure has been brought to 0, if 
there is plenty of disk bandwidth available, continue writeout for a
while and clean some extra pages.  In other words, any episode of pageouts 
is followed immediately by a short episode of preemptive cleaning.

This gives both the preemptive cleaning we want in order to respond to the 
next surge, and lets the laptop disk spin down.  The definition of 'for a 
while' and 'plenty of disk bandwidth' can be tuned, but I don't think either 
is particularly critical.
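
For instance, something as simple as this (the bandwidth test and the
batch cap are placeholders, and "pressure just hit zero" is whatever
the reclaim path decides it is):

/* memory pressure has just dropped back to zero; decide how many extra
 * dirty pages to clean before letting the disk go quiet again */
static int preemptive_clean_batch(int disk_busy_pct, int nr_dirty)
{
	if (disk_busy_pct >= 50)	/* no spare bandwidth: do nothing */
		return 0;
	return nr_dirty < 256 ? nr_dirty : 256;
}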

As a side note, the good old multisecond delay before bdflush kicks in 
doesn't really make a lot of sense - when bandwidth is available the 
filesystem-initiated writeouts should happen right away.

It's not necessary or desirable to write out more dirty pages after the 
machine has been idle for a while, if only because the longer it's idle the 
less the 'surge protection' matters in terms of average throughput.

--
Daniel


* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-16 20:50             ` Daniel Phillips
@ 2001-06-16 21:06               ` Rik van Riel
  2001-06-16 21:25                 ` Rik van Riel
  2001-06-16 21:44                 ` Daniel Phillips
  0 siblings, 2 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-16 21:06 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel

On Sat, 16 Jun 2001, Daniel Phillips wrote:

> In other words, any episode of pageouts is followed immediately by a
> short episode of preemptive cleaning.

linux/mm/vmscan.c::page_launder(), around line 666:
	        /* Let bdflush take care of the rest. */
                wakeup_bdflush(0);


> The definition of 'for a while' and 'plenty of disk bandwidth' can be
> tuned, but I don't think either is particularly critical.

Can be tuned a bit, indeed.

> As a side note, the good old multisecond delay before bdflush kicks in 
> doesn't really make a lot of sense - when bandwidth is available the 
> filesystem-initiated writeouts should happen right away.

... thus spinning up the disk ?

How about just making sure we write out a bigger bunch
of dirty pages whenever one buffer gets too old ?

Does the patch below do anything good for your laptop? ;)

regards,

Rik
--


--- buffer.c.orig	Sat Jun 16 18:05:15 2001
+++ buffer.c	Sat Jun 16 18:05:29 2001
@@ -2550,8 +2550,7 @@
 			   if the current bh is not yet timed out,
 			   then also all the following bhs
 			   will be too young. */
-			if (++flushed > bdf_prm.b_un.ndirty &&
-					time_before(jiffies, bh->b_flushtime))
+			if(time_before(jiffies, bh->b_flushtime))
 				goto out_unlock;
 		} else {
 			if (++flushed > bdf_prm.b_un.ndirty)



* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-16 21:06               ` Rik van Riel
@ 2001-06-16 21:25                 ` Rik van Riel
  2001-06-16 21:44                 ` Daniel Phillips
  1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-16 21:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel

On Sat, 16 Jun 2001, Rik van Riel wrote:


Oops, I did something stupid and the patch is reversed ;)


> --- buffer.c.orig	Sat Jun 16 18:05:15 2001
> +++ buffer.c	Sat Jun 16 18:05:29 2001
> @@ -2550,8 +2550,7 @@
>  			   if the current bh is not yet timed out,
>  			   then also all the following bhs
>  			   will be too young. */
> -			if (++flushed > bdf_prm.b_un.ndirty &&
> -					time_before(jiffies, bh->b_flushtime))
> +			if(time_before(jiffies, bh->b_flushtime))
>  				goto out_unlock;
>  		} else {
>  			if (++flushed > bdf_prm.b_un.ndirty)


Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-16 21:06               ` Rik van Riel
  2001-06-16 21:25                 ` Rik van Riel
@ 2001-06-16 21:44                 ` Daniel Phillips
  2001-06-16 21:54                   ` Rik van Riel
  2001-06-17 10:05                   ` Mike Galbraith
  1 sibling, 2 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-16 21:44 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel

On Saturday 16 June 2001 23:06, Rik van Riel wrote:
> On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > As a side note, the good old multisecond delay before bdflush kicks in
> > doesn't really make a lot of sense - when bandwidth is available the
> > filesystem-initiated writeouts should happen right away.
>
> ... thus spinning up the disk ?

Nope, the disk is already spinning, some other writeouts just finished.

> How about just making sure we write out a bigger bunch
> of dirty pages whenever one buffer gets too old ?

It's simpler than that.  It's basically just: disk traffic low? good, write 
out all the dirty buffers.  Not quite as crude as that, but nearly.

> Does the patch below do anything good for your laptop? ;)

I'll wait for the next one ;-)

--
Daniel


* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-16 21:44                 ` Daniel Phillips
@ 2001-06-16 21:54                   ` Rik van Riel
  2001-06-17 10:28                     ` Daniel Phillips
  2001-06-17 10:05                   ` Mike Galbraith
  1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2001-06-16 21:54 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel

On Sat, 16 Jun 2001, Daniel Phillips wrote:

> > Does the patch below do anything good for your laptop? ;)
> 
> I'll wait for the next one ;-)

OK, here's one which isn't reversed and should work ;))

--- fs/buffer.c.orig	Sat Jun 16 18:05:29 2001
+++ fs/buffer.c	Sat Jun 16 18:05:15 2001
@@ -2550,7 +2550,8 @@
 			   if the current bh is not yet timed out,
 			   then also all the following bhs
 			   will be too young. */
-			if (time_before(jiffies, bh->b_flushtime))
+			if (++flushed > bdf_prm.b_un.ndirty &&
+					time_before(jiffies, bh->b_flushtime))
 				goto out_unlock;
 		} else {
 			if (++flushed > bdf_prm.b_un.ndirty)

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-16 21:44                 ` Daniel Phillips
  2001-06-16 21:54                   ` Rik van Riel
@ 2001-06-17 10:05                   ` Mike Galbraith
  2001-06-17 12:49                     ` (lkml)Re: " thunder7
  2001-06-18 14:22                     ` Daniel Phillips
  1 sibling, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-17 10:05 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
	thunder7, Linux-Kernel

On Sat, 16 Jun 2001, Daniel Phillips wrote:

> On Saturday 16 June 2001 23:06, Rik van Riel wrote:
> > On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > > As a side note, the good old multisecond delay before bdflush kicks in
> > > doesn't really make a lot of sense - when bandwidth is available the
> > > filesystem-initiated writeouts should happen right away.
> >
> > ... thus spinning up the disk ?
>
> Nope, the disk is already spinning, some other writeouts just finished.
>
> > How about just making sure we write out a bigger bunch
> > of dirty pages whenever one buffer gets too old ?
>
> It's simpler than that.  It's basically just: disk traffic low? good, write
> out all the dirty buffers.  Not quite as crude as that, but nearly.
>
> > Does the patch below do anything good for your laptop? ;)
>
> I'll wait for the next one ;-)

Greetings!  (well, not next one, but one anyway)

It _juuust_ so happens that I was tinkering... what do you think of
something like the below?  (and boy do I ever wonder what a certain
box doing slrn stuff thinks of it.. hint hint;)

	-Mike

Doing Bonnie on a big, fragmented 1k-blocksize partition on the worst spot on
the disk.  Bad benchmark, bad conditions.. but interesting results.

2.4.6.pre3 before
    -------Sequential Output-------- ---Sequential Input-- --Random--
    -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
 MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
500  9609 36.0 10569 14.3  3322  6.4  9509 47.6 10597 13.8 101.7  1.4

2.4.6.pre3 after  (using flushto behavior as in defaults)
    -------Sequential Output-------- ---Sequential Input-- --Random--
    -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
 MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
500  8293 30.2 11834 29.4  5072  9.5  8879 44.1 10597 13.6 100.4  0.9


2.4.6.pre3 after  (flushto = ndirty)
 -------Sequential Output-------- ---Sequential Input-- --Random--
 -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
 MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
500 10286 38.4 10715 14.4  3267  6.1  9605 47.6 10596 13.4 102.7  1.6


--- fs/buffer.c.org	Fri Jun 15 06:48:17 2001
+++ fs/buffer.c	Sun Jun 17 09:14:17 2001
@@ -118,20 +118,21 @@
 				wake-cycle */
 		int nrefill; /* Number of clean buffers to try to obtain
 				each time we call refill */
-		int dummy1;   /* unused */
+		int nflushto;   /* Level to flush down to once bdflush starts */
 		int interval; /* jiffies delay between kupdate flushes */
 		int age_buffer;  /* Time for normal buffer to age before we flush it */
 		int nfract_sync; /* Percentage of buffer cache dirty to
 				    activate bdflush synchronously */
-		int dummy2;    /* unused */
+		int nmonitor;    /* Size (%physpages) at which bdflush should
+		          begin monitoring the buffercache */
 		int dummy3;    /* unused */
 	} b_un;
 	unsigned int data[N_PARAM];
-} bdf_prm = {{30, 64, 64, 256, 5*HZ, 30*HZ, 60, 0, 0}};
+} bdf_prm = {{60, 64, 64, 50, 5*HZ, 30*HZ, 85, 15, 0}};

 /* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = {  0,  10,    5,   25,  0,   1*HZ,   0, 0, 0};
-int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,600*HZ, 6000*HZ, 100, 0, 0};
+int bdflush_min[N_PARAM] = {0, 10, 5, 0, 0, 1*HZ, 0, 0, 0};
+int bdflush_max[N_PARAM] = {100,50000, 20000, 100,600*HZ, 6000*HZ, 100, 100, 0};

 /*
  * Rewrote the wait-routines to use the "new" wait-queue functionality,
@@ -763,12 +764,8 @@
 	balance_dirty(NODEV);
 	if (free_shortage())
 		page_launder(GFP_BUFFER, 0);
-	if (!grow_buffers(size)) {
+	if (!grow_buffers(size))
 		wakeup_bdflush(1);
-		current->policy |= SCHED_YIELD;
-		__set_current_state(TASK_RUNNING);
-		schedule();
-	}
 }

 void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1042,25 +1039,43 @@
     1 -> sync flush (wait for I/O completion) */
 int balance_dirty_state(kdev_t dev)
 {
-	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
-
-	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
-	tot = nr_free_buffer_pages();
+	unsigned long dirty, cache, buffers = 0;
+	int i;

-	dirty *= 100;
-	soft_dirty_limit = tot * bdf_prm.b_un.nfract;
-	hard_dirty_limit = tot * bdf_prm.b_un.nfract_sync;
-
-	/* First, check for the "real" dirty limit. */
-	if (dirty > soft_dirty_limit) {
-		if (dirty > hard_dirty_limit)
+	for (i = 0; i < NR_LIST; i++)
+		buffers += size_buffers_type[i];
+	buffers >>= PAGE_SHIFT;
+	if (buffers * 100 < num_physpages * bdf_prm.b_un.nmonitor)
+		return -1;
+
+	buffers *= bdf_prm.b_un.nfract;
+	dirty = 100 * (size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT);
+	cache = atomic_read(&page_cache_size) + nr_free_pages();
+	cache *= bdf_prm.b_un.nfract_sync;
+	if (dirty > buffers) {
+		if (dirty > cache)
 			return 1;
 		return 0;
 	}
-
 	return -1;
 }

+int balance_dirty_done(kdev_t dev)
+{
+	unsigned long dirty, buffers = 0;
+	int i;
+
+	for (i = 0; i < NR_LIST; i++)
+		buffers += size_buffers_type[i];
+	buffers >>= PAGE_SHIFT;
+	buffers *= bdf_prm.b_un.nflushto;
+	dirty = 100 * (size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT);
+
+	if (dirty < buffers)
+		return 1;
+	return 0;
+}
+
 /*
  * if a new dirty buffer is created we need to balance bdflush.
  *
@@ -2528,9 +2543,15 @@
 static int flush_dirty_buffers(int check_flushtime)
 {
 	struct buffer_head * bh, *next;
-	int flushed = 0, i;
+	int flushed = 0, weight = 0, i;

  restart:
+	/*
+	 * If we have a shortage, we have been laundering and reclaiming
+	 * or will be.  In either case, we should adjust flush weight.
+	 */
+	if (!check_flushtime && current->mm)
+		weight += (free_shortage() + inactive_shortage()) >> 4;
 	spin_lock(&lru_list_lock);
 	bh = lru_list[BUF_DIRTY];
 	if (!bh)
@@ -2552,9 +2573,6 @@
 			   will be too young. */
 			if (time_before(jiffies, bh->b_flushtime))
 				goto out_unlock;
-		} else {
-			if (++flushed > bdf_prm.b_un.ndirty)
-				goto out_unlock;
 		}

 		/* OK, now we are committed to write it out. */
@@ -2563,8 +2581,14 @@
 		ll_rw_block(WRITE, 1, &bh);
 		atomic_dec(&bh->b_count);

-		if (current->need_resched)
+		if (++flushed >= bdf_prm.b_un.ndirty + weight ||
+				current->need_resched) {
+			/* kflushd and user tasks return to schedule points. */
+			if (!check_flushtime)
+				return flushed;
+			flushed = 0;
 			schedule();
+		}
 		goto restart;
 	}
  out_unlock:
@@ -2580,8 +2604,14 @@
 	if (waitqueue_active(&bdflush_wait))
 		wake_up_interruptible(&bdflush_wait);

-	if (block)
+	if (block) {
 		flush_dirty_buffers(0);
+		if (current->mm) {
+			current->policy |= SCHED_YIELD;
+			__set_current_state(TASK_RUNNING);
+			schedule();
+		}
+	}
 }

 /*
@@ -2672,7 +2702,7 @@
 int bdflush(void *sem)
 {
 	struct task_struct *tsk = current;
-	int flushed;
+	int flushed, state;
 	/*
 	 *	We have a bare-bones task_struct, and really should fill
 	 *	in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2696,13 +2726,17 @@
 		CHECK_EMERGENCY_SYNC

 		flushed = flush_dirty_buffers(0);
+		state = balance_dirty_state(NODEV);
+		if (state == 1)
+			run_task_queue(&tq_disk);

 		/*
-		 * If there are still a lot of dirty buffers around,
-		 * skip the sleep and flush some more. Otherwise, we
-		 * go to sleep waiting a wakeup.
+		 * If there are still a lot of dirty buffers around, schedule
+		 * and flush some more. Otherwise, go back to sleep.
 		 */
-		if (!flushed || balance_dirty_state(NODEV) < 0) {
+		if (current->need_resched || state == 0)
+			schedule();
+		else if (!flushed || balance_dirty_done(NODEV)) {
 			run_task_queue(&tq_disk);
 			interruptible_sleep_on(&bdflush_wait);
 		}
@@ -2738,7 +2772,11 @@
 		interval = bdf_prm.b_un.interval;
 		if (interval) {
 			tsk->state = TASK_INTERRUPTIBLE;
+sleep:
 			schedule_timeout(interval);
+			/* Get out of the way if kflushd is running. */
+			if (!waitqueue_active(&bdflush_wait))
+				goto sleep;
 		} else {
 		stop_kupdate:
 			tsk->state = TASK_STOPPED;


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-16 21:54                   ` Rik van Riel
@ 2001-06-17 10:28                     ` Daniel Phillips
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-17 10:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel

On Saturday 16 June 2001 23:54, Rik van Riel wrote:
> On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > > Does the patch below do anything good for your laptop? ;)
> >
> > I'll wait for the next one ;-)
>
> OK, here's one which isn't reversed and should work ;))
>
> --- fs/buffer.c.orig	Sat Jun 16 18:05:29 2001
> +++ fs/buffer.c	Sat Jun 16 18:05:15 2001
> @@ -2550,7 +2550,8 @@
>  			   if the current bh is not yet timed out,
>  			   then also all the following bhs
>  			   will be too young. */
> -			if (time_before(jiffies, bh->b_flushtime))
> +			if (++flushed > bdf_prm.b_un.ndirty &&
> +					time_before(jiffies, bh->b_flushtime))
>  				goto out_unlock;
>  		} else {
>  			if (++flushed > bdf_prm.b_un.ndirty)

No, it doesn't, because some way of knowing the disk load is required and 
there's nothing like that here.

There are two components to what I was talking about:

  1) Early flush when load is light
  2) Preemptive cleaning when load is light

Both are supposed to be triggered by other disk activity, swapout or file 
writes, and are supposed to be triggered when the disk activity eases up.

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: (lkml)Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-17 10:05                   ` Mike Galbraith
@ 2001-06-17 12:49                     ` thunder7
  2001-06-17 16:40                       ` Mike Galbraith
  2001-06-18 14:22                     ` Daniel Phillips
  1 sibling, 1 reply; 52+ messages in thread
From: thunder7 @ 2001-06-17 12:49 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, riel

On Sun, Jun 17, 2001 at 12:05:10PM +0200, Mike Galbraith wrote:
> 
> It _juuust_ so happens that I was tinkering... what do you think of
> something like the below?  (and boy do I ever wonder what a certain
> box doing slrn stuff thinks of it.. hint hint;)
> 
I'm sorry to say this box doesn't really think any different of it.

Everything that's in the cache before running slrn on a big group seems
to stay there the whole time, making my active slrn-process use swap.

I applied the patch to 2.4.5-ac15, and this was the result:

   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  1  0  11216   2548 183560 264172   1   4   184   343  123   119   2   6  92
 0  0  0  11212   2620 183444 264184   0   0     4    72  127    99   1   2  97
 0  0  0  11212   1604 183444 264740   0   0   378     0  130   101   2   1  98
 0  1  0  11212   1588 184300 263116   0   0   552  1080  277   360   3  14  83
 2  0  2  11212   1692 174052 270536   0   0  1860     0  596   976   9  50  40
 2  0  2  11212   1588 166732 274816   0   0  1868  5426  643  1050   8  44  48
 0  1  0  11212   1588 163276 276888   0   0  1714  1816  580   972   9  17  74
 0  1  0  11212   1848 166280 273688   0   0   514  3952  301   355   3  40  57
 1  0  0  11212   1592 164232 273872   0   0  1824  3532  632  1083  11  25  64
 2  0  2  11212   1980 167304 268792   0   0  1678     0  550   881   8  51  41
 0  1  2  11212   1588 163908 271356   0   0  1344  4896  508   753   7  26  67
 1  0  0  11212   1588 160896 272756   0   0  1642  1301  574   929   9  22  69
 0  1  0  11212   1592 164936 268632   0   0   756  3594  370   467   6  43  51
 2  0  3  11212   1596 164380 266552   0   0  1904  2392  604  1017  10  52  37
 1  0  0  11212   1592 164752 265844   0   0  1784  2382  623  1000  10  22  69
 0  1  0  11212   1592 168528 262256   0   0   810  4176  364   523   5  43  52
 0  1  1  11212   1992 169324 259504   0   0  1686  3068  578   999  11  42  47
 0  1  0  11212   1588 170696 256332   0   0  1568  1080  532   894  10  20  70
 1  0  0  11212   1592 174876 253036   0   0   598  3600  315   420   4  41  55
 0  1  1  11212   2316 171592 253892   0   0  1816  3286  616  1073   7  29  64
 0  1  0  11212   1588 170380 253968   0   0  1638   840  540   910  13  29  58
 0  1  1  11212   2896 168840 253740   0   0   752  4120  342   458   4  45  51
 0  1  0  11216   2012 166392 255560   0   0  1352  2458  549   895   8  14  77
 2  0  1  11216   1588 170744 250164   0   0  1504  1260  503   791   7  48  45
 0  1  1  11224   1588 170704 249948   0   0   874  4106  516   655   6  10  84
 0  1  0  11228   1588 170148 248988   0   0  1442     0  466   772   8  20  73
 1  0  0  11228   1592 171784 247456   0   0   860  3598  362   495   7  44  48
 0  1  0  11228   1588 171864 246212   0   0  1390  3176  510   840   9  41  50
 0  1  2  11232   1992 170344 245832   0   0  1676  1808  539   898  10  45  45
 1  0  1  10508   1632 168204 246780   0 946  1508  2804  599   920   9  20  71
 0  1  0   9496   2020 168904 244880   0   0   936  3620  417   603   5  35  60
 1  0  0   9604   2516 164096 247536   0   0  1700  2214  563  1085  11  33  56
 0  1  0  16196   1820 162112 255492   0   2  1384  1596  497  1106   8  53  38
 1  0  0  19240   3000 158052 260608   0   0   400  3824  373   388   2  14  84
 1  1  1  28756   4508 146032 278104   0   0  1688  2140  612  1502   7  60  33
 2  0  0  39432  29100 105668 300912   0  18  2108  1178  645  1825  12  52  36
 1  0  0  40668  13024 108568 311748   0   0  1674  4992  623  1017   9  12  79
 0  1  0  45324   3484 105072 326432   0   0  1876  3624  619  1090  13  24  63
 1  0  0  53648   1564 102740 337688   0  18   950  3646  404   857   5  31  63
 2  0  0  53672   1604 103356 335680   0 2962  1436  5864  565   976  10  43  47
 1  0  1  54380   1920 103516 334320   0 1086  1826  1626  590  1072  13  45  42
 0  1  1  54600   6532  99568 333860   0 1006   242  5948  277  2680   2  39  59
 0  1  0  54596   1944 103744 331932   0   0  1854  3644  627  1054  11  16  73
 1  0  0  54592   1924 102876 331100   0 950  1956  2612  621  1173  11  41  48
 1  0  0  54592   1592 103576 329568   0   0  1548  4860  605  1106  11  36  53
 0  1  1  54592   1588 102908 328320   0 452  1808  2522  583  1049  11  51  38
 0  1  1  54592   1588 101916 327076   0 866  1816  1260  589  1046  11  49  40
 0  1  0  54592   2076  99568 327776   0 414   992  5728  459  1314   7  25  67
 0  1  0  54592   1588 103928 323824   0   0   968  3646  403   747   5  33  61
 1  0  0  54592   2632 100108 325136   0 402  1856  2468  622  1369  13  44  42
 0  1  0  54592   1588 101872 322600   0 392  1056  2834  461   802   6  35  60
 1  0  1  55644   1724 102108 322404   0 380  1448  2682  501  1032   9  50  41
 1  1  1  57388   1588 103068 322056   0   0  1384  1396  471   780   8  37  56
 0  1  1  58500   2048 102024 323020   0 368   876  3932  504   755   6  11  83
 1  0  1  65756  18188  85916 330256   0 2298   740  3680  316  1313   5  70  26
 1  1  1  70632  30324  69368 338880   0 1600   650  3804  329   907   4  83  14
 0  1  1  70856   9676  75040 350076   0   0  1872  4394  642  1016  10  16  75
 1  0  0  71136   1564  78716 350192   0   0  2024  3604  669  1131  11  17  72
 0  1  0  71476   1560  82388 342428   0   0  2022  3654  671  1108  13  15  72
 0  1  0  71880   1564  86068 335120   0   0  1742  3620  591   946  11  13  76
 0  1  0  71876   1560  86080 331492   0   0  1630     0  508   861   7  15  79
 1  0  0  72204   1556  89728 328004   0   0   154  3660  360   243   2   5  92
 0  1  0  72204   1560  93404 320364   0   0  1736  3612  609  1044  11  12  76
 2  0  0  72204   1560  93404 316984  68   0  1658     0  473   788  15  35  50
 1  0  0  72204   1588  95688 317020   0 1014  1934  4628  650  1119  10  20  70
 0  1  0  72196   2200  97964 320428   0  38  1660  3642  618   931   7  18  75
 0  1  1  72196   1588 100008 319428   0   0   788  4132  390   594   4  32  64
 2  3  1  72180   1920 101068 318516   0 2604  1818  4388  717  8010   6  44  49
 1  0  1  72180   1648  97756 320368   0 204  1410  3134  595   933  10  17  73
 0  1  1  72868   1588  99064 317864   0 1962  1716  4548  580  1398  10  48  42
 0  1  1  80340   2212  99744 322868   0 884  1610  2670  552  1048  10  55  34
 0  1  0  83392   1588  99792 326128   0   0   324  4620  402   629   2  17  81
 1  0  1  90228   1592  99116 331664   2 2230  1882  3730  616  1067   9  51  39
 3  0  1  95008   1588 102052 331888   0 3754  1440  5916  556  1042   9  62  29
 0  1  2  97784   1588 102432 333648   0 1900   336  5016  366   564   3  41  56
 1  0  1  98360   2828 102744 331796   0 4366   430  6242  376   868   3  62  35
 0  1  0  98384   1588 101656 332828   0   0   338    12  199   223  15  37  48
 0  1  0  98364   1588 102520 331268   0   0  1734  1160  357   421   2   3  94

here slrn starts to sort all the headers just read in.

 1  0  0  98320   1588 102520 331336 3548   0  3812     0  181   189  39   2  59
 0  1  0  98320   1616 102520 332968 3966   0  3966     0  166   176  43   3  54
 1  0  0  98320   1588 100832 335272 4096   0  4128   116  185   221  44   3  53
 1  0  0  98320   2184  97692 338568 4242   0  4274    10  218   305  44   4  52
 1  0  0  98320   1588  96320 341424 3850   2  3882    68  198   269  44  12  44
 0  1  0  98320   1588  95032 343652 3772   0  3772    30  184   236  45   3  52
 1  0  0  98320   1588  92144 347176 4064   0  4096    14  171   204  44   3  53
 1  0  0  98320   2268  89940 349532 4004   0  4036    40  215   275  45   4  51
 1  0  0  98320   2212  89348 350096 252   0   284     4  110    68  51   1  47
 1  0  0  98320   2208  89348 350100   0   0     0    36  111    65  51   1  48

process idle.

The slrn-test I use is to open a very big group (some 150000 headers)
from local spool. This first reads a lot of headers from disk, building
an impressive 100 Mb size of malloc()ed space in memory, then sorts
these headers.

Good luck,
Jurriaan
-- 
BOFH excuse #34:

(l)user error
GNU/Linux 2.4.5-ac15 SMP/ReiserFS 2x1402 bogomips load av: 0.41 0.11 0.03

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: (lkml)Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-17 12:49                     ` (lkml)Re: " thunder7
@ 2001-06-17 16:40                       ` Mike Galbraith
  0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-17 16:40 UTC (permalink / raw)
  To: thunder7; +Cc: linux-kernel, riel

On Sun, 17 Jun 2001 thunder7@xs4all.nl wrote:

> On Sun, Jun 17, 2001 at 12:05:10PM +0200, Mike Galbraith wrote:
> >
> > It _juuust_ so happens that I was tinkering... what do you think of
> > something like the below?  (and boy do I ever wonder what a certain
> > box doing slrn stuff thinks of it.. hint hint;)
> >
> I'm sorry to say this box doesn't really think any different of it.

Well darn.  But..

> Everything that's in the cache before running slrn on a big group seems
> to stay there the whole time, making my active slrn-process use swap.

It should not be the same data if page aging is working at all.  Better
stated, if it _is_ the same data and page aging is working, it's needed
data, so the movement of momentarily unused rss to disk might have been
the right thing to do.. it just has to buy you the use of the pages moved
for long enough to offset the (large) cost of dropping those pages.

I saw it adding rss to the aging pool, but not terribly much IO.  The
fact that it is using page replacement is only interesting in regard to
total system efficiency.

> I applied the patch to 2.4.5-ac15, and this was the result:

<saves vmstat>

Thanks for running it.  Can you (afford to) send me procinfo or such
(what I would like to see is job efficiency) information?  Full logs
are fine, as long as they're not truly huge :)  Anything under a meg
is gratefully accepted (privately 'course).

I think (am pretty darn sure) the aging fairness change is what is
affecting you, but it's not possible to see whether this change is
affecting you in a negative or positive way without timing data.

	-Mike

misc:

wrt this ~patch, it only allows you to move the rolldown to sync disk
behavior some.. moving write delay back some (knob) is _supposed_ to
get that IO load (at least) a modest throughput increase.  The flushto
thing was basically directed toward laptop use, but ~seems to exhibit
better IO clustering/bandwidth sharing as well.  (less old/new request
merging?.. distance?)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-17 10:05                   ` Mike Galbraith
  2001-06-17 12:49                     ` (lkml)Re: " thunder7
@ 2001-06-18 14:22                     ` Daniel Phillips
  2001-06-19  4:35                       ` Mike Galbraith
                                         ` (2 more replies)
  1 sibling, 3 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-18 14:22 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
	thunder7, Linux-Kernel

On Sunday 17 June 2001 12:05, Mike Galbraith wrote:
> It _juuust_ so happens that I was tinkering... what do you think of
> something like the below?  (and boy do I ever wonder what a certain
> box doing slrn stuff thinks of it.. hint hint;)

It's too subtle for me ;-)  (Not shy about saying that because this part of 
the kernel is probably subtle for everyone.)

The question I'm tackling right now is how the system behaves when the load 
goes away, or doesn't get heavy.  Your patch doesn't measure the load 
directly - it may attempt to predict it as a function of memory pressure, but 
that's a little more loosely coupled than what I had in mind.

I'm now in the midst of hatching a patch. [1] The first thing I had to do is 
go explore the block driver code, yum yum.  I found that it already computes 
the statistic I'm interested in, namely queued_sectors, which is used to pace 
the IO on block devices.  It's a little crude - we really want this to be 
per-queue and have one queue per "spindle" - but even in its current form 
it's workable.

The idea is that when queued_sectors drops below some threshold we have 
'unused disk bandwidth' so it would be nice to do something useful with it:

  1) Do an early 'sync_old_buffers'
  2) Do some preemptive pageout

The benefit of (1) is that it lets disks go idle a few seconds earlier, and 
(2) should improve the system's latency in response to load surges.  There 
are drawbacks too, which have been pointed out to me privately, but they tend 
to be pretty minor, for example: on a flash disk you'd do a few extra writes 
and wear it out ever-so-slightly sooner.  All the same, such special devices 
can be dealt with easily once we progress a little further in improving the 
kernel's 'per spindle' intelligence.

Now how to implement this.  I considered putting a (newly minted) 
wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded 
transition, and that's fine except it doesn't do the whole job: we also need 
to have the early flush for any write to a disk file while the disks are 
lightly loaded, i.e., there is no convenient loaded-to-unloaded transition to 
trigger it.  The missing trigger could be inserted into __mark_dirty, but 
that would penalize the loaded state (a little, but that's still too much).  
Furthermore, it's probably desirable to maintain a small delay between the 
dirty and the flush.  So what I'll try first is just running kflush's timer 
faster, and make its reschedule period vary with disk load, i.e., when there 
are fewer queued_sectors, kflush looks at the dirty buffer list more often.

The rest of what has to happen in kflush is pretty straightforward.  It just 
uses queued_sectors to determine how far to walk the dirty buffer list, which 
is maintained in time-since-dirtied order.  If queued_sectors is below some 
threshold the entire list is flushed.  Note that we want to change the sense 
of b_flushtime to b_timedirtied.  It's more efficient to do it this way 
anyway.
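
A user-space toy model of the policy described above, just to make the logic
concrete.  The thresholds, names and numbers below are invented for
illustration; they are not taken from the actual patch:

#include <stdio.h>

#define HZ              100
#define SLOW_INTERVAL   (5 * HZ)   /* usual kupdate period: 5 seconds   */
#define FAST_INTERVAL   (HZ / 10)  /* early-flush period: 0.1 seconds   */
#define IDLE_THRESHOLD  32         /* "disk is idle" below this backlog */

/* Pick how long kupdate should sleep and how old a buffer must be before
 * it gets flushed, given the current I/O backlog in queued sectors. */
static void flush_policy(unsigned long queued_sectors,
                         int *sleep_jiffies, int *min_age)
{
	if (queued_sectors < IDLE_THRESHOLD) {
		*sleep_jiffies = FAST_INTERVAL;
		*min_age = 0;            /* disk idle: flush everything dirty  */
	} else {
		*sleep_jiffies = SLOW_INTERVAL;
		*min_age = 30 * HZ;      /* disk busy: keep the usual 30 s age */
	}
}

int main(void)
{
	int sleep, age;

	flush_policy(0, &sleep, &age);
	printf("idle disk: sleep %d jiffies, flush buffers older than %d\n",
	       sleep, age);
	flush_policy(4096, &sleep, &age);
	printf("busy disk: sleep %d jiffies, flush buffers older than %d\n",
	       sleep, age);
	return 0;
}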

I haven't done anything about preemptive pageout yet, but similar ideas apply.

[1] This is an experiment, do not worry, it will not show up in your tree any 
time soon.  IOW, constructive criticism appreciated, flames copied to 
/dev/null.

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-15 15:23           ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
  2001-06-16 20:50             ` Daniel Phillips
@ 2001-06-18 20:21             ` Simon Huggins
  2001-06-19 10:46               ` spindown Pavel Machek
  1 sibling, 1 reply; 52+ messages in thread
From: Simon Huggins @ 2001-06-18 20:21 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rik van Riel, Daniel Phillips, Linux-Kernel

On Fri, Jun 15, 2001 at 03:23:07PM +0000, Pavel Machek wrote:
> > Roger> It does if you are running on a laptop. Then you do not want
> > Roger> the pages go out all the time. Disk has gone too sleep, needs
> > Roger> to start to write a few pages, stays idle for a while, goes to
> > Roger> sleep, a few more pages, ...
> > That could be handled by a metric which says if the disk is spun
> > down, wait until there is more memory pressure before writing.  But
> > if the disk is spinning, we don't care, you should start writing out
> > buffers at some low rate to keep the pressure from rising too
> > rapidly.  
> Notice that write is not free (in terms of power) even if disk is
> spinning.  Seeks (etc) also take some power. And think about
> flashcards. It certainly is cheaper than spinning disk up but still not
> free.

Isn't this why noflushd exists or is this an evil thing that shouldn't
ever be used and will eventually eat my disks for breakfast?


Description: allow idle hard disks to spin down
 Noflushd is a daemon that spins down disks that have not been read from
 after a certain amount of time, and then prevents disk writes from
 spinning them back up. It's targeted for laptops but can be used on any
 computer with IDE disks. The effect is that the hard disk actually spins
 down, saving you battery power, and shutting off the loudest component of
 most computers.

http://noflushd.sourceforge.net


Simon.

-- 
[ "CATS. CATS ARE NICE." - Death, "Sourcery"                           ]
        Black Cat Networks.  http://www.blackcatnetworks.co.uk/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
  2001-06-18 14:22                     ` Daniel Phillips
@ 2001-06-19  4:35                       ` Mike Galbraith
  2001-06-20  1:50                       ` [RFC] Early flush (was: spindown) Daniel Phillips
  2001-06-20  4:39                       ` Richard Gooch
  2 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-19  4:35 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
	thunder7, Linux-Kernel

On Mon, 18 Jun 2001, Daniel Phillips wrote:

> On Sunday 17 June 2001 12:05, Mike Galbraith wrote:
> > It _juuust_ so happens that I was tinkering... what do you think of
> > something like the below?  (and boy do I ever wonder what a certain
> > box doing slrn stuff thinks of it.. hint hint;)
>
> It's too subtle for me ;-)  (Not shy about saying that because this part of
> the kernel is probably subtle for everyone.)

No subtlety (hammer), it just draws a line that doesn't move around
in unpredictable ways.  For example, nr_free_buffer_pages() adds in
free pages to the line it draws.  You may have a large volume of dirty
data, decide it would be prudent to flush, then someone frees a nice
chunk of memory...  (send morse code messages via malloc/free?:)

Anyway it's crude, but it seems to have gotten results from the slrn
load.  I received logs for ac15 and ac15+patch.  ac15 took 265 seconds
to do the job whereas with the patch it took 227 seconds.  I haven't
pored over the logs yet, but there seems to be throughput to be had.

If anyone is interested in the logs, they're much smaller than expected
-rw-r--r--   1 mikeg    users       11993 Jun 19 05:58 ac15_mike.log
-rw-r--r--   1 mikeg    users       13015 Jun 19 05:58 ac15_org.log

> The question I'm tackling right now is how the system behaves when the load
> goes away, or doesn't get heavy.  Your patch doesn't measure the load
> directly - it may attempt to predict it as a function of memory pressure, but
> that's a little more loosely coupled than what I had in mind.

It doesn't attempt to predict, it reacts to the existing situation.

> I'm now in the midst of hatching a patch. [1] The first thing I had to do is
> go explore the block driver code, yum yum.  I found that it already computes
> the statistic I'm interested in, namely queued_sectors, which is used to pace
> the IO on block devices.  It's a little crude - we really want this to be
> per-queue and have one queue per "spindle" - but even in its current form
> it's workable.
>
> The idea is that when queued_sectors drops below some threshold we have
> 'unused disk bandwidth' so it would be nice to do something useful with it:

(that's much more subtle/clever:)

>   1) Do an early 'sync_old_buffers'
>   2) Do some preemptive pageout
>
> The benefit of (1) is that it lets disks go idle a few seconds earlier, and
> (2) should improve the system's latency in response to load surges.  There
> are drawbacks too, which have been pointed out to me privately, but they tend
> to be pretty minor, for example: on a flash disk you'd do a few extra writes
> and wear it out ever-so-slightly sooner.  All the same, such special devices
> can be dealt easily once we progress a little further in improving the
> kernel's 'per spindle' intelligence.
>
> Now how to implement this.  I considered putting a (newly minted)
> wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded
> transition, and that's fine except it doesn't do the whole job: we also need
> to have the early flush for any write to a disk file while the disks are
> lightly loaded, i.e., there is no convenient loaded-to-unloaded transition to
> trigger it.  The missing trigger could be inserted into __mark_dirty, but
> that would penalize the loaded state (a little, but that's still too much).
> Furthermore, it's probably desirable to maintain a small delay between the
> dirty and the flush.  So what I'll try first is just running kflush's timer
> faster, and make its reschedule period vary with disk load, i.e., when there
> are fewer queued_sectors, kflush looks at the dirty buffer list more often.
>
> The rest of what has to happen in kflush is pretty straightforward.  It just
> uses queued_sectors to determine how far to walk the dirty buffer list, which
> is maintained in time-since-dirtied order.  If queued_sectors is below some
> threshold the entire list is flushed.  Note that we want to change the sense
> of b_flushtime to b_timedirtied.  It's more efficient to do it this way
> anyway.
>
> I haven't done anything about preemptive pageout yet, but similar ideas apply.

Preemptive pageout could simply be walking the dirty list looking for swap
pages and writing them out.  With the fair aging change that's already
in, there will be some.  If the fair aging change to background aging
works out, there will be more (don't want too many more though;).  The
only problem I can see with that simple method is that once written, the
page lands on the inactive_clean list.  That list is short and does get
consumed.. might turn fake pageout into a real one unintentionally.

> [1] This is an experiment, do not worry, it will not show up in your tree any
> time soon.  IOW, constructive criticism appreciated, flames copied to
> /dev/null.

Look forward to seeing it.

	-Mike


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-18 20:21             ` spindown Simon Huggins
@ 2001-06-19 10:46               ` Pavel Machek
  2001-06-20 16:52                 ` spindown Daniel Phillips
  2001-06-21 16:07                 ` spindown Jamie Lokier
  0 siblings, 2 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-19 10:46 UTC (permalink / raw)
  To: Rik van Riel, Daniel Phillips, Linux-Kernel

Hi!

> > > Roger> It does if you are running on a laptop. Then you do not want
> > > Roger> the pages go out all the time. Disk has gone to sleep, needs
> > > Roger> to start to write a few pages, stays idle for a while, goes to
> > > Roger> sleep, a few more pages, ...
> > > That could be handled by a metric which says if the disk is spun
> > > down, wait until there is more memory pressure before writing.  But
> > > if the disk is spinning, we don't care, you should start writing out
> > > buffers at some low rate to keep the pressure from rising too
> > > rapidly.  
> > Notice that write is not free (in terms of power) even if disk is
> > spinning.  Seeks (etc) also take some power. And think about
> > flashcards. It certainly is cheaper than spinning disk up but still not
> > free.
> 
> Isn't this why noflushd exists or is this an evil thing that shouldn't
> ever be used and will eventually eat my disks for breakfast?

It would eat your flash for breakfast. You know, flash memories have
no spinning parts, so there's nothing to spin down.
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC] Early flush (was: spindown)
  2001-06-18 14:22                     ` Daniel Phillips
  2001-06-19  4:35                       ` Mike Galbraith
@ 2001-06-20  1:50                       ` Daniel Phillips
  2001-06-20 20:58                         ` Tom Sightler
  2001-06-20  4:39                       ` Richard Gooch
  2 siblings, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20  1:50 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
	thunder7, Linux-Kernel

I never realized how much I didn't like the good old 5 second delay between 
saving an edit and actually getting it written to disk until it went away.  
Now the question is, did I lose any performance in doing that.  What I wrote 
in the previous email turned out to be pretty accurate, so I'll just quote it 
to keep it together with the patch:

> I'm now in the midst of hatching a patch. [1] The first thing I had to do
> is go explore the block driver code, yum yum.  I found that it already
> computes the statistic I'm interested in, namely queued_sectors, which is
> used to pace the IO on block devices.  It's a little crude - we really want
> this to be per-queue and have one queue per "spindle" - but even in its
> current form it's workable.
>
> The idea is that when queued_sectors drops below some threshold we have
> 'unused disk bandwidth' so it would be nice to do something useful with it:
>
>   1) Do an early 'sync_old_buffers'
>   2) Do some preemptive pageout
>
> The benefit of (1) is that it lets disks go idle a few seconds earlier, and
> (2) should improve the system's latency in response to load surges.  There
> are drawbacks too, which have been pointed out to me privately, but they
> tend to be pretty minor, for example: on a flash disk you'd do a few extra
> writes and wear it out ever-so-slightly sooner.  All the same, such special
> devices can be dealt with easily once we progress a little further in improving
> the kernel's 'per spindle' intelligence.
>
> Now how to implement this.  I considered putting a (newly minted)
> wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded
> transition, and that's fine except it doesn't do the whole job: we also
> need to have the early flush for any write to a disk file while the disks
> are lightly loaded, i.e., there is no convenient loaded-to-unloaded
> transition to trigger it.  The missing trigger could be inserted into
> __mark_dirty, but that would penalize the loaded state (a little, but
> that's still too much). Furthermore, it's probably desirable to maintain a
> small delay between the dirty and the flush.  So what I'll try first is
> just running kflush's timer faster, and make its reschedule period vary
> with disk load, i.e., when there are fewer queued_sectors, kflush looks at
> the dirty buffer list more often.
>
> The rest of what has to happen in kflush is pretty straightforward.  It
> just uses queued_sectors to determine how far to walk the dirty buffer
> list, which is maintained in time-since-dirtied order.  If queued_sectors
> is below some threshold the entire list is flushed.  Note that we want to
> change the sense of b_flushtime to b_timedirtied.  It's more efficient to
> do it this way anyway.
>
> I haven't done anything about preemptive pageout yet, but similar ideas
> apply.
>
> [1] This is an experiment, do not worry, it will not show up in your tree
> any time soon.  IOW, constructive criticism appreciated, flames copied to
> /dev/null.

I originally intended to implement a sliding flush delay based on disk load.  
This turned out to be a lot of work for a hard-to-discern benefit.  So the 
current approach has just two delays: .1 second and whatever the bdflush 
delay is set to.  If there is any non-flush disk traffic the longer delay is 
used.  This is crude but effective... I think.  I hope that somebody will run 
this through some benchmarks to see if I lost any performance.  According to 
my calculations, I did not.  I tested this mainly in UML, and also ran it 
briefly on my laptop.  The interactive feel of the change is immediately 
obvious, and for me at least, a big improvement.

The patch is against 2.4.5.  To apply:

  cd /your/source/tree
  patch <this/patch -p0

--- ../uml.2.4.5.clean/fs/buffer.c	Sat May 26 02:57:46 2001
+++ ./fs/buffer.c	Wed Jun 20 01:55:21 2001
@@ -1076,7 +1076,7 @@
 
 static __inline__ void __mark_dirty(struct buffer_head *bh)
 {
-	bh->b_flushtime = jiffies + bdf_prm.b_un.age_buffer;
+	bh->b_dirtytime = jiffies;
 	refile_buffer(bh);
 }
 
@@ -2524,12 +2524,20 @@
    as all dirty buffers lives _only_ in the DIRTY lru list.
    As we never browse the LOCKED and CLEAN lru lists they are infact
    completly useless. */
-static int flush_dirty_buffers(int check_flushtime)
+static int flush_dirty_buffers (int update)
 {
 	struct buffer_head * bh, *next;
 	int flushed = 0, i;
+	unsigned queued = atomic_read (&queued_sectors);
+	unsigned long youngest_to_update;
 
- restart:
+#ifdef DEBUG
+	if (update)
+		printk("kupdate %lu %i\n", jiffies, queued);
+#endif
+
+restart:
+	youngest_to_update = jiffies - (queued? bdf_prm.b_un.age_buffer: 0);
 	spin_lock(&lru_list_lock);
 	bh = lru_list[BUF_DIRTY];
 	if (!bh)
@@ -2544,19 +2552,14 @@
 		if (buffer_locked(bh))
 			continue;
 
-		if (check_flushtime) {
-			/* The dirty lru list is chronologically ordered so
-			   if the current bh is not yet timed out,
-			   then also all the following bhs
-			   will be too young. */
-			if (time_before(jiffies, bh->b_flushtime))
+		if (update) {
+			if (time_before (youngest_to_update, bh->b_dirtytime))
 				goto out_unlock;
 		} else {
 			if (++flushed > bdf_prm.b_un.ndirty)
 				goto out_unlock;
 		}
 
-		/* OK, now we are committed to write it out. */
 		atomic_inc(&bh->b_count);
 		spin_unlock(&lru_list_lock);
 		ll_rw_block(WRITE, 1, &bh);
@@ -2717,7 +2720,7 @@
 int kupdate(void *sem)
 {
 	struct task_struct * tsk = current;
-	int interval;
+	int update_when = 0;
 
 	tsk->session = 1;
 	tsk->pgrp = 1;
@@ -2733,11 +2736,11 @@
 	up((struct semaphore *)sem);
 
 	for (;;) {
-		/* update interval */
-		interval = bdf_prm.b_un.interval;
-		if (interval) {
+		unsigned check_interval = HZ/10, update_interval = bdf_prm.b_un.interval;
+		
+		if (update_interval) {
 			tsk->state = TASK_INTERRUPTIBLE;
-			schedule_timeout(interval);
+			schedule_timeout(check_interval);
 		} else {
 		stop_kupdate:
 			tsk->state = TASK_STOPPED;
@@ -2756,10 +2759,15 @@
 			if (stopped)
 				goto stop_kupdate;
 		}
+		update_when -= check_interval;
+		if (update_when > 0 && atomic_read(&queued_sectors))
+			continue;
+
 #ifdef DEBUG
 		printk("kupdate() activated...\n");
 #endif
 		sync_old_buffers();
+		update_when = update_interval;
 	}
 }
 
--- ../uml.2.4.5.clean/include/linux/fs.h	Sat May 26 03:01:28 2001
+++ ./include/linux/fs.h	Tue Jun 19 15:12:18 2001
@@ -236,7 +236,7 @@
 	atomic_t b_count;		/* users using this block */
 	kdev_t b_rdev;			/* Real device */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
-	unsigned long b_flushtime;	/* Time when (dirty) buffer should be written */
+	unsigned long b_dirtytime;	/* Time buffer became dirty */
 
 	struct buffer_head *b_next_free;/* lru/free list linkage */
 	struct buffer_head *b_prev_free;/* doubly linked list of buffers */
--- ../uml.2.4.5.clean/mm/filemap.c	Thu May 31 15:29:06 2001
+++ ./mm/filemap.c	Tue Jun 19 15:32:47 2001
@@ -349,7 +349,7 @@
 		if (buffer_locked(bh) || !buffer_dirty(bh) || !buffer_uptodate(bh))
 			continue;
 
-		bh->b_flushtime = jiffies;
+		bh->b_dirtytime = jiffies /*- bdf_prm.b_un.age_buffer*/; // needed??
 		ll_rw_block(WRITE, 1, &bh);	
 	} while ((bh = bh->b_this_page) != head);
 	return 0;
--- ../uml.2.4.5.clean/mm/highmem.c	Sat May 26 02:57:46 2001
+++ ./mm/highmem.c	Tue Jun 19 15:33:22 2001
@@ -400,7 +400,7 @@
 	bh->b_rdev = bh_orig->b_rdev;
 	bh->b_state = bh_orig->b_state;
 #ifdef HIGHMEM_DEBUG
-	bh->b_flushtime = jiffies;
+	bh->b_dirtytime = jiffies /*- bdf_prm.b_un.age_buffer*/; // needed??
 	bh->b_next_free = NULL;
 	bh->b_prev_free = NULL;
 	/* bh->b_this_page */
 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-18 14:22                     ` Daniel Phillips
  2001-06-19  4:35                       ` Mike Galbraith
  2001-06-20  1:50                       ` [RFC] Early flush (was: spindown) Daniel Phillips
@ 2001-06-20  4:39                       ` Richard Gooch
  2001-06-20 14:29                         ` Daniel Phillips
  2001-06-20 16:12                         ` Richard Gooch
  2 siblings, 2 replies; 52+ messages in thread
From: Richard Gooch @ 2001-06-20  4:39 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

Daniel Phillips writes:
> I never realized how much I didn't like the good old 5 second delay
> between saving an edit and actually getting it written to disk until
> it went away.  Now the question is, did I lose any performance in
> doing that.  What I wrote in the previous email turned out to be
> pretty accurate, so I'll just quote it

Starting I/O immediately if there is no load sounds nice. However,
what about the other case, when the disc is already spun down (and
hence there's no I/O load either)? I want the system to avoid doing
writes while the disc is spun down. I'm quite happy for the system to
accumulate dirtied pages/buffers, reclaiming clean pages as needed,
until it absolutely has to start writing out (or I call sync(2)).

Right now I hack that by setting bdflush parameters to 5 minutes. But
that's not ideal either.
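
For what it's worth, a minimal sketch of that kind of hack in C.  It assumes a
2.4 kernel with HZ=100 and the nine-parameter /proc/sys/vm/bdflush layout shown
in the bdf_prm struct earlier in this thread; the values below simply push the
kupdate interval and the buffer age out to five minutes, as an example rather
than the exact settings used here:

#include <stdio.h>

int main(void)
{
	/* nfract ndirty nrefill dummy interval age_buffer nfract_sync dummy dummy
	 * 30000 jiffies = 5 minutes at HZ=100; the rest are the usual defaults. */
	const char *params = "30 64 64 256 30000 30000 60 0 0\n";
	FILE *f = fopen("/proc/sys/vm/bdflush", "w");

	if (!f) {
		perror("/proc/sys/vm/bdflush");
		return 1;
	}
	fputs(params, f);
	fclose(f);
	return 0;
}

(The same thing can of course be done by echoing that string into the proc
file as root.)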

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20  4:39                       ` Richard Gooch
@ 2001-06-20 14:29                         ` Daniel Phillips
  2001-06-20 16:12                         ` Richard Gooch
  1 sibling, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 14:29 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

On Wednesday 20 June 2001 06:39, Richard Gooch wrote:
> Daniel Phillips writes:
> > I never realized how much I didn't like the good old 5 second delay
> > between saving an edit and actually getting it written to disk until
> > it went away.  Now the question is, did I lose any performance in
> > doing that.  What I wrote in the previous email turned out to be
> > pretty accurate, so I'll just quote it
>
> Starting I/O immediately if there is no load sounds nice. However,
> what about the other case, when the disc is already spun down (and
> hence there's no I/O load either)? I want the system to avoid doing
> writes while the disc is spun down. I'm quite happy for the system to
> accumulate dirtied pages/buffers, reclaiming clean pages as needed,
> until it absolutely has to start writing out (or I call sync(2)).

I'd like that too, but what about sync writes?  As things stand now, there is 
no option but to spin the disk back up.  To get around this we'd have to 
change the basic behavior of the block device and that's doable, but it's an 
entirely different proposition than the little patch above.

You know about this project no doubt:

   http://noflushd.sourceforge.net/

This is really complementary to what I did.  Lightweight is not really a good 
way to describe it though, the tar is almost 10,000 lines long.  There is 
probably a clever thing to do at the kernel level to shorten that up.

There's one thing I think I can help fix up while I'm working in here, this 
complaint: 

    Reiserfs journaling bypasses the kernel's delayed write mechanisms and    
    writes straight to disk.

We need to address the reasons why such filesystems have to bypass kupdate.  
This touches on how sync and fsync work, updating supers, flushing the inode 
cache etc, but with Al Viro's superblock work merged now we could start 
thinking about it.

> Right now I hack that by setting bdflush parameters to 5 minutes. But
> that's not ideal either.

Yes, that still works with my patch.  The noflushd user space daemon works by 
turning off kupdate (set update time to 0).

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20  4:39                       ` Richard Gooch
  2001-06-20 14:29                         ` Daniel Phillips
@ 2001-06-20 16:12                         ` Richard Gooch
  2001-06-22 23:25                           ` Daniel Kobras
  2001-06-25 11:31                           ` Pavel Machek
  1 sibling, 2 replies; 52+ messages in thread
From: Richard Gooch @ 2001-06-20 16:12 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

Daniel Phillips writes:
> On Wednesday 20 June 2001 06:39, Richard Gooch wrote:
> > Starting I/O immediately if there is no load sounds nice. However,
> > what about the other case, when the disc is already spun down (and
> > hence there's no I/O load either)? I want the system to avoid doing
> > writes while the disc is spun down. I'm quite happy for the system to
> > accumulate dirtied pages/buffers, reclaiming clean pages as needed,
> > until it absolutely has to start writing out (or I call sync(2)).
> 
> I'd like that too, but what about sync writes?  As things stand now,
> there is no option but to spin the disk back up.  To get around this
> we'd have to change the basic behavior of the block device and
> that's doable, but it's an entirely different proposition than the
> little patch above.

I don't care as much about sync writes. They don't seem to happen very
often on my boxes.

> You know about this project no doubt:
> 
>    http://noflushd.sourceforge.net/

Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
.h files! As you say, not really lightweight. There must be a better
way. Also, I suspect (without having looked at the code) that it
doesn't handle memory pressure well. Things may get nasty when we run
low on free pages.

				Regards,

					Richard....
Permanent: rgooch@atnf.csiro.au
Current:   rgooch@ras.ucalgary.ca

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-19 10:46               ` spindown Pavel Machek
@ 2001-06-20 16:52                 ` Daniel Phillips
  2001-06-20 17:32                   ` spindown Rik van Riel
  2001-06-21 16:07                 ` spindown Jamie Lokier
  1 sibling, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 16:52 UTC (permalink / raw)
  To: Pavel Machek, Rik van Riel, Linux-Kernel

On Tuesday 19 June 2001 12:46, Pavel Machek wrote:
> > > > Roger> It does if you are running on a laptop. Then you do not want
> > > > Roger> the pages go out all the time. Disk has gone to sleep, needs
> > > > Roger> to start to write a few pages, stays idle for a while, goes to
> > > > Roger> sleep, a few more pages, ...
> > > > That could be handled by a metric which says if the disk is spun
> > > > down, wait until there is more memory pressure before writing.  But
> > > > if the disk is spinning, we don't care, you should start writing out
> > > > buffers at some low rate to keep the pressure from rising too
> > > > rapidly.
> > >
> > > Notice that write is not free (in terms of power) even if disk is
> > > spinning.  Seeks (etc) also take some power. And think about
> > > flashcards. It certainly is cheaper than spinning disk up but still not
> > > free.
> >
> > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > ever be used and will eventually eat my disks for breakfast?
>
> It would eat your flash for breakfast. You know, flash memories have
> no spinning parts, so there's nothing to spin down.

Yes, this doesn't make sense for flash, and in fact, it doesn't make sense to 
have just one set of bdflush parameters for the whole system, it's really a 
property of the individual device.  So the thing to do is for me to go kibitz 
on the io layer rewrite projects and figure out how to set up the 
intelligence per-queue, and have the queues per-device, at which point it's 
trivial to do the write^H^H^H^H^H right thing for each kind of device.
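
Purely as a sketch of what "per-device flush parameters" could look like as a
data structure -- the field names and numbers are invented for illustration
and exist nowhere in the kernel:

#include <stdio.h>

#define HZ 100

struct flush_params {
	unsigned int age_buffer;  /* jiffies a buffer may stay dirty      */
	unsigned int interval;    /* jiffies between flush-daemon wakeups */
};

/* Hypothetical per-device tunings: gentle on a laptop disk that likes to
 * spin down, very lazy on a flash card that dislikes extra writes. */
static const struct flush_params laptop_disk = {  30 * HZ,  5 * HZ };
static const struct flush_params flash_card  = { 300 * HZ, 60 * HZ };

int main(void)
{
	printf("laptop disk: age %u, interval %u\n",
	       laptop_disk.age_buffer, laptop_disk.interval);
	printf("flash card:  age %u, interval %u\n",
	       flash_card.age_buffer, flash_card.interval);
	return 0;
}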

BTW, with nominal 100,000 erases you have to write 10 terabytes to your 100 
meg flash disk before you'll see it start to degrade.  These devices are set 
up to avoid continuous hammering on the same page, and to take failed 
pages out of the pool as soon as they fail to erase.  Also, the 100,000 
figure is nominal - the average number of erases you'll get per page is 
considerably higher.  The extra few sectors we see with the early flush patch 
are just not going to affect the life of your flash to a measurable degree.
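
As a quick check of that figure, assuming even wear-leveling and 1 TB = 10^6 MB:

  $100\,\mathrm{MB} \times 100{,}000 \text{ erase cycles} = 10^{7}\,\mathrm{MB} = 10\,\mathrm{TB}$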

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-20 16:52                 ` spindown Daniel Phillips
@ 2001-06-20 17:32                   ` Rik van Riel
  2001-06-20 18:00                     ` spindown Daniel Phillips
  0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2001-06-20 17:32 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Pavel Machek, Linux-Kernel

On Wed, 20 Jun 2001, Daniel Phillips wrote:

> BTW, with nominal 100,000 erases you have to write 10 terabytes
> to your 100 meg flash disk before you'll see it start to
> degrade.

That assumes you write out full blocks.  If you flush after
every byte written you'll hit the limit a lot sooner ;)

Btw, this is also a problem with your patch, when you write
out buffers all the time your disk will spend more time seeking
all over the place (moving the disk head away from where we are
currently reading!) and you'll end up writing the same block
multiple times ...

regards,

Rik
--
Executive summary of a recent Microsoft press release:
   "we are concerned about the GNU General Public License (GPL)"


		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-20 17:32                   ` spindown Rik van Riel
@ 2001-06-20 18:00                     ` Daniel Phillips
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 18:00 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Pavel Machek, Linux-Kernel

On Wednesday 20 June 2001 19:32, Rik van Riel wrote:
> On Wed, 20 Jun 2001, Daniel Phillips wrote:
> > BTW, with nominal 100,000 erases you have to write 10 terabytes
> > to your 100 meg flash disk before you'll see it start to
> > degrade.
>
> That assumes you write out full blocks.  If you flush after
> every byte written you'll hit the limit a lot sooner ;)

Yep, so if you are running on a Yopy, try not to sync after each byte.

> Btw, this is also a problem with your patch, when you write
> out buffers all the time your disk will spend more time seeking
> all over the place (moving the disk head away from where we are
> currently reading!) and you'll end up writing the same block
> multiple times ...

It doesn't work that way; it tacks the flush onto the trailing edge of a 
burst of disk activity, or it flushes out an isolated update, say an edit 
save, which would have required the same amount of disk activity, just a few 
seconds off in the future.  Sometimes it does write a few extra sectors when 
disk activity is sporadic, but the impact on total throughput is small enough 
to be hard to measure reliably.  Even so, there is some optimizing that could 
be done - the update could be interleaved a little better with the falling 
edge of a heavy traffic episode.  This would require that the io rate be 
monitored instead of just the queue backlog.  I'm interested in tackling that 
eventually - it has applications in other areas than just the early update.

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20  1:50                       ` [RFC] Early flush (was: spindown) Daniel Phillips
@ 2001-06-20 20:58                         ` Tom Sightler
  2001-06-20 22:09                           ` Daniel Phillips
  2001-06-24  3:20                           ` Anuradha Ratnaweera
  0 siblings, 2 replies; 52+ messages in thread
From: Tom Sightler @ 2001-06-20 20:58 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

Quoting Daniel Phillips <phillips@bonn-fries.net>:

> I originally intended to implement a sliding flush delay based on disk
> load.  This turned out to be a lot of work for a hard-to-discern
> benefit.  So the current approach has just two delays: .1 second and
> whatever the bdflush delay is set to.  If there is any non-flush disk
> traffic the longer delay is used.  This is crude but effective... I
> think.  I hope that somebody will run this through some benchmarks to
> see if I lost any performance.  According to my calculations, I did
> not.  I tested this mainly in UML, and also ran it briefly on my
> laptop.  The interactive feel of the change is immediately obvious,
> and for me at least, a big improvement.


Well, since a lot of this discussion seemed to spin off from my original posting
last week about my particular issue with disk flushing I decided to try your
patch with my simple test/problem that I experience on my laptop.

One note, I ran your patch against 2.4.6-pre3 as that is what currently performs
the best on my laptop.  It seems to apply cleanly and compiled without problems.

I used this kernel on my laptop all day for my normal workload, which consists
of a Gnome 1.4 desktop, several Mozilla instances, several ssh sessions with
remote X programs displayed, StarOffice, and VMware (running Windows 2000 Pro
in 128MB).  I also performed several compiles throughout the day.  Overall the
machine feels slightly more sluggish, I think due to the following two things:

1.  When running a compile, or anything else that produces lots of small disk
writes, you tend to get lots of little pauses for all the little writes to disk.
These seem to be unnoticeable without the patch.

2.  Loading programs when writing activity is occurring (even light activity
like during the compile) is noticeably slower; actually, any reading from disk is.

I also ran my simple ftp test that produced the symptom I reported earlier.  I
transferred a 750MB file via FTP, and with your patch sure enough disk writing
started almost immediately, but it still didn't seem to write enough data to
disk to keep up with the transfer, so at approximately the 200MB mark the old
behavior still kicked in and it went into full flush mode; during that time
network activity halted, just like before.  The big difference between running
with the patch and without it is that the patched kernel never seems to balance
out: without the patch, once the initial burst is done you get a nice stream of
data from the network to disk with the disk staying moderately active.  With
the patch the disk varies from barely active to moderate to heavy and back, and
during the heavy periods the network transfer always pauses (although very
briefly).

Just my observations, you asked for comments.

Later,
Tom


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20 20:58                         ` Tom Sightler
@ 2001-06-20 22:09                           ` Daniel Phillips
  2001-06-24  3:20                           ` Anuradha Ratnaweera
  1 sibling, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 22:09 UTC (permalink / raw)
  To: Tom Sightler
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

On Wednesday 20 June 2001 22:58, Tom Sightler wrote:
> Quoting Daniel Phillips <phillips@bonn-fries.net>:
> > I originally intended to implement a sliding flush delay based on disk
> > load.  This turned out to be a lot of work for a hard-to-discern
> > benefit.  So the current approach has just two delays: .1 second and
> > whatever the bdflush delay is set to.  If there is any non-flush disk
> > traffic the longer delay is used.  This is crude but effective... I
> > think.  I hope that somebody will run this through some benchmarks to
> > see if I lost any performance.  According to my calculations, I did
> > not.  I tested this mainly in UML, and also ran it briefly on my
> > laptop.  The interactive feel of the change is immediately obvious,
> > and for me at least, a big improvement.
>
> Well, since a lot of this discussion seemed to spin off from my original
> posting last week about my particular issue with disk flushing I decided to
> try your patch with my simple test/problem that I experience on my laptop.
>
> One note, I ran your patch against 2.4.6-pre3 as that is what currently
> performs the best on my laptop.  It seems to apply cleanly and compiled
> without problems.
>
> I used this kernel on my laptop all day for my normal workload, which
> consists of a Gnome 1.4 desktop, several Mozilla instances, several ssh
> sessions with remote X programs displayed, StarOffice, and VMware
> (running Windows 2000 Pro in 128MB).  I also performed several compiles
> throughout the day.  Overall the machine feels slightly more sluggish, I
> think due to the following two things:
>
> 1.  When running a compile, or anything else that produces lots of small
> disk writes, you tend to get lots of little pauses for all the little
> writes to disk.  These seem to be unnoticeable without the patch.

OK, this is because the early flush doesn't quit when load picks up again.  
Measuring only the io backlog, as I do now, isn't adequate for telling the 
difference between load initiated by the flush itself and other load, such as 
a cpu-bound process proceeding to read another file, so that's why the flush 
doesn't stop flushing when other IO starts happening.  This has to be fixed.

In the meantime, you could try this simple tweak: just set the lower bound, 
currently 1/10th of a second, a little higher:

-               unsigned check_interval = HZ/10, ...
+               unsigned check_interval = HZ/5, ...

This may be enough to bridge the little pauses in the compiler's disk 
access pattern so the flush isn't triggered.  (This is not by any means a 
nice solution.)  If you set check_interval to HZ*5, you *should* get exactly 
the old behaviour, I'd be very interested to hear if you do.

Also, could you do your compiles with 'time' so you can quantify the results?

> 2.  Loading programs when writing activity is occurring (even light activity
> like during the compile) is noticeably slower; actually, any reading from
> disk is.

Hmm, let me think why that may be.  The loader doesn't actually read the 
program into memory, it just maps it and lets the pages fault in as they're 
called for.  So if readahead isn't perfect (it isn't) the io backlog may drop 
to 0 briefly just as the kflush decides to sample it, and it initiates a 
flush.  This flush cleans the whole dirty list out, stealing bandwidth from 
the reads.

> I also ran my simple ftp test that produced the symptom I reported earlier.
> I transferred a 750MB file via FTP, and with your patch sure enough disk
> writing started almost immediately, but it still didn't seem to write
> enough data to disk to keep up with the transfer, so at approximately the
> 200MB mark the old behavior still kicked in and it went into full flush
> mode; during that time network activity halted, just like before.  The big
> difference between running with the patch and without it is that the
> patched kernel never seems to balance out: without the patch, once the
> initial burst is done you get a nice stream of data from the network to
> disk with the disk staying moderately active.  With the patch the disk
> varies from barely active to moderate to heavy and back, and during the
> heavy periods the network transfer always pauses (although very briefly).
>
> Just my observations, you asked for comments.

Yes, I have to refine this.  The inner flush loop has to know how many IO
submissions are happening, from which it can subtract its own submissions and
know somebody else is submitting IO, at which point it can fall back to the
good old 5 second buffer age limit.  False positives from kflush are handled
as a fringe benefit, and flush_dirty_buffers won't do extra writeout.  This
is easy and cheap.
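
Something along the lines of this toy model (every name is invented and the
real bookkeeping would live down in the block layer, but it shows the idea):

/* "Subtract my own submissions": the flush loop keeps early-flushing only
 * while it is the sole source of I/O; as soon as the global submission
 * count grows by more than what the loop itself submitted, it backs off
 * to the normal 5 second buffer age limit.
 */
#include <stdio.h>

static unsigned long total_submissions;   /* bumped on every submission */

static void submit_io(const char *who)
{
        total_submissions++;
        printf("%s submitted a request\n", who);
}

static void early_flush_pass(void)
{
        unsigned long before = total_submissions;
        unsigned long mine = 0;

        submit_io("flush daemon");        /* write out an old buffer */
        mine++;

        submit_io("some other process");  /* foreign I/O arrives meanwhile */

        if (total_submissions - before > mine)
                printf("foreign I/O seen: fall back to 5s age limit\n");
        else
                printf("disk otherwise idle: keep early-flushing\n");
}

int main(void)
{
        early_flush_pass();
        return 0;
}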

I could get a lot fancier than this and calculate IO load averages, but I'd 
only do that after mining out the simple possibilities.  I'll probably have 
something new for you to try tomorrow, if you're willing.  By the way, I'm 
not addressing your fundamental problem, that's Rik's job ;-).  In fact, I 
define success in this effort by the extent to which I don't affect behaviour 
under load.

Oh, and I'd better finish configuring my kernel and boot my laptop with this, 
i.e., eat my own dogfood ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-19 10:46               ` spindown Pavel Machek
  2001-06-20 16:52                 ` spindown Daniel Phillips
@ 2001-06-21 16:07                 ` Jamie Lokier
  2001-06-22 22:09                   ` spindown Daniel Kobras
  2001-06-28  0:27                   ` spindown Troy Benjegerdes
  1 sibling, 2 replies; 52+ messages in thread
From: Jamie Lokier @ 2001-06-21 16:07 UTC (permalink / raw)
  To: Pavel Machek; +Cc: Rik van Riel, Daniel Phillips, Linux-Kernel

Pavel Machek wrote:
> > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > ever be used and will eventually eat my disks for breakfast?
> 
> It would eat your flash for breakfast. You know, flash memories have
> no spinning parts, so there's nothing to spin down.

Btw Pavel, does noflushd work with 2.4.4?  The noflushd version 2.4 I
tried said it couldn't find some kernel process (kflushd?  I don't
remember) and that I should use bdflush.  The manual says that's
appropriate for older kernels, but not 2.4.4 surely.

-- Jamie

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-21 16:07                 ` spindown Jamie Lokier
@ 2001-06-22 22:09                   ` Daniel Kobras
  2001-06-28  0:27                   ` spindown Troy Benjegerdes
  1 sibling, 0 replies; 52+ messages in thread
From: Daniel Kobras @ 2001-06-22 22:09 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Pavel Machek, Linux-Kernel

On Thu, Jun 21, 2001 at 06:07:01PM +0200, Jamie Lokier wrote:
> Pavel Machek wrote:
> > > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > > ever be used and will eventually eat my disks for breakfast?
> > 
> > It would eat your flash for breakfast. You know, flash memories have
> > no spinning parts, so there's nothing to spin down.
> 
> Btw Pavel, does noflushd work with 2.4.4?  The noflushd version 2.4 I
> tried said it couldn't find some kernel process (kflushd?  I don't
> remember) and that I should use bdflush.  The manual says that's
> appropriate for older kernels, but not 2.4.4 surely.

That's because of my favourite change from the 2.4.3 patch:

-       strcpy(tsk->comm, "kupdate");
+       strcpy(tsk->comm, "kupdated");

noflushd 2.4 fixed this issue in the daemon itself, but I had forgotten about 
the generic startup script.  (RPMs and debs run their customized versions.)

Either the current version from CVS, or

ed /your/init.d/location/noflushd << EOF
%s/kupdate/kupdated/g
w
q
EOF

should get you going.

Regards,

Daniel.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20 16:12                         ` Richard Gooch
@ 2001-06-22 23:25                           ` Daniel Kobras
  2001-06-23  5:10                             ` Daniel Phillips
  2001-06-25 11:31                           ` Pavel Machek
  1 sibling, 1 reply; 52+ messages in thread
From: Daniel Kobras @ 2001-06-22 23:25 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Daniel Phillips, Mike Galbraith, Rik van Riel, Pavel Machek,
	John Stoffel, Roger Larsson, thunder7, Linux-Kernel

On Wed, Jun 20, 2001 at 10:12:38AM -0600, Richard Gooch wrote:
> Daniel Phillips writes:
> > I'd like that too, but what about sync writes?  As things stand now,
> > there is no option but to spin the disk back up.  To get around this
> > we'd have to change the basic behavior of the block device and
> > that's doable, but it's an entirely different proposition than the
> > little patch above.
> 
> I don't care as much about sync writes. They don't seem to happen very
> often on my boxes.

syslog and some editors are the most common users of sync writes. vim, e.g.,
by default keeps fsync()ing its swapfile. By tweaking the configuration of
these apps, this can be prevented fairly easily. Changing sync semantics for
this, on the other hand, seems pretty awkward to me. I'd expect an
application calling fsync() to have a good reason for having its data flushed
to disk _now_, no matter what state the disk happens to be in. If it hasn't,
fix the app, not the kernel.
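
To be concrete, the pattern I'm defending is nothing exotic -- just the
standard call (a trivial example, not lifted from any particular app):

/* An application that really wants its data on the platter before it
 * proceeds: write(2) alone only dirties the page cache, while fsync(2)
 * blocks until the device has the data -- which on a spun-down laptop
 * disk necessarily means spinning it back up.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char msg[] = "important transaction record\n";
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, msg, strlen(msg)) != (ssize_t) strlen(msg)) {
                perror("write");
                return 1;
        }
        if (fsync(fd) < 0) {            /* the call under discussion */
                perror("fsync");
                return 1;
        }
        close(fd);
        return 0;
}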

> > You know about this project no doubt:
> > 
> >    http://noflushd.sourceforge.net/
> 
> Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
> .h files! As you say, not really lightweight. There must be a better
> way.

noflushd would benefit a lot from being able to set bdflush parameters per
device or per disk. So I'm really eager to see what Daniel comes up with.
Currently, we can only turn kupdate either on or off as a whole, which means
that noflushd implements a crude replacement for the benefit of multi-disk
setups. A lot of the cruft stems from there.

> Also, I suspect (without having looked at the code) that it
> doesn't handle memory pressure well. Things may get nasty when we run
> low on free pages.

It doesn't handle memory pressure at all. It doesn't have to. noflushd only
messes with kupdate{,d} but leaves bdflush (formerly known as kflushd) alone.
If memory gets tight, bdflush starts writing out dirty buffers, which makes the
disk spin up, and we're back to normal.

Regards,

Daniel.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-22 23:25                           ` Daniel Kobras
@ 2001-06-23  5:10                             ` Daniel Phillips
  2001-06-25 11:33                               ` Pavel Machek
  0 siblings, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-23  5:10 UTC (permalink / raw)
  To: Daniel Kobras, Richard Gooch, Jens Axboe
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

On Saturday 23 June 2001 01:25, Daniel Kobras wrote:
> On Wed, Jun 20, 2001 at 10:12:38AM -0600, Richard Gooch wrote:
> > Daniel Phillips writes:
> > > I'd like that too, but what about sync writes?  As things stand now,
> > > there is no option but to spin the disk back up.  To get around this
> > > we'd have to change the basic behavior of the block device and
> > > that's doable, but it's an entirely different proposition than the
> > > little patch above.
> >
> > I don't care as much about sync writes. They don't seem to happen very
> > often on my boxes.
>
> > syslog and some editors are the most common users of sync writes. vim,
> > e.g., by default keeps fsync()ing its swapfile. By tweaking the
> > configuration of these apps, this can be prevented fairly easily. Changing
> > sync semantics for this, on the other hand, seems pretty awkward to me. I'd
> > expect an application calling fsync() to have a good reason for having its
> > data flushed to disk _now_, no matter what state the disk happens to be in.
> > If it hasn't, fix the app, not the kernel.

But apps shouldn't have to know about the special requirements of laptops.  
I've been playing a little with the idea of creating a special block device 
for laptops that goes between the vfs and the real block device, and adds the 
behaviour of being able to buffer writes in memory.  In all respects it would 
seem to the vfs to be a disk.  So far this is just a thought experiment.

> > > You know about this project no doubt:
> > >
> > >    http://noflushd.sourceforge.net/
> >
> > Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
> > .h files! As you say, not really lightweight. There must be a better
> > way.
>
> noflushd would benefit a lot from being able to set bdflush parameters per
> device or per disk. So I'm really eager to see what Daniel comes up with.
> Currently, we can only turn kupdate either on or off as a whole, which
> means that noflushd implements a crude replacement for the benefit of
> multi-disk setups. A lot of the cruft stems from there.

Yes, another person to talk to about this is Jens Axboe, who has been doing 
some serious hacking on the block layer.  I thought I'd get the early flush 
patch working well for one disk before generalizing to N ;-)

> > Also, I suspect (without having looked at the code) that it
> > doesn't handle memory pressure well. Things may get nasty when we run
> > low on free pages.
>
> It doesn't handle memory pressure at all. It doesn't have to. noflushd only
> messes with kupdate{,d} but leaves bdflush (formerly known as kflushd)
> alone. If memory gets tight, bdflush starts writing out dirty buffers,
> which makes the disk spin up, and we're back to normal.

Exactly.  And in addition, when bdflush does wake up, I try to get kupdate 
out of the way as much as possible, though I've been following the 
traditional recipe and having it submit all buffers past a certain age.  This 
is quite possibly a bad thing to do because it could starve the swapper.  
Ouch.

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20 20:58                         ` Tom Sightler
  2001-06-20 22:09                           ` Daniel Phillips
@ 2001-06-24  3:20                           ` Anuradha Ratnaweera
  2001-06-24 11:14                             ` Daniel Phillips
  2001-06-24 15:06                             ` Rik van Riel
  1 sibling, 2 replies; 52+ messages in thread
From: Anuradha Ratnaweera @ 2001-06-24  3:20 UTC (permalink / raw)
  To: Tom Sightler
  Cc: Daniel Phillips, Mike Galbraith, Rik van Riel, Pavel Machek,
	John Stoffel, Roger Larsson, thunder7, Linux-Kernel


On Wed, Jun 20, 2001 at 04:58:51PM -0400, Tom Sightler wrote:
> 
> 1.  When running a compile, or anything else that produces lots of small disk
> writes, you tend to get lots of little pauses for all the little writes to
> disk.  These seem to be unnoticeable without the patch.
> 
> 2.  Loading programs when writing activity is occurring (even light activity
> like during the compile) is noticeably slower; actually, any reading from
> disk is.
> 
> I also ran my simple ftp test that produced the symptom I reported earlier.  I
> transferred a 750MB file via FTP, and with your patch sure enough disk writing
> started almost immediately, but it still didn't seem to write enough data to
> disk to keep up with the transfer, so at approximately the 200MB mark the old
> behavior still kicked in as it went into full flush mode, during which time
> network activity halted, just like before.

It is not uncommon to have a large number of tmp files on the disk(s) (Rik also
pointed this out somewhere early in the original thread) and it is sensible to
keep all of them in buffers if RAM is sufficient.  Transferring _very_ large
files is not _that_ common, so why shouldn't that case be handled from user
space by calling sync(2)?
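
For example, a bulk copy could keep its own dirty data bounded with something
like the sketch below (fdatasync() per chunk rather than one global sync(2),
but the idea is the same; the chunk size is arbitrary):

/* Copy src to dst, forcing the data out every CHUNK bytes so that no
 * more than one chunk of dirty pages ever piles up in the cache.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)         /* flush every 8MB written */

int main(int argc, char **argv)
{
        char buf[64 * 1024];
        long since_sync = 0;
        ssize_t n;
        int in, out;

        if (argc != 3) {
                fprintf(stderr, "usage: %s src dst\n", argv[0]);
                return 1;
        }
        in = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) {
                perror("open");
                return 1;
        }
        while ((n = read(in, buf, sizeof buf)) > 0) {
                if (write(out, buf, n) != n) {
                        perror("write");
                        return 1;
                }
                since_sync += n;
                if (since_sync >= CHUNK) {
                        fdatasync(out); /* or sync(), as suggested above */
                        since_sync = 0;
                }
        }
        close(in);
        close(out);
        return 0;
}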

Anuradha

-- 

Debian GNU/Linux (kernel 2.4.6-pre5)

Keep cool, but don't freeze.
		-- Hellman's Mayonnaise


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-24  3:20                           ` Anuradha Ratnaweera
@ 2001-06-24 11:14                             ` Daniel Phillips
  2001-06-24 15:06                             ` Rik van Riel
  1 sibling, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-24 11:14 UTC (permalink / raw)
  To: Anuradha Ratnaweera, Tom Sightler
  Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
	Roger Larsson, thunder7, Linux-Kernel

On Sunday 24 June 2001 05:20, Anuradha Ratnaweera wrote:
> On Wed, Jun 20, 2001 at 04:58:51PM -0400, Tom Sightler wrote:
> > 1.  When running a compile, or anything else that produces lots of small
> > disk writes, you tend to get lots of little pauses for all the little
> > writes to disk.  These seem to be unnoticeable without the patch.
> >
> > 2.  Loading programs when writing activity is occurring (even light
> > activity like during the compile) is noticeably slower; actually, any
> > reading from disk is.
> >
> > I also ran my simple ftp test that produced the symptom I reported
> > earlier.  I transferred a 750MB file via FTP, and with your patch sure
> > enough disk writing started almost immediately, but it still didn't seem
> > to write enough data to disk to keep up with the transfer, so at
> > approximately the 200MB mark the old behavior still kicked in as it went
> > into full flush mode, during which time network activity halted, just like
> > before.
>
> It is not uncommon to have a large number of tmp files on the disk(s) (Rik
> also pointed this out somewhere early in the original thread) and it is
> sensible to keep all of them in buffers if RAM is sufficient.  Transferring
> _very_ large files is not _that_ common, so why shouldn't that case be
> handled from user space by calling sync(2)?

The patch you're discussing has been superseded - check my "[RFC] Early 
flush: new, improved" post from yesterday.  This addresses the problem of 
handling tmp files efficiently while still having the early flush.

The latest patch shows no degradation at all for compilation, which uses lots 
of temporary files.

--
Daniel 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-24  3:20                           ` Anuradha Ratnaweera
  2001-06-24 11:14                             ` Daniel Phillips
@ 2001-06-24 15:06                             ` Rik van Riel
  2001-06-24 16:21                               ` Daniel Phillips
  1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2001-06-24 15:06 UTC (permalink / raw)
  To: Anuradha Ratnaweera
  Cc: Tom Sightler, Daniel Phillips, Mike Galbraith, Pavel Machek,
	John Stoffel, Roger Larsson, thunder7, Linux-Kernel

On Sun, 24 Jun 2001, Anuradha Ratnaweera wrote:

> It is not uncommon to have a large number of tmp files on the disk(s)
> (Rik also pointed this out somewhere early in the original thread) and
> it is sensible to keep all of them in buffers if RAM is sufficient.
> Transferring _very_ large files is not _that_ common, so why shouldn't
> that case be handled from user space by calling sync(2)?

Wait a moment.

The only observed bad case I've heard about here is
that of large files being written out.

It should be easy enough to just trigger writeout of
pages of an inode once that inode has more than a
certain number of dirty pages in RAM ... say, something
like freepages.high?
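
Very roughly -- and every name below is made up for illustration, this is
not 2.4 code -- the check could sit in the write path like this:

/* Toy model: once a single inode accumulates more than some threshold of
 * dirty pages, start writing that inode's pages out instead of letting
 * them pile up.  The struct and helpers are invented stand-ins.
 */
#include <stdio.h>

#define DIRTY_THRESHOLD 1024            /* pages; think freepages.high-ish */

struct toy_inode {
        unsigned long nr_dirty_pages;
};

static void writeout_inode_pages(struct toy_inode *inode)
{
        printf("flushing %lu dirty pages of this inode\n",
               inode->nr_dirty_pages);
        inode->nr_dirty_pages = 0;
}

static void mark_page_dirty(struct toy_inode *inode)
{
        inode->nr_dirty_pages++;
        if (inode->nr_dirty_pages > DIRTY_THRESHOLD)
                writeout_inode_pages(inode);    /* per-inode early writeback */
}

int main(void)
{
        struct toy_inode big_download = { 0 };
        unsigned long page;

        /* simulate a large sequential write: 200MB of 4K pages */
        for (page = 0; page < 200UL * 1024 * 1024 / 4096; page++)
                mark_page_dirty(&big_download);
        return 0;
}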

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-24 15:06                             ` Rik van Riel
@ 2001-06-24 16:21                               ` Daniel Phillips
  0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-24 16:21 UTC (permalink / raw)
  To: Rik van Riel, Anuradha Ratnaweera
  Cc: Tom Sightler <ttsig@tuxyturvy.com>, Mike Galbraith,
	Pavel Machek, John Stoffel, Roger Larsson, thunder7,
	Linux-Kernel

On Sunday 24 June 2001 17:06, Rik van Riel wrote:
> On Sun, 24 Jun 2001, Anuradha Ratnaweera wrote:
> > It is not uncommon to have a large number of tmp files on the disk(s)
> > (Rik also pointed this out somewhere early in the original thread) and
> > it is sensible to keep all of them in buffers if RAM is sufficient.
> > Transferring _very_ large files is not _that_ common, so why shouldn't
> > that case be handled from user space by calling sync(2)?
>
> Wait a moment.
>
> The only observed bad case I've heard about here is
> that of large files being written out.

But that's not the only advantage of doing the early update:

  - Early spindown for laptops
  - Improved latency under some conditions
  - Improved throughput for some loads
  - Improved filesystem safety

> It should be easy enough to just trigger writeout of
> pages of an inode once that inode has more than a
> certain amount of dirty pages in RAM ... say, something
> like freepages.high ?

The inode dirty page list is not sorted by "time dirtied" so you would be 
eroding the system's ability to ensure that dirty file buffers never get 
older than X.

--
Daniel

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-20 16:12                         ` Richard Gooch
  2001-06-22 23:25                           ` Daniel Kobras
@ 2001-06-25 11:31                           ` Pavel Machek
  1 sibling, 0 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-25 11:31 UTC (permalink / raw)
  To: Richard Gooch
  Cc: Daniel Phillips, Mike Galbraith, Rik van Riel, Pavel Machek,
	John Stoffel, Roger Larsson, thunder7, Linux-Kernel

Hi!

> > You know about this project no doubt:
> > 
> >    http://noflushd.sourceforge.net/
> 
> Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
> .h files! As you say, not really lightweight. There must be a better
> way. Also, I suspect (without having looked at the code) that it
> doesn't handle memory pressure well. Things may get nasty when we run
> low on free pages.

Noflushd *is* lightweight. It is complicated because it has to know
about different kernel versions etc. It is "easy stuff". If you add
kernel support, it will only *add* lines to noflushd.
								Pavel
-- 
The best software in life is free (not shareware)!		Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [RFC] Early flush (was: spindown)
  2001-06-23  5:10                             ` Daniel Phillips
@ 2001-06-25 11:33                               ` Pavel Machek
  0 siblings, 0 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-25 11:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Daniel Kobras, Richard Gooch, Jens Axboe, Mike Galbraith,
	Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
	thunder7, Linux-Kernel

Hi!

> > > > I'd like that too, but what about sync writes?  As things stand now,
> > > > there is no option but to spin the disk back up.  To get around this
> > > > we'd have to change the basic behavior of the block device and
> > > > that's doable, but it's an entirely different proposition than the
> > > > little patch above.
> > >
> > > I don't care as much about sync writes. They don't seem to happen very
> > > often on my boxes.
> >
> > syslog and some editors are the most common users of sync writes. vim,
> > e.g., per default keeps fsync()ing its swapfile. Tweaking the configuration
> > of these apps, this can be prevented fairly easy though. Changing sync
> > semantics for this matter on the other hand seems pretty awkward to me. I'd
> > expect an application calling fsync() to have good reason for having its
> > data flushed to disk _now_, no matter what state the disk happens to be in.
> > If it hasn't, fix the app, not the kernel.
> 
> But apps shouldn't have to know about the special requirements of
> laptops.  

If an app does fsync(), it hopefully knows what it is doing. [Random apps
should not really do sync even on normal systems -- it hurts
performance.]
								Pavel
-- 
The best software in life is free (not shareware)!		Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: spindown
  2001-06-21 16:07                 ` spindown Jamie Lokier
  2001-06-22 22:09                   ` spindown Daniel Kobras
@ 2001-06-28  0:27                   ` Troy Benjegerdes
  1 sibling, 0 replies; 52+ messages in thread
From: Troy Benjegerdes @ 2001-06-28  0:27 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Pavel Machek, Rik van Riel, Daniel Phillips, Linux-Kernel

On Thu, Jun 21, 2001 at 06:07:01PM +0200, Jamie Lokier wrote:
> Pavel Machek wrote:
> > > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > > ever be used and will eventually eat my disks for breakfast?
> > 
> > It would eat your flash for breakfast. You know, flash memories have
> > no spinning parts, so there's nothing to spin down.
> 
> Btw Pavel, does noflushd work with 2.4.4?  The noflushd version 2.4 I
> tried said it couldn't find some kernel process (kflushd?  I don't
> remember) and that I should use bdflush.  The manual says that's
> appropriate for older kernels, but not 2.4.4 surely.

Yes, noflushd works with 2.4.x. I'm running it on an ibook with 
debian-unstable.

And as a word of warning: while running noflushd, make sure you 'sync' a 
few times after an 'apt-get dist-upgrade' that upgrades damn near 
everything before doing something that crashes the kernel. This WILL eat 
your ext2fs for breakfast.

-- 
Troy Benjegerdes | master of mispeeling | 'da hozer' |  hozer@drgw.net
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's 
why I draw cartoons. It's my life." -- Charles Shulz

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2001-06-28  0:28 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-06-13 19:31 2.4.6-pre2, pre3 VM Behavior Tom Sightler
2001-06-13 20:21 ` Rik van Riel
2001-06-14  1:49   ` Tom Sightler
2001-06-14  3:16     ` Rik van Riel
2001-06-14  7:59       ` Laramie Leavitt
2001-06-14  9:24         ` Helge Hafting
2001-06-14 17:38           ` Mark Hahn
2001-06-15  8:27             ` Helge Hafting
2001-06-14  8:47       ` Daniel Phillips
2001-06-14 20:23         ` Roger Larsson
2001-06-15  6:04           ` Mike Galbraith
2001-06-14 20:39         ` John Stoffel
2001-06-14 20:51           ` Rik van Riel
2001-06-14 21:33           ` John Stoffel
2001-06-14 22:23             ` Rik van Riel
2001-06-15 15:23           ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
2001-06-16 20:50             ` Daniel Phillips
2001-06-16 21:06               ` Rik van Riel
2001-06-16 21:25                 ` Rik van Riel
2001-06-16 21:44                 ` Daniel Phillips
2001-06-16 21:54                   ` Rik van Riel
2001-06-17 10:28                     ` Daniel Phillips
2001-06-17 10:05                   ` Mike Galbraith
2001-06-17 12:49                     ` (lkml)Re: " thunder7
2001-06-17 16:40                       ` Mike Galbraith
2001-06-18 14:22                     ` Daniel Phillips
2001-06-19  4:35                       ` Mike Galbraith
2001-06-20  1:50                       ` [RFC] Early flush (was: spindown) Daniel Phillips
2001-06-20 20:58                         ` Tom Sightler
2001-06-20 22:09                           ` Daniel Phillips
2001-06-24  3:20                           ` Anuradha Ratnaweera
2001-06-24 11:14                             ` Daniel Phillips
2001-06-24 15:06                             ` Rik van Riel
2001-06-24 16:21                               ` Daniel Phillips
2001-06-20  4:39                       ` Richard Gooch
2001-06-20 14:29                         ` Daniel Phillips
2001-06-20 16:12                         ` Richard Gooch
2001-06-22 23:25                           ` Daniel Kobras
2001-06-23  5:10                             ` Daniel Phillips
2001-06-25 11:33                               ` Pavel Machek
2001-06-25 11:31                           ` Pavel Machek
2001-06-18 20:21             ` spindown Simon Huggins
2001-06-19 10:46               ` spindown Pavel Machek
2001-06-20 16:52                 ` spindown Daniel Phillips
2001-06-20 17:32                   ` spindown Rik van Riel
2001-06-20 18:00                     ` spindown Daniel Phillips
2001-06-21 16:07                 ` spindown Jamie Lokier
2001-06-22 22:09                   ` spindown Daniel Kobras
2001-06-28  0:27                   ` spindown Troy Benjegerdes
2001-06-14 15:10       ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
2001-06-14 18:25         ` Daniel Phillips
2001-06-14  8:30   ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).