* 2.4.6-pre2, pre3 VM Behavior
@ 2001-06-13 19:31 Tom Sightler
2001-06-13 20:21 ` Rik van Riel
0 siblings, 1 reply; 52+ messages in thread
From: Tom Sightler @ 2001-06-13 19:31 UTC (permalink / raw)
To: Linux-Kernel
Hi All,
I have been using the 2.4.x kernels since the 2.4.0-test days on my Dell 5000e
laptop with 320MB of RAM and have experienced first hand many of the problems
other users have reported with the VM system in 2.4. Most of these problems
have been only minor annoyances, and I have continued testing kernels throughout
the 2.4 series, mostly without noticing much change.
With 2.4.6-pre2, and -pre3 I can say that I have seen a marked improvement on my
machine, especially in interactive response, for my day to day workstation uses.
However, I do have one observation that seems rather strange, or at least wrong.
I, on occasion, have the need to transfer relatively large files (750MB-1GB)
from our larger Linux servers to my machine. I usually use ftp to transfer
these files and this is where I notice the following:
1. Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
close to wire speed). At this point Linux has yet to write the first byte to
disk. OK, that might be an exaggeration, but very little disk activity has
occurred on my laptop.
2. Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
maybe I should do that" then the hard drive light comes on solid for several
seconds. During this time the ftp transfer drops to about 1/5 of the original
speed.
3. After the initial burst of data is written things seem much more reasonable,
and data streams to the disk almost continually while the rest of the transfer
completes at near full speed again.
Basically, it seems the kernel buffers all of the incoming file up to nearly
available memory before it begins to panic and starts flushing the file to disk.
It seems it should start to lazy write somewhat earlier. Perhaps some of this
is tuneable from userland and I just don't know how.
This was much less noticeable on a server with a much faster SCSI hard disk
subsystem as it took significantly less time to flush the information to the
disk once it finally started, but laptop hard drives are traditionally poor
performers, and at 15MB/s it takes 10-15 seconds before things stabilize, just
from transferring a file.
Anyway, things are still much better; with older kernels the machine would
almost seem locked up during those 10-15 seconds, but now my apps stay fairly
responsive (I can still type in AbiWord, browse in Mozilla, etc).
Later,
Tom
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-13 19:31 2.4.6-pre2, pre3 VM Behavior Tom Sightler
@ 2001-06-13 20:21 ` Rik van Riel
2001-06-14 1:49 ` Tom Sightler
2001-06-14 8:30 ` Mike Galbraith
0 siblings, 2 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-13 20:21 UTC (permalink / raw)
To: Tom Sightler; +Cc: Linux-Kernel
On Wed, 13 Jun 2001, Tom Sightler wrote:
> 1. Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> close to wire speed). At this point Linux has yet to write the first byte to
> disk. OK, that might be an exaggeration, but very little disk activity has
> occurred on my laptop.
>
> 2. Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> maybe I should do that" then the hard drive light comes on solid for several
> seconds. During this time the ftp transfer drops to about 1/5 of the original
> speed.
>
> 3. After the initial burst of data is written things seem much more reasonable,
> and data streams to the disk almost continually while the rest of the transfer
> completes at near full speed again.
>
> Basically, it seems the kernel buffers all of the incoming file up to nearly
> available memory before it begins to panic and starts flushing the file to disk.
> It seems it should start to lazy write somewhat earlier.
> Perhaps some of this is tuneable from userland and I just don't
> know how.
Actually, it already does the lazy write earlier.
The page reclaim code scans up to 1/4th of the inactive_dirty
pages on the first loop, where it does NOT write things to
disk.
On the second loop, we start asynchronous writeout of data
to disk and scan up to 1/2 of the inactive_dirty pages,
trying to find clean pages to free.
Only when there simply are no clean pages do we resort to
synchronous IO and the system will wait for pages to be
cleaned.
After the initial burst, the system should stabilise,
starting the writeout of pages before we run low on
memory. How to handle the initial burst is something
I haven't figured out yet ... ;)
> Anyway, things are still much better, with older kernels things
> would almost seem locked up during those 10-15 seconds but now
> my apps stay fairly responsive (I can still type in AbiWord,
> browse in Mozilla, etc).
This is due to this smarter handling of the flushing of
dirty pages and due to a more subtle bug where the system
ended up doing synchronous IO on too many pages, whereas
now it only does synchronous IO on _1_ page per scan ;)
regards,
Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-13 20:21 ` Rik van Riel
@ 2001-06-14 1:49 ` Tom Sightler
2001-06-14 3:16 ` Rik van Riel
2001-06-14 8:30 ` Mike Galbraith
1 sibling, 1 reply; 52+ messages in thread
From: Tom Sightler @ 2001-06-14 1:49 UTC (permalink / raw)
To: Rik van Riel; +Cc: Tom Sightler, Linux-Kernel
Quoting Rik van Riel <riel@conectiva.com.br>:
> After the initial burst, the system should stabilise,
> starting the writeout of pages before we run low on
> memory. How to handle the initial burst is something
> I haven't figured out yet ... ;)
Well, at least I know that this is expected with the VM, although I do still
think this is bad behavior. If my disk is idle why would I wait until I have
greater than 100MB of data to write before I finally start actually moving some
data to disk?
> This is due to this smarter handling of the flushing of
> dirty pages and due to a more subtle bug where the system
> ended up doing synchronous IO on too many pages, whereas
> now it only does synchronous IO on _1_ page per scan ;)
And this is definitely a noticeable fix, thanks for your continued work. I know
it's hard to get everything balanced out right, and I only wrote this email to
describe some behavior I was seeing and make sure it was expected in the current
VM. You've let me know that it is, and it's really minor compared to problems
some of the earlier kernels had.
Later,
Tom
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 1:49 ` Tom Sightler
@ 2001-06-14 3:16 ` Rik van Riel
2001-06-14 7:59 ` Laramie Leavitt
` (2 more replies)
0 siblings, 3 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-14 3:16 UTC (permalink / raw)
To: Tom Sightler; +Cc: Linux-Kernel
On Wed, 13 Jun 2001, Tom Sightler wrote:
> Quoting Rik van Riel <riel@conectiva.com.br>:
>
> > After the initial burst, the system should stabilise,
> > starting the writeout of pages before we run low on
> > memory. How to handle the initial burst is something
> > I haven't figured out yet ... ;)
>
> Well, at least I know that this is expected with the VM, although I do
> still think this is bad behavior. If my disk is idle why would I wait
> until I have greater than 100MB of data to write before I finally
> start actually moving some data to disk?
The file _could_ be a temporary file, which gets removed
before we'd get around to writing it to disk. Sure, the
chances of this happening with a single file are close to
zero, but having 100MB from 200 different temp files on a
shell server isn't unreasonable to expect.
> > This is due to this smarter handling of the flushing of
> > dirty pages and due to a more subtle bug where the system
> > ended up doing synchronous IO on too many pages, whereas
> > now it only does synchronous IO on _1_ page per scan ;)
>
> And this is definitely a noticeable fix, thanks for your continued
> work. I know it's hard to get everything balanced out right, and I
> only wrote this email to describe some behavior I was seeing and make
> sure it was expected in the current VM. You've let me know that it
> is, and it's really minor compared to problems some of the earlier
> kernels had.
I'll be sure to keep this problem in mind. I really want
to fix it, I just haven't figured out how yet ;)
Maybe we should just see if anything in the first few MB
of inactive pages was freeable, limiting the first scan to
something like 1 or maybe even 5 MB maximum (freepages.min?
freepages.high?) and flushing as soon as we find more unfreeable
pages than that?
Maybe another solution would be bdflush tuning?
I'll send a patch as soon as I figure this one out...
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* RE: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 3:16 ` Rik van Riel
@ 2001-06-14 7:59 ` Laramie Leavitt
2001-06-14 9:24 ` Helge Hafting
2001-06-14 8:47 ` Daniel Phillips
2001-06-14 15:10 ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
2 siblings, 1 reply; 52+ messages in thread
From: Laramie Leavitt @ 2001-06-14 7:59 UTC (permalink / raw)
To: Linux-Kernel
On Behalf Of Rik van Riel
> On Wed, 13 Jun 2001, Tom Sightler wrote:
> > Quoting Rik van Riel <riel@conectiva.com.br>:
> >
> > > After the initial burst, the system should stabilise,
> > > starting the writeout of pages before we run low on
> > > memory. How to handle the initial burst is something
> > > I haven't figured out yet ... ;)
> >
> > Well, at least I know that this is expected with the VM, although I do
> > still think this is bad behavior. If my disk is idle why would I wait
> > until I have greater than 100MB of data to write before I finally
> > start actually moving some data to disk?
>
> The file _could_ be a temporary file, which gets removed
> before we'd get around to writing it to disk. Sure, the
> chances of this happening with a single file are close to
> zero, but having 100MB from 200 different temp files on a
> shell server isn't unreasonable to expect.
>
> > > This is due to this smarter handling of the flushing of
> > > dirty pages and due to a more subtle bug where the system
> > > ended up doing synchronous IO on too many pages, whereas
> > > now it only does synchronous IO on _1_ page per scan ;)
> >
> > And this is definitely a noticeable fix, thanks for your continued
> > work. I know it's hard to get everything balanced out right, and I
> > only wrote this email to describe some behavior I was seeing and make
> > sure it was expected in the current VM. You've let me know that it
> > is, and it's really minor compared to problems some of the earlier
> > kernels had.
>
> I'll be sure to keep this problem in mind. I really want
> to fix it, I just haven't figured out how yet ;)
>
> Maybe we should just see if anything in the first few MB
> of inactive pages was freeable, limiting the first scan to
> something like 1 or maybe even 5 MB maximum (freepages.min?
> freepages.high?) and flushing as soon as we find more unfreeable
> pages than that ?
>
Would it be possible to maintain a dirty-rate count
for the dirty buffers?
For example, it is possible to figure an approximate
disk subsystem speed from most of the given information.
If it is possible to know the rate at which new buffers
are being dirtied then we could compare that to the available
memory and the disk speed to calculate some maintainable
rate at which buffers need to be expired. The rates would
have to maintain some historical data to account for
bursty data...
It may be possible to use a very similar mechanism to do
both. I.e. not actually calculate the rate from the hardware,
but use a similar counter for the expiry rate of buffers.
I don't know how difficult the accounting would be
but it seems possible to make it self-tuning.
This is a little different than just keeping a list
of dirty buffers and free buffers because you have
the rate information which tells you how long you
have until all the buffers expire.
Laramie.
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-13 20:21 ` Rik van Riel
2001-06-14 1:49 ` Tom Sightler
@ 2001-06-14 8:30 ` Mike Galbraith
1 sibling, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-14 8:30 UTC (permalink / raw)
To: Rik van Riel; +Cc: Tom Sightler, Linux-Kernel
On Wed, 13 Jun 2001, Rik van Riel wrote:
> On Wed, 13 Jun 2001, Tom Sightler wrote:
>
> > 1. Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> > close to wire speed). At this point Linux has yet to write the first byte to
> > disk. OK, this might be an exaggerated, but very little disk activity has
> > occured on my laptop.
> >
> > 2. Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> > maybe I should do that" then the hard drive light comes on solid for several
> > seconds. During this time the ftp transfer drops to about 1/5 of the original
> > speed.
> >
> > 3. After the initial burst of data is written things seem much more reasonable,
> > and data streams to the disk almost continually while the rest of the transfer
> > completes at near full speed again.
> >
> > Basically, it seems the kernel buffers all of the incoming file up to nearly
> > available memory before it begins to panic and starts flushing the file to disk.
> > It seems it should start to lazy write somewhat earlier.
> > Perhaps some of this is tuneable from userland and I just don't
> > know how.
>
> Actually, it already does the lazy write earlier.
>
> The page reclaim code scans up to 1/4th of the inactive_dirty
> pages on the first loop, where it does NOT write things to
> disk.
I've done some experiments with a _clean_ shortage. Requiring that a
portion of inactive pages be pre-cleaned improves response as you start
reclaiming. Even though you may have enough inactive pages total, you
know that laundering is needed before things get heavy. This gets the
dirty pages moving a little sooner. As you're reclaiming pages, writes
trickle out whether your dirty list is short or long. (and if I'd been
able to make that idea work a little better, you'd have seen my mess;)
-Mike
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 3:16 ` Rik van Riel
2001-06-14 7:59 ` Laramie Leavitt
@ 2001-06-14 8:47 ` Daniel Phillips
2001-06-14 20:23 ` Roger Larsson
2001-06-14 20:39 ` John Stoffel
2001-06-14 15:10 ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
2 siblings, 2 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-14 8:47 UTC (permalink / raw)
To: Rik van Riel, Tom Sightler; +Cc: Linux-Kernel
On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> On Wed, 13 Jun 2001, Tom Sightler wrote:
> > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > After the initial burst, the system should stabilise,
> > > starting the writeout of pages before we run low on
> > > memory. How to handle the initial burst is something
> > > I haven't figured out yet ... ;)
> >
> > Well, at least I know that this is expected with the VM, although I do
> > still think this is bad behavior. If my disk is idle why would I wait
> > until I have greater than 100MB of data to write before I finally
> > start actually moving some data to disk?
>
> The file _could_ be a temporary file, which gets removed
> before we'd get around to writing it to disk. Sure, the
> chances of this happening with a single file are close to
> zero, but having 100MB from 200 different temp files on a
> shell server isn't unreasonable to expect.
This still doesn't make sense if the disk bandwidth isn't being used.
> Maybe we should just see if anything in the first few MB
> of inactive pages was freeable, limiting the first scan to
> something like 1 or maybe even 5 MB maximum (freepages.min?
> freepages.high?) and flushing as soon as we find more unfreeable
> pages than that ?
There are two cases, file-backed and swap-backed pages.
For file-backed pages what we want is pretty simple: when 1) disk bandwidth
is less than xx% used 2) memory pressure is moderate, just submit whatever's
dirty. As pressure increases and bandwidth gets loaded up (including read
traffic) leave things on the inactive list longer to allow more chances for
combining and better clustering decisions.
There is no such obvious answer for swap-backed pages; the main difference is
what should happen under low-to-moderate pressure. On a server we probably
want to pre-write as many inactive/dirty pages to swap as possible in order
to respond better to surges, even when pressure is low. We don't want this
behaviour on a laptop, otherwise the disk would never spin down. There's a
configuration parameter in there somewhere.
--
Daniel
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 7:59 ` Laramie Leavitt
@ 2001-06-14 9:24 ` Helge Hafting
2001-06-14 17:38 ` Mark Hahn
0 siblings, 1 reply; 52+ messages in thread
From: Helge Hafting @ 2001-06-14 9:24 UTC (permalink / raw)
To: lar, linux-kernel
Laramie Leavitt wrote:
> Would it be possible to maintain a dirty-rate count
> for the dirty buffers?
>
> For example, it is possible to figure an approximate
> disk subsystem speed from most of the given information.
Disk speed is difficult. I may enable and disable swap on any number of
very different disks and files, and making it per-device won't help that
much. The device may have other partitions with varying access patterns,
and sometimes different devices interfere with each other, such as two
IDE drives on the same cable, or several SCSI drives using up SCSI (or
PCI!) bandwidth for file access.
You may be able to get some useful approximations, but you
will probably not be able to get good numbers in all cases.
Helge Hafting
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 3:16 ` Rik van Riel
2001-06-14 7:59 ` Laramie Leavitt
2001-06-14 8:47 ` Daniel Phillips
@ 2001-06-14 15:10 ` John Stoffel
2001-06-14 18:25 ` Daniel Phillips
2 siblings, 1 reply; 52+ messages in thread
From: John Stoffel @ 2001-06-14 15:10 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Rik van Riel, Tom Sightler, Linux-Kernel
>> The file _could_ be a temporary file, which gets removed before
>> we'd get around to writing it to disk. Sure, the chances of this
>> happening with a single file are close to zero, but having 100MB
>> from 200 different temp files on a shell server isn't unreasonable
>> to expect.
Daniel> This still doesn't make sense if the disk bandwidth isn't
Daniel> being used.
And can't you tell that a certain percentage of buffers are owned by a
single file/process? It would seem that a simple metric of
if ##% of the buffer/cache is used by 1 process/file, start
writing the file out to disk, even if there is no pressure.
might do the trick to handle this case.
>> Maybe we should just see if anything in the first few MB of
>> inactive pages was freeable, limiting the first scan to something
>> like 1 or maybe even 5 MB maximum (freepages.min? freepages.high?)
>> and flushing as soon as we find more unfreeable pages than that ?
Daniel> For file-backed pages what we want is pretty simple: when 1)
Daniel> disk bandwidth is less than xx% used 2) memory pressure is
Daniel> moderate, just submit whatever's dirty. As pressure increases
Daniel> and bandwidth gets loaded up (including read traffic) leave
Daniel> things on the inactive list longer to allow more chances for
Daniel> combining and better clustering decisions.
Would it also be good to say that pressure should increase as the
buffer.free percentage goes down? It won't stop you from filling the
buffer, but it should at least start pushing out pages to disk
earlier.
John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 9:24 ` Helge Hafting
@ 2001-06-14 17:38 ` Mark Hahn
2001-06-15 8:27 ` Helge Hafting
0 siblings, 1 reply; 52+ messages in thread
From: Mark Hahn @ 2001-06-14 17:38 UTC (permalink / raw)
To: Helge Hafting; +Cc: linux-kernel
> > Would it be possible to maintain a dirty-rate count
> > for the dirty buffers?
> >
> > For example, it is possible to figure an approximate
> > disk subsystem speed from most of the given information.
>
> Disk speed is difficult. I may enable and disable swap on any number of
...
> You may be able to get some useful approximations, but you
> will probably not be able to get good numbers in all cases.
a useful approximation would be simply an idle flag.
for instance, if the disk is idle, then cleaning a few
inactive-dirty pages would make perfect sense, even in
the absence of memory pressure.
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 15:10 ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
@ 2001-06-14 18:25 ` Daniel Phillips
0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-14 18:25 UTC (permalink / raw)
To: John Stoffel; +Cc: Rik van Riel, Tom Sightler, Linux-Kernel
On Thursday 14 June 2001 17:10, John Stoffel wrote:
> >> The file _could_ be a temporary file, which gets removed before
> >> we'd get around to writing it to disk. Sure, the chances of this
> >> happening with a single file are close to zero, but having 100MB
> >> from 200 different temp files on a shell server isn't unreasonable
> >> to expect.
>
> Daniel> This still doesn't make sense if the disk bandwidth isn't
> Daniel> being used.
>
> And can't you tell that a certain percentage of buffers are owned by a
> single file/process? It would seem that a simple metric of
>
> if ##% of the buffer/cache is used by 1 process/file, start
> writing the file out to disk, even if there is no pressure.
>
> might do the trick to handle this case.
Buffers and file pages are owned by the vfs, not processes per se, so it
makes accounting harder. In this case you don't care: it's a file, so in the
absence of memory pressure and with disk bandwidth available it's better to
get the data onto disk sooner rather than later. (This glosses over the
question of mmap's, by the way.) It's pretty hard to see why there is any
benefit at all in delaying, but it's clear there's a benefit in terms of data
safety and a further benefit in terms of doing what the user expects.
> >> Maybe we should just see if anything in the first few MB of
> >> inactive pages was freeable, limiting the first scan to something
> >> like 1 or maybe even 5 MB maximum (freepages.min? freepages.high?)
> >> and flushing as soon as we find more unfreeable pages than that ?
>
> Daniel> For file-backed pages what we want is pretty simple: when 1)
> Daniel> disk bandwidth is less than xx% used 2) memory pressure is
> Daniel> moderate, just submit whatever's dirty. As pressure increases
> Daniel> and bandwidth gets loaded up (including read traffic) leave
> Daniel> things on the inactive list longer to allow more chances for
> Daniel> combining and better clustering decisions.
>
> Would it also be good to say that pressure should increase as the
> buffer.free percentage goes down?
Maybe - right now getblk waits until it runs completely out of buffers of a
given size before trying to allocate more, which means that sometimes an IO
will be delayed by the time it takes to complete a page_launder cycle. Two
reasons why it may not be worth doing anything about this: 1) we will move
most of the buffer users into the page cache in due course 2) the frequency
of this kind of io delay is *probably* pretty low.
> It won't stop you from filling the
> buffer, but it should at least start pushing out pages to disk
> earlier.
--
Daniel
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 8:47 ` Daniel Phillips
@ 2001-06-14 20:23 ` Roger Larsson
2001-06-15 6:04 ` Mike Galbraith
2001-06-14 20:39 ` John Stoffel
1 sibling, 1 reply; 52+ messages in thread
From: Roger Larsson @ 2001-06-14 20:23 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linux-Kernel
On Thursday 14 June 2001 10:47, Daniel Phillips wrote:
> On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> > On Wed, 13 Jun 2001, Tom Sightler wrote:
> > > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > > After the initial burst, the system should stabilise,
> > > > starting the writeout of pages before we run low on
> > > > memory. How to handle the initial burst is something
> > > > I haven't figured out yet ... ;)
> > >
> > > Well, at least I know that this is expected with the VM, although I do
> > > still think this is bad behavior. If my disk is idle why would I wait
> > > until I have greater than 100MB of data to write before I finally
> > > start actually moving some data to disk?
> >
> > The file _could_ be a temporary file, which gets removed
> > before we'd get around to writing it to disk. Sure, the
> > chances of this happening with a single file are close to
> > zero, but having 100MB from 200 different temp files on a
> > shell server isn't unreasonable to expect.
>
> This still doesn't make sense if the disk bandwidth isn't being used.
>
It does if you are running on a laptop. Then you do not want the pages to
go out all the time: the disk has gone to sleep, spins up to write a few
pages, stays idle for a while, goes back to sleep, spins up for a few more
pages, ...
/RogerL
--
Roger Larsson
Skellefteå
Sweden
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 8:47 ` Daniel Phillips
2001-06-14 20:23 ` Roger Larsson
@ 2001-06-14 20:39 ` John Stoffel
2001-06-14 20:51 ` Rik van Riel
` (2 more replies)
1 sibling, 3 replies; 52+ messages in thread
From: John Stoffel @ 2001-06-14 20:39 UTC (permalink / raw)
To: Roger Larsson; +Cc: Daniel Phillips, Linux-Kernel
Roger> It does if you are running on a laptop. Then you do not want
Roger> the pages to go out all the time. The disk has gone to sleep, needs
Roger> to start up to write a few pages, stays idle for a while, goes to
Roger> sleep, a few more pages, ...
That could be handled by a metric which says if the disk is spun down,
wait until there is more memory pressure before writing. But if the
disk is spinning, we don't care, you should start writing out buffers
at some low rate to keep the pressure from rising too rapidly.
The idea of buffers is more to keep from overloading the disk
subsystem with IO, not to stop IO from happening at all, and to keep
the system from going from no IO straight to flat-out stalling on IO.
There should be a smooth curve: as VM pressure goes up, the buffer
flushing IO rate goes up as well.
Overall, I think Rik, Jonathan and the rest of the hard core VM crew
have been doing a great job with 2.4.5+ work, it seems like it's
getting better and better all the time, and I really appreciate it.
We're now more into some corner cases and tuning issues. Hopefully.
John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 20:39 ` John Stoffel
@ 2001-06-14 20:51 ` Rik van Riel
2001-06-14 21:33 ` John Stoffel
2001-06-15 15:23 ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
2 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-14 20:51 UTC (permalink / raw)
To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel
On Thu, 14 Jun 2001, John Stoffel wrote:
> That could be handled by a metric which says if the disk is spun down,
> wait until there is more memory pressure before writing. But if the
> disk is spinning, we don't care, you should start writing out buffers
> at some low rate to keep the pressure from rising too rapidly.
>
> The idea of buffers is more to keep from overloading the disk
> subsystem with IO, not to stop IO from happening at all. And to keep
> it from going from no IO to full out stalling the system IO. It
> should be a nice line as VM pressure goes up, buffer flushing IO rate
> goes up as well.
There's another issue. If dirty data is written out in
small bunches, that means we have to write out the dirty
data more often.
This in turn means extra disk seeks, which can horribly
interfere with disk reads.
regards,
Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 20:39 ` John Stoffel
2001-06-14 20:51 ` Rik van Riel
@ 2001-06-14 21:33 ` John Stoffel
2001-06-14 22:23 ` Rik van Riel
2001-06-15 15:23 ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
2 siblings, 1 reply; 52+ messages in thread
From: John Stoffel @ 2001-06-14 21:33 UTC (permalink / raw)
To: Rik van Riel; +Cc: John Stoffel, Roger Larsson, Daniel Phillips, Linux-Kernel
Rik> There's another issue. If dirty data is written out in small
Rik> bunches, that means we have to write out the dirty data more
Rik> often.
What do you consider a small bunch? 32k? 1MB? 1% of buffer space?
I don't see how delaying writes until the buffer is almost full really
helps us. As the buffer fills, the pressure to do writes should
increase, so that we tend, over time, to empty the buffer.
A buffer is just that, not persistent storage.
And in the case given, we were not seeing slow degradation, we saw
that the user ran into a wall (or inflection point in the response
time vs load graph), which was pretty sharp. We need to handle that
more gracefully.
Rik> This in turn means extra disk seeks, which can horribly interfere
Rik> with disk reads.
True, but are we optimizing for reads or for writes here? Shouldn't
they really be equally weighted for priority? And wouldn't the
Elevator help handle this to a degree?
Some areas to think about, at least for me. And maybe it should be
read and write pressure, not rate?
- low write rate, and a low read rate.
- Do seeks dominate our IO latency/throughput?
- low read rate, higher write rate (ie buffers filling faster than
they are being written to disk)
- Do we care as much about reads in this case?
- If the write is just a small, high intensity burst, we don't want
to go ape on writing out buffers to disk, but we do want to raise the
rate we do so in the background, no?
- low write rate, high read rate.
- seems like we want to keep writing the buffers, but at a lower
rate.
Just some thoughts...
John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
stoffel@lucent.com - http://www.lucent.com - 978-952-7548
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 21:33 ` John Stoffel
@ 2001-06-14 22:23 ` Rik van Riel
0 siblings, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-14 22:23 UTC (permalink / raw)
To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel
On Thu, 14 Jun 2001, John Stoffel wrote:
> Rik> There's another issue. If dirty data is written out in small
> Rik> bunches, that means we have to write out the dirty data more
> Rik> often.
>
> What do you consider a small bunch? 32k? 1Mb? 1% of buffer space?
> I don't see how delaying writes until the buffer is almost full really
> helps us. As the buffer fills, the pressure to do writes should
> increase, so that we tend, over time, to empty the buffer.
>
> A buffer is just that, not persistent storage.
>
> And in the case given, we were not seeing slow degradation, we saw
> that the user ran into a wall (or inflection point in the response
> time vs load graph), which was pretty sharp. We need to handle that
> more gracefully.
No doubt that we need to handle it gracefully,
but as long as we don't have answers to any of the
tricky questions you're asking above, it'll be kind of hard
to come up with a patch ;))
> Rik> This in turn means extra disk seeks, which can horribly interfere
> Rik> with disk reads.
>
> True, but are we optimizing for reads or for writes here? Shouldn't
> they really be equally weighted for priority? And wouldn't the
> Elevator help handle this to a degree?
We definitely need to optimise for reads.
Every time we do a read, we KNOW there's a process waiting
on the data to come in from the disk.
Most of the time when we do writes, they'll be
asynchronous delayed IO which is done in the background. The
program which wrote the data has long since moved on to other
things.
> Some areas to think about, at least for me. And maybe it should be
> read and write pressure, not rate?
>
> - low write rate, and a low read rate.
> - Do seeks dominate our IO latency/throughput?
Seeks always dominate IO latency ;)
If you have a program which needs to get data from some file
on disk, it is beneficial for that program if the disk head
is near the data it wants.
Moving the disk head all the way to the other side of the
disk once a second will not slow the program down too much,
but moving the disk head away 30 times a second "because
there is little disk load" might just slow the program
down by a factor of 2 ...
I.e., if the head is in the same track or in the track next
door, we only have rotational latency to account for (say 3ms);
if we're on the other side of the disk we also have to add in
seek time (say 7ms). Giving the program 30 * 7 = 210 ms of
extra IO wait time per second just isn't good ;)
> - low write rate, high read rate.
> - seems like we want to keep writing the buffers, but at a lower
> rate.
Not at a lower rate, just in larger blocks. Disk transfer
rate is so ridiculously high nowadays that seek time seems
the only sensible thing to optimise for.
regards,
Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 20:23 ` Roger Larsson
@ 2001-06-15 6:04 ` Mike Galbraith
0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-15 6:04 UTC (permalink / raw)
To: Roger Larsson; +Cc: Daniel Phillips, Linux-Kernel
On Thu, 14 Jun 2001, Roger Larsson wrote:
> On Thursday 14 June 2001 10:47, Daniel Phillips wrote:
> > On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> > > On Wed, 13 Jun 2001, Tom Sightler wrote:
> > > > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > > > After the initial burst, the system should stabilise,
> > > > > starting the writeout of pages before we run low on
> > > > > memory. How to handle the initial burst is something
> > > > > I haven't figured out yet ... ;)
> > > >
> > > > Well, at least I know that this is expected with the VM, although I do
> > > > still think this is bad behavior. If my disk is idle why would I wait
> > > > until I have greater than 100MB of data to write before I finally
> > > > start actually moving some data to disk?
> > >
> > > The file _could_ be a temporary file, which gets removed
> > > before we'd get around to writing it to disk. Sure, the
> > > chances of this happening with a single file are close to
> > > zero, but having 100MB from 200 different temp files on a
> > > shell server isn't unreasonable to expect.
> >
> > This still doesn't make sense if the disk bandwidth isn't being used.
> >
>
> It does if you are running on a laptop. Then you do not want the pages to
> go out all the time. The disk has gone to sleep, needs to start writing a few
> pages, stays idle for a while, goes to sleep, a few more pages, ...
True, you'd never want data trickling to disk on a laptop on battery.
If you have to write, you'd want to write everything dirty at once to
extend the period between disk powerups to the max.
With file IO, there is a high probability that the disk is running
while you are generating files, temp or not (because you generally do
read/write, not ____/write), so that doesn't negate the argument.
Delayed write is definitely nice for temp files or whatever.. until
your dirty data no longer comfortably fits in ram. At that point, the
write delay just became lost time and wasted disk bandwidth whether
it's a laptop or not. The problem is how do you know the dirty data
is going to become too large for comfort?
One thing which seems to me likely to improve behavior is to have two
goals: one is the trigger point for starting flushing; the second is
a goal below the start point, so we define a quantity which needs to be
flushed to prevent us from having to flush again so soon. Stopping as
soon as the flush trigger is reached means that we'll reach that limit
instantly if the box is doing any writing.. bad news for the laptop and
not beneficial to desktop or server boxen. Another benefit of having
two goals is that it's easy to see if you're making progress or losing
ground so you can modify behavior accordingly.
Rik mentioned that the system definitely needs to be optimized for read,
and just thinking about it without possessing much OS theory, that rings
of golden truth. Now, what does having much dirt lying around do to
asynchronous readahead?.. it turns it synchronous prematurely and negates
the read optimization.
-Mike
(That's why I mentioned having tried a clean shortage, to ensure more
headroom for readahead to keep it asynchronous longer. Not working the
disk hard 'enough' [define that;] must harm performance by turning both
read _and_ write into strictly synchronous operations prematurely)
* Re: 2.4.6-pre2, pre3 VM Behavior
2001-06-14 17:38 ` Mark Hahn
@ 2001-06-15 8:27 ` Helge Hafting
0 siblings, 0 replies; 52+ messages in thread
From: Helge Hafting @ 2001-06-15 8:27 UTC (permalink / raw)
To: Mark Hahn, linux-kernel
Mark Hahn wrote:
> > Disk speed is difficult. I may enable and disable swap on any number of
> ...
> > You may be able to get some useful approximations, but you
> > will probably not be able to get good numbers in all cases.
>
> a useful approximation would be simply an idle flag.
> for instance, if the disk is idle, then cleaning a few
> inactive-dirty pages would make perfect sense, even in
> the absence of memory pressure.
You can't say "the disk". One disk is common, but so are
setups with several. You can say "clean pages if
all disks are idle". You may then lose some opportunities
if one disk is idle while an unrelated one is busy.
Saying "clean a page if the disk it goes to is idle" may
look like the perfect solution, but it is surprisingly
hard. It doesn't work with two IDE drives on the same
cable - accessing one will delay the other, which might be busy.
The same can happen with SCSI if the bandwidth of the SCSI bus
(or the isa/pci/whatever bus) it is connected to is maxed out.
And then there are loop & md devices. My computer has several
md devices using different partitions on the same two disks,
as well as a few ordinary partitions. Code to deal correctly
with that in all cases when one disk is busy and the other idle
is hard. Probably so complex that it'll be rejected on the
KISS principle alone.
A per-disk "low-priority queue" in addition to the ordinary
elevator might work well even in the presence of md
devices, as the md devices just pass stuff on to the real
disks. Basically let the disk driver pick stuff from the low-
priority queue only when the elevator is completely idle.
But this gives another problem - you get the complication
of moving stuff from low to normal priority at times.
Such as when the process does fsync() or the pressure
increases.
Helge Hafting
* spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-14 20:39 ` John Stoffel
2001-06-14 20:51 ` Rik van Riel
2001-06-14 21:33 ` John Stoffel
@ 2001-06-15 15:23 ` Pavel Machek
2001-06-16 20:50 ` Daniel Phillips
2001-06-18 20:21 ` spindown Simon Huggins
2 siblings, 2 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-15 15:23 UTC (permalink / raw)
To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel
Hi!
> Roger> It does if you are running on a laptop. Then you do not want
> Roger> the pages to go out all the time. The disk has gone to sleep, needs
> Roger> to start to write a few pages, stays idle for a while, goes to
> Roger> sleep, a few more pages, ...
>
> That could be handled by a metric which says if the disk is spun down,
> wait until there is more memory pressure before writing. But if the
> disk is spinning, we don't care, you should start writing out buffers
> at some low rate to keep the pressure from rising too rapidly.
Notice that a write is not free (in terms of power) even if the disk is
spinning. Seeks (etc.) also take some power. And think about flash cards:
writing is certainly cheaper than spinning the disk up, but still not free.
Also note that the kernel does not [currently] know that a disk has spun down.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-15 15:23 ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
@ 2001-06-16 20:50 ` Daniel Phillips
2001-06-16 21:06 ` Rik van Riel
2001-06-18 20:21 ` spindown Simon Huggins
1 sibling, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-16 20:50 UTC (permalink / raw)
To: Pavel Machek, John Stoffel; +Cc: Roger Larsson, Linux-Kernel
On Friday 15 June 2001 17:23, Pavel Machek wrote:
> Hi!
>
> > Roger> It does if you are running on a laptop. Then you do not want
> > Roger> the pages to go out all the time. The disk has gone to sleep, needs
> > Roger> to start to write a few pages, stays idle for a while, goes to
> > Roger> sleep, a few more pages, ...
> >
> > That could be handled by a metric which says if the disk is spun down,
> > wait until there is more memory pressure before writing. But if the
> > disk is spinning, we don't care, you should start writing out buffers
> > at some low rate to keep the pressure from rising too rapidly.
>
> Notice that a write is not free (in terms of power) even if the disk is
> spinning. Seeks (etc.) also take some power. And think about flash cards:
> writing is certainly cheaper than spinning the disk up, but still not free.
>
> Also note that the kernel does not [currently] know that a disk has spun down.
There's an easy answer that should work well on both servers and laptops,
that goes something like this: when memory pressure has been brought to 0, if
there is plenty of disk bandwidth available, continue writeout for a
while and clean some extra pages. In other words, any episode of pageouts
is followed immediately by a short episode of preemptive cleaning.
This gives both the preemptive cleaning we want in order to respond to the
next surge, and lets the laptop disk spin down. The definition of 'for a
while' and 'plenty of disk bandwidth' can be tuned, but I don't think either
is particularly critical.
As a side note, the good old multisecond delay before bdflush kicks in
doesn't really make a lot of sense - when bandwidth is available the
filesystem-initiated writeouts should happen right away.
It's not necessary or desirable to write out more dirty pages after the
machine has been idle for a while, if only because the longer it's idle the
less the 'surge protection' matters in terms of average throughput.
--
Daniel
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-16 20:50 ` Daniel Phillips
@ 2001-06-16 21:06 ` Rik van Riel
2001-06-16 21:25 ` Rik van Riel
2001-06-16 21:44 ` Daniel Phillips
0 siblings, 2 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-16 21:06 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel
On Sat, 16 Jun 2001, Daniel Phillips wrote:
> In other words, any episode of pageouts is followed immediately by a
> short episode of preemptive cleaning.
linux/mm/vmscan.c::page_launder(), around line 666:
/* Let bdflush take care of the rest. */
wakeup_bdflush(0);
> The definition of 'for a while' and 'plenty of disk bandwidth' can be
> tuned, but I don't think either is particularly critical.
Can be tuned a bit, indeed.
> As a side note, the good old multisecond delay before bdflush kicks in
> doesn't really make a lot of sense - when bandwidth is available the
> filesystem-initiated writeouts should happen right away.
... thus spinning up the disk ?
How about just making sure we write out a bigger bunch
of dirty pages whenever one buffer gets too old ?
Does the patch below do anything good for your laptop? ;)
regards,
Rik
--
--- buffer.c.orig Sat Jun 16 18:05:15 2001
+++ buffer.c Sat Jun 16 18:05:29 2001
@@ -2550,8 +2550,7 @@
if the current bh is not yet timed out,
then also all the following bhs
will be too young. */
- if (++flushed > bdf_prm.b_un.ndirty &&
- time_before(jiffies, bh->b_flushtime))
+ if(time_before(jiffies, bh->b_flushtime))
goto out_unlock;
} else {
if (++flushed > bdf_prm.b_un.ndirty)
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-16 21:06 ` Rik van Riel
@ 2001-06-16 21:25 ` Rik van Riel
2001-06-16 21:44 ` Daniel Phillips
1 sibling, 0 replies; 52+ messages in thread
From: Rik van Riel @ 2001-06-16 21:25 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel
On Sat, 16 Jun 2001, Rik van Riel wrote:
Oops, I did something stupid and the patch is reversed ;)
> --- buffer.c.orig Sat Jun 16 18:05:15 2001
> +++ buffer.c Sat Jun 16 18:05:29 2001
> @@ -2550,8 +2550,7 @@
> if the current bh is not yet timed out,
> then also all the following bhs
> will be too young. */
> - if (++flushed > bdf_prm.b_un.ndirty &&
> - time_before(jiffies, bh->b_flushtime))
> + if(time_before(jiffies, bh->b_flushtime))
> goto out_unlock;
> } else {
> if (++flushed > bdf_prm.b_un.ndirty)
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-16 21:06 ` Rik van Riel
2001-06-16 21:25 ` Rik van Riel
@ 2001-06-16 21:44 ` Daniel Phillips
2001-06-16 21:54 ` Rik van Riel
2001-06-17 10:05 ` Mike Galbraith
1 sibling, 2 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-16 21:44 UTC (permalink / raw)
To: Rik van Riel; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel
On Saturday 16 June 2001 23:06, Rik van Riel wrote:
> On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > As a side note, the good old multisecond delay before bdflush kicks in
> > doesn't really make a lot of sense - when bandwidth is available the
> > filesystem-initiated writeouts should happen right away.
>
> ... thus spinning up the disk ?
Nope, the disk is already spinning, some other writeouts just finished.
> How about just making sure we write out a bigger bunch
> of dirty pages whenever one buffer gets too old ?
It's simpler than that. It's basically just: disk traffic low? good, write
out all the dirty buffers. Not quite as crude as that, but nearly.
> Does the patch below do anything good for your laptop? ;)
I'll wait for the next one ;-)
--
Daniel
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-16 21:44 ` Daniel Phillips
@ 2001-06-16 21:54 ` Rik van Riel
2001-06-17 10:28 ` Daniel Phillips
2001-06-17 10:05 ` Mike Galbraith
1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2001-06-16 21:54 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel
On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > Does the patch below do anything good for your laptop? ;)
>
> I'll wait for the next one ;-)
OK, here's one which isn't reversed and should work ;))
--- fs/buffer.c.orig Sat Jun 16 18:05:29 2001
+++ fs/buffer.c Sat Jun 16 18:05:15 2001
@@ -2550,7 +2550,8 @@
if the current bh is not yet timed out,
then also all the following bhs
will be too young. */
- if (time_before(jiffies, bh->b_flushtime))
+ if (++flushed > bdf_prm.b_un.ndirty &&
+ time_before(jiffies, bh->b_flushtime))
goto out_unlock;
} else {
if (++flushed > bdf_prm.b_un.ndirty)
cheers,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-16 21:44 ` Daniel Phillips
2001-06-16 21:54 ` Rik van Riel
@ 2001-06-17 10:05 ` Mike Galbraith
2001-06-17 12:49 ` (lkml)Re: " thunder7
2001-06-18 14:22 ` Daniel Phillips
1 sibling, 2 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-17 10:05 UTC (permalink / raw)
To: Daniel Phillips
Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
thunder7, Linux-Kernel
On Sat, 16 Jun 2001, Daniel Phillips wrote:
> On Saturday 16 June 2001 23:06, Rik van Riel wrote:
> > On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > > As a side note, the good old multisecond delay before bdflush kicks in
> > > doesn't really make a lot of sense - when bandwidth is available the
> > > filesystem-initiated writeouts should happen right away.
> >
> > ... thus spinning up the disk ?
>
> Nope, the disk is already spinning, some other writeouts just finished.
>
> > How about just making sure we write out a bigger bunch
> > of dirty pages whenever one buffer gets too old ?
>
> It's simpler than that. It's basically just: disk traffic low? good, write
> out all the dirty buffers. Not quite as crude as that, but nearly.
>
> > Does the patch below do anything good for your laptop? ;)
>
> I'll wait for the next one ;-)
Greetings! (well, not next one, but one anyway)
It _juuust_ so happens that I was tinkering... what do you think of
something like the below? (and boy do I ever wonder what a certain
box doing slrn stuff thinks of it.. hint hint;)
-Mike
Doing Bonnie on a big, fragmented 1k-blocksize partition at the worst spot
on the disk. Bad benchmark, bad conditions.. but interesting results.
2.4.6.pre3 before
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
500 9609 36.0 10569 14.3 3322 6.4 9509 47.6 10597 13.8 101.7 1.4
2.4.6.pre3 after (using flushto behavior as in defaults)
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
500 8293 30.2 11834 29.4 5072 9.5 8879 44.1 10597 13.6 100.4 0.9
2.4.6.pre3 after (flushto = ndirty)
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
500 10286 38.4 10715 14.4 3267 6.1 9605 47.6 10596 13.4 102.7 1.6
--- fs/buffer.c.org Fri Jun 15 06:48:17 2001
+++ fs/buffer.c Sun Jun 17 09:14:17 2001
@@ -118,20 +118,21 @@
wake-cycle */
int nrefill; /* Number of clean buffers to try to obtain
each time we call refill */
- int dummy1; /* unused */
+ int nflushto; /* Level to flush down to once bdflush starts */
int interval; /* jiffies delay between kupdate flushes */
int age_buffer; /* Time for normal buffer to age before we flush it */
int nfract_sync; /* Percentage of buffer cache dirty to
activate bdflush synchronously */
- int dummy2; /* unused */
+ int nmonitor; /* Size (%physpages) at which bdflush should
+ begin monitoring the buffercache */
int dummy3; /* unused */
} b_un;
unsigned int data[N_PARAM];
-} bdf_prm = {{30, 64, 64, 256, 5*HZ, 30*HZ, 60, 0, 0}};
+} bdf_prm = {{60, 64, 64, 50, 5*HZ, 30*HZ, 85, 15, 0}};
/* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = { 0, 10, 5, 25, 0, 1*HZ, 0, 0, 0};
-int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,600*HZ, 6000*HZ, 100, 0, 0};
+int bdflush_min[N_PARAM] = {0, 10, 5, 0, 0, 1*HZ, 0, 0, 0};
+int bdflush_max[N_PARAM] = {100,50000, 20000, 100,600*HZ, 6000*HZ, 100, 100, 0};
/*
* Rewrote the wait-routines to use the "new" wait-queue functionality,
@@ -763,12 +764,8 @@
balance_dirty(NODEV);
if (free_shortage())
page_launder(GFP_BUFFER, 0);
- if (!grow_buffers(size)) {
+ if (!grow_buffers(size))
wakeup_bdflush(1);
- current->policy |= SCHED_YIELD;
- __set_current_state(TASK_RUNNING);
- schedule();
- }
}
void init_buffer(struct buffer_head *bh, bh_end_io_t *handler, void *private)
@@ -1042,25 +1039,43 @@
1 -> sync flush (wait for I/O completion) */
int balance_dirty_state(kdev_t dev)
{
- unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
-
- dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
- tot = nr_free_buffer_pages();
+ unsigned long dirty, cache, buffers = 0;
+ int i;
- dirty *= 100;
- soft_dirty_limit = tot * bdf_prm.b_un.nfract;
- hard_dirty_limit = tot * bdf_prm.b_un.nfract_sync;
-
- /* First, check for the "real" dirty limit. */
- if (dirty > soft_dirty_limit) {
- if (dirty > hard_dirty_limit)
+ for (i = 0; i < NR_LIST; i++)
+ buffers += size_buffers_type[i];
+ buffers >>= PAGE_SHIFT;
+ if (buffers * 100 < num_physpages * bdf_prm.b_un.nmonitor)
+ return -1;
+
+ buffers *= bdf_prm.b_un.nfract;
+ dirty = 100 * (size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT);
+ cache = atomic_read(&page_cache_size) + nr_free_pages();
+ cache *= bdf_prm.b_un.nfract_sync;
+ if (dirty > buffers) {
+ if (dirty > cache)
return 1;
return 0;
}
-
return -1;
}
+int balance_dirty_done(kdev_t dev)
+{
+ unsigned long dirty, buffers = 0;
+ int i;
+
+ for (i = 0; i < NR_LIST; i++)
+ buffers += size_buffers_type[i];
+ buffers >>= PAGE_SHIFT;
+ buffers *= bdf_prm.b_un.nflushto;
+ dirty = 100 * (size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT);
+
+ if (dirty < buffers)
+ return 1;
+ return 0;
+}
+
/*
* if a new dirty buffer is created we need to balance bdflush.
*
@@ -2528,9 +2543,15 @@
static int flush_dirty_buffers(int check_flushtime)
{
struct buffer_head * bh, *next;
- int flushed = 0, i;
+ int flushed = 0, weight = 0, i;
restart:
+ /*
+ * If we have a shortage, we have been laundering and reclaiming
+ * or will be. In either case, we should adjust flush weight.
+ */
+ if (!check_flushtime && current->mm)
+ weight += (free_shortage() + inactive_shortage()) >> 4;
spin_lock(&lru_list_lock);
bh = lru_list[BUF_DIRTY];
if (!bh)
@@ -2552,9 +2573,6 @@
will be too young. */
if (time_before(jiffies, bh->b_flushtime))
goto out_unlock;
- } else {
- if (++flushed > bdf_prm.b_un.ndirty)
- goto out_unlock;
}
/* OK, now we are committed to write it out. */
@@ -2563,8 +2581,14 @@
ll_rw_block(WRITE, 1, &bh);
atomic_dec(&bh->b_count);
- if (current->need_resched)
+ if (++flushed >= bdf_prm.b_un.ndirty + weight ||
+ current->need_resched) {
+ /* kflushd and user tasks return to schedule points. */
+ if (!check_flushtime)
+ return flushed;
+ flushed = 0;
schedule();
+ }
goto restart;
}
out_unlock:
@@ -2580,8 +2604,14 @@
if (waitqueue_active(&bdflush_wait))
wake_up_interruptible(&bdflush_wait);
- if (block)
+ if (block) {
flush_dirty_buffers(0);
+ if (current->mm) {
+ current->policy |= SCHED_YIELD;
+ __set_current_state(TASK_RUNNING);
+ schedule();
+ }
+ }
}
/*
@@ -2672,7 +2702,7 @@
int bdflush(void *sem)
{
struct task_struct *tsk = current;
- int flushed;
+ int flushed, state;
/*
* We have a bare-bones task_struct, and really should fill
* in a few more things so "top" and /proc/2/{exe,root,cwd}
@@ -2696,13 +2726,17 @@
CHECK_EMERGENCY_SYNC
flushed = flush_dirty_buffers(0);
+ state = balance_dirty_state(NODEV);
+ if (state == 1)
+ run_task_queue(&tq_disk);
/*
- * If there are still a lot of dirty buffers around,
- * skip the sleep and flush some more. Otherwise, we
- * go to sleep waiting a wakeup.
+ * If there are still a lot of dirty buffers around, schedule
+ * and flush some more. Otherwise, go back to sleep.
*/
- if (!flushed || balance_dirty_state(NODEV) < 0) {
+ if (current->need_resched || state == 0)
+ schedule();
+ else if (!flushed || balance_dirty_done(NODEV)) {
run_task_queue(&tq_disk);
interruptible_sleep_on(&bdflush_wait);
}
@@ -2738,7 +2772,11 @@
interval = bdf_prm.b_un.interval;
if (interval) {
tsk->state = TASK_INTERRUPTIBLE;
+sleep:
schedule_timeout(interval);
+ /* Get out of the way if kflushd is running. */
+ if (!waitqueue_active(&bdflush_wait))
+ goto sleep;
} else {
stop_kupdate:
tsk->state = TASK_STOPPED;
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-16 21:54 ` Rik van Riel
@ 2001-06-17 10:28 ` Daniel Phillips
0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-17 10:28 UTC (permalink / raw)
To: Rik van Riel; +Cc: Pavel Machek, John Stoffel, Roger Larsson, Linux-Kernel
On Saturday 16 June 2001 23:54, Rik van Riel wrote:
> On Sat, 16 Jun 2001, Daniel Phillips wrote:
> > > Does the patch below do anything good for your laptop? ;)
> >
> > I'll wait for the next one ;-)
>
> OK, here's one which isn't reversed and should work ;))
>
> --- fs/buffer.c.orig Sat Jun 16 18:05:29 2001
> +++ fs/buffer.c Sat Jun 16 18:05:15 2001
> @@ -2550,7 +2550,8 @@
> if the current bh is not yet timed out,
> then also all the following bhs
> will be too young. */
> - if (time_before(jiffies, bh->b_flushtime))
> + if (++flushed > bdf_prm.b_un.ndirty &&
> + time_before(jiffies, bh->b_flushtime))
> goto out_unlock;
> } else {
> if (++flushed > bdf_prm.b_un.ndirty)
No, it doesn't, because some way of knowing the disk load is required and
there's nothing like that here.
There are two components to what I was talking about:
1) Early flush when load is light
2) Preemptive cleaning when load is light
Both are supposed to be triggered by other disk activity, swapout or file
writes, and are supposed to be triggered when the disk activity eases up.
--
Daniel
* Re: (lkml)Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-17 10:05 ` Mike Galbraith
@ 2001-06-17 12:49 ` thunder7
2001-06-17 16:40 ` Mike Galbraith
2001-06-18 14:22 ` Daniel Phillips
1 sibling, 1 reply; 52+ messages in thread
From: thunder7 @ 2001-06-17 12:49 UTC (permalink / raw)
To: Mike Galbraith; +Cc: linux-kernel, riel
On Sun, Jun 17, 2001 at 12:05:10PM +0200, Mike Galbraith wrote:
>
> It _juuust_ so happens that I was tinkering... what do you think of
> something like the below? (and boy do I ever wonder what a certain
> box doing slrn stuff thinks of it.. hint hint;)
>
I'm sorry to say this box doesn't really think any different of it.
Everything that's in the cache before running slrn on a big group seems
to stay there the whole time, making my active slrn-process use swap.
I applied the patch to 2.4.5-ac15, and this was the result:
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 0 11216 2548 183560 264172 1 4 184 343 123 119 2 6 92
0 0 0 11212 2620 183444 264184 0 0 4 72 127 99 1 2 97
0 0 0 11212 1604 183444 264740 0 0 378 0 130 101 2 1 98
0 1 0 11212 1588 184300 263116 0 0 552 1080 277 360 3 14 83
2 0 2 11212 1692 174052 270536 0 0 1860 0 596 976 9 50 40
2 0 2 11212 1588 166732 274816 0 0 1868 5426 643 1050 8 44 48
0 1 0 11212 1588 163276 276888 0 0 1714 1816 580 972 9 17 74
0 1 0 11212 1848 166280 273688 0 0 514 3952 301 355 3 40 57
1 0 0 11212 1592 164232 273872 0 0 1824 3532 632 1083 11 25 64
2 0 2 11212 1980 167304 268792 0 0 1678 0 550 881 8 51 41
0 1 2 11212 1588 163908 271356 0 0 1344 4896 508 753 7 26 67
1 0 0 11212 1588 160896 272756 0 0 1642 1301 574 929 9 22 69
0 1 0 11212 1592 164936 268632 0 0 756 3594 370 467 6 43 51
2 0 3 11212 1596 164380 266552 0 0 1904 2392 604 1017 10 52 37
1 0 0 11212 1592 164752 265844 0 0 1784 2382 623 1000 10 22 69
0 1 0 11212 1592 168528 262256 0 0 810 4176 364 523 5 43 52
0 1 1 11212 1992 169324 259504 0 0 1686 3068 578 999 11 42 47
0 1 0 11212 1588 170696 256332 0 0 1568 1080 532 894 10 20 70
1 0 0 11212 1592 174876 253036 0 0 598 3600 315 420 4 41 55
0 1 1 11212 2316 171592 253892 0 0 1816 3286 616 1073 7 29 64
0 1 0 11212 1588 170380 253968 0 0 1638 840 540 910 13 29 58
0 1 1 11212 2896 168840 253740 0 0 752 4120 342 458 4 45 51
0 1 0 11216 2012 166392 255560 0 0 1352 2458 549 895 8 14 77
2 0 1 11216 1588 170744 250164 0 0 1504 1260 503 791 7 48 45
0 1 1 11224 1588 170704 249948 0 0 874 4106 516 655 6 10 84
0 1 0 11228 1588 170148 248988 0 0 1442 0 466 772 8 20 73
1 0 0 11228 1592 171784 247456 0 0 860 3598 362 495 7 44 48
0 1 0 11228 1588 171864 246212 0 0 1390 3176 510 840 9 41 50
0 1 2 11232 1992 170344 245832 0 0 1676 1808 539 898 10 45 45
1 0 1 10508 1632 168204 246780 0 946 1508 2804 599 920 9 20 71
0 1 0 9496 2020 168904 244880 0 0 936 3620 417 603 5 35 60
1 0 0 9604 2516 164096 247536 0 0 1700 2214 563 1085 11 33 56
0 1 0 16196 1820 162112 255492 0 2 1384 1596 497 1106 8 53 38
1 0 0 19240 3000 158052 260608 0 0 400 3824 373 388 2 14 84
1 1 1 28756 4508 146032 278104 0 0 1688 2140 612 1502 7 60 33
2 0 0 39432 29100 105668 300912 0 18 2108 1178 645 1825 12 52 36
1 0 0 40668 13024 108568 311748 0 0 1674 4992 623 1017 9 12 79
0 1 0 45324 3484 105072 326432 0 0 1876 3624 619 1090 13 24 63
1 0 0 53648 1564 102740 337688 0 18 950 3646 404 857 5 31 63
2 0 0 53672 1604 103356 335680 0 2962 1436 5864 565 976 10 43 47
1 0 1 54380 1920 103516 334320 0 1086 1826 1626 590 1072 13 45 42
0 1 1 54600 6532 99568 333860 0 1006 242 5948 277 2680 2 39 59
0 1 0 54596 1944 103744 331932 0 0 1854 3644 627 1054 11 16 73
1 0 0 54592 1924 102876 331100 0 950 1956 2612 621 1173 11 41 48
1 0 0 54592 1592 103576 329568 0 0 1548 4860 605 1106 11 36 53
0 1 1 54592 1588 102908 328320 0 452 1808 2522 583 1049 11 51 38
0 1 1 54592 1588 101916 327076 0 866 1816 1260 589 1046 11 49 40
0 1 0 54592 2076 99568 327776 0 414 992 5728 459 1314 7 25 67
0 1 0 54592 1588 103928 323824 0 0 968 3646 403 747 5 33 61
1 0 0 54592 2632 100108 325136 0 402 1856 2468 622 1369 13 44 42
0 1 0 54592 1588 101872 322600 0 392 1056 2834 461 802 6 35 60
1 0 1 55644 1724 102108 322404 0 380 1448 2682 501 1032 9 50 41
1 1 1 57388 1588 103068 322056 0 0 1384 1396 471 780 8 37 56
0 1 1 58500 2048 102024 323020 0 368 876 3932 504 755 6 11 83
1 0 1 65756 18188 85916 330256 0 2298 740 3680 316 1313 5 70 26
1 1 1 70632 30324 69368 338880 0 1600 650 3804 329 907 4 83 14
0 1 1 70856 9676 75040 350076 0 0 1872 4394 642 1016 10 16 75
1 0 0 71136 1564 78716 350192 0 0 2024 3604 669 1131 11 17 72
0 1 0 71476 1560 82388 342428 0 0 2022 3654 671 1108 13 15 72
0 1 0 71880 1564 86068 335120 0 0 1742 3620 591 946 11 13 76
0 1 0 71876 1560 86080 331492 0 0 1630 0 508 861 7 15 79
1 0 0 72204 1556 89728 328004 0 0 154 3660 360 243 2 5 92
0 1 0 72204 1560 93404 320364 0 0 1736 3612 609 1044 11 12 76
2 0 0 72204 1560 93404 316984 68 0 1658 0 473 788 15 35 50
1 0 0 72204 1588 95688 317020 0 1014 1934 4628 650 1119 10 20 70
0 1 0 72196 2200 97964 320428 0 38 1660 3642 618 931 7 18 75
0 1 1 72196 1588 100008 319428 0 0 788 4132 390 594 4 32 64
2 3 1 72180 1920 101068 318516 0 2604 1818 4388 717 8010 6 44 49
1 0 1 72180 1648 97756 320368 0 204 1410 3134 595 933 10 17 73
0 1 1 72868 1588 99064 317864 0 1962 1716 4548 580 1398 10 48 42
0 1 1 80340 2212 99744 322868 0 884 1610 2670 552 1048 10 55 34
0 1 0 83392 1588 99792 326128 0 0 324 4620 402 629 2 17 81
1 0 1 90228 1592 99116 331664 2 2230 1882 3730 616 1067 9 51 39
3 0 1 95008 1588 102052 331888 0 3754 1440 5916 556 1042 9 62 29
0 1 2 97784 1588 102432 333648 0 1900 336 5016 366 564 3 41 56
1 0 1 98360 2828 102744 331796 0 4366 430 6242 376 868 3 62 35
0 1 0 98384 1588 101656 332828 0 0 338 12 199 223 15 37 48
0 1 0 98364 1588 102520 331268 0 0 1734 1160 357 421 2 3 94
here slrn starts to sort all the headers just read in.
1 0 0 98320 1588 102520 331336 3548 0 3812 0 181 189 39 2 59
0 1 0 98320 1616 102520 332968 3966 0 3966 0 166 176 43 3 54
1 0 0 98320 1588 100832 335272 4096 0 4128 116 185 221 44 3 53
1 0 0 98320 2184 97692 338568 4242 0 4274 10 218 305 44 4 52
1 0 0 98320 1588 96320 341424 3850 2 3882 68 198 269 44 12 44
0 1 0 98320 1588 95032 343652 3772 0 3772 30 184 236 45 3 52
1 0 0 98320 1588 92144 347176 4064 0 4096 14 171 204 44 3 53
1 0 0 98320 2268 89940 349532 4004 0 4036 40 215 275 45 4 51
1 0 0 98320 2212 89348 350096 252 0 284 4 110 68 51 1 47
1 0 0 98320 2208 89348 350100 0 0 0 36 111 65 51 1 48
process idle.
The slrn test I use is to open a very big group (some 150,000 headers)
from the local spool. This first reads a lot of headers from disk, building
an impressive 100 MB of malloc()ed space in memory, then sorts
these headers.
Good luck,
Jurriaan
--
BOFH excuse #34:
(l)user error
GNU/Linux 2.4.5-ac15 SMP/ReiserFS 2x1402 bogomips load av: 0.41 0.11 0.03
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: (lkml)Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-17 12:49 ` (lkml)Re: " thunder7
@ 2001-06-17 16:40 ` Mike Galbraith
0 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-17 16:40 UTC (permalink / raw)
To: thunder7; +Cc: linux-kernel, riel
On Sun, 17 Jun 2001 thunder7@xs4all.nl wrote:
> On Sun, Jun 17, 2001 at 12:05:10PM +0200, Mike Galbraith wrote:
> >
> > It _juuust_ so happens that I was tinkering... what do you think of
> > something like the below? (and boy do I ever wonder what a certain
> > box doing slrn stuff thinks of it.. hint hint;)
> >
> I'm sorry to say this box doesn't really think any different of it.
Well darn. But..
> Everything that's in the cache before running slrn on a big group seems
> to stay there the whole time, making my active slrn-process use swap.
It should not be the same data if page aging is working at all. Better
stated, if it _is_ the same data and page aging is working, it's needed
data, so the movement of momentarily unused rss to disk might have been
the right thing to do.. it just has to buy you the use of the pages moved
for long enough to offset the (large) cost of dropping those pages.
I saw it adding rss to the aging pool, but not terribly much IO. The
fact that it is using page replacement is only interesting in regard to
total system efficiency.
> I applied the patch to 2.4.5-ac15, and this was the result:
<saves vmstat>
Thanks for running it. Can you (afford to) send me procinfo or similar
information (what I would like to see is job efficiency)? Full logs
are fine, as long as they're not truly huge :) Anything under a meg
is gratefully accepted (privately 'course).
I think (am pretty darn sure) the aging fairness change is what is
affecting you, but it's not possible to tell whether its effect is
negative or positive without timing data.
-Mike
misc:
wrt this ~patch, it only allows you to move the rolldown to sync disk
behavior some.. moving write delay back some (knob) is _supposed_ to
get that IO load (at least) a modest throughput increase. The flushto
thing was basically directed toward laptop use, but ~seems to exhibit
better IO clustering/bandwidth sharing as well. (less old/new request
merging?.. distance?)
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-17 10:05 ` Mike Galbraith
2001-06-17 12:49 ` (lkml)Re: " thunder7
@ 2001-06-18 14:22 ` Daniel Phillips
2001-06-19 4:35 ` Mike Galbraith
` (2 more replies)
1 sibling, 3 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-18 14:22 UTC (permalink / raw)
To: Mike Galbraith
Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
thunder7, Linux-Kernel
On Sunday 17 June 2001 12:05, Mike Galbraith wrote:
> It _juuust_ so happens that I was tinkering... what do you think of
> something like the below? (and boy do I ever wonder what a certain
> box doing slrn stuff thinks of it.. hint hint;)
It's too subtle for me ;-) (Not shy about saying that because this part of
the kernel is probably subtle for everyone.)
The question I'm tackling right now is how the system behaves when the load
goes away, or doesn't get heavy. Your patch doesn't measure the load
directly - it may attempt to predict it as a function of memory pressure, but
that's a little more loosely coupled than what I had in mind.
I'm now in the midst of hatching a patch. [1] The first thing I had to do is
go explore the block driver code, yum yum. I found that it already computes
the statistic I'm interested in, namely queued_sectors, which is used to pace
the IO on block devices. It's a little crude - we really want this to be
per-queue and have one queue per "spindle" - but even in its current form
it's workable.
The idea is that when queued_sectors drops below some threshold we have
'unused disk bandwidth' so it would be nice to do something useful with it:
1) Do an early 'sync_old_buffers'
2) Do some preemptive pageout
The benefit of (1) is that it lets disks go idle a few seconds earlier, and
(2) should improve the system's latency in response to load surges. There
are drawbacks too, which have been pointed out to me privately, but they tend
to be pretty minor, for example: on a flash disk you'd do a few extra writes
and wear it out ever-so-slightly sooner. All the same, such special devices
can be dealt with easily once we progress a little further in improving the
kernel's 'per spindle' intelligence.
Now how to implement this. I considered putting a (newly minted)
wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded
transition, and that's fine except it doesn't do the whole job: we also need
to have the early flush for any write to a disk file while the disks are
lightly loaded, i.e., there is no convenient loaded-to-unloaded transition to
trigger it. The missing trigger could be inserted into __mark_dirty, but
that would penalize the loaded state (a little, but that's still too much).
Furthermore, it's probably desirable to maintain a small delay between the
dirty and the flush. So what I'll try first is just running kflush's timer
faster, and make its reschedule period vary with disk load, i.e., when there
are fewer queued_sectors, kflush looks at the dirty buffer list more often.
The rest of what has to happen in kflush is pretty straightforward. It just
uses queued_sectors to determine how far to walk the dirty buffer list, which
is maintained in time-since-dirtied order. If queued_sectors is below some
threshold the entire list is flushed. Note that we want to change the sense
of b_flushtime to b_timedirtied. It's more efficient to do it this way
anyway.
I haven't done anything about preemptive pageout yet, but similar ideas apply.
[1] This is an experiment, do not worry, it will not show up in your tree any
time soon. IOW, constructive criticism appreciated, flames copied to
/dev/null.
--
Daniel
* Re: spindown
2001-06-15 15:23 ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
2001-06-16 20:50 ` Daniel Phillips
@ 2001-06-18 20:21 ` Simon Huggins
2001-06-19 10:46 ` spindown Pavel Machek
1 sibling, 1 reply; 52+ messages in thread
From: Simon Huggins @ 2001-06-18 20:21 UTC (permalink / raw)
To: Pavel Machek; +Cc: Rik van Riel, Daniel Phillips, Linux-Kernel
On Fri, Jun 15, 2001 at 03:23:07PM +0000, Pavel Machek wrote:
> > Roger> It does if you are running on a laptop. Then you do not want
> > Roger> the pages go out all the time. Disk has gone to sleep, needs
> > Roger> to start to write a few pages, stays idle for a while, goes to
> > Roger> sleep, a few more pages, ...
> > That could be handled by a metric which says if the disk is spun
> > down, wait until there is more memory pressure before writing. But
> > if the disk is spinning, we don't care, you should start writing out
> > buffers at some low rate to keep the pressure from rising too
> > rapidly.
> Notice that write is not free (in terms of power) even if disk is
> spinning. Seeks (etc) also take some power. And think about
> flashcards. It certainly is cheaper than spinning the disk up but still not
> free.
Isn't this why noflushd exists or is this an evil thing that shouldn't
ever be used and will eventually eat my disks for breakfast?
Description: allow idle hard disks to spin down
Noflushd is a daemon that spins down disks that have not been read from
after a certain amount of time, and then prevents disk writes from
spinning them back up. It's targeted for laptops but can be used on any
computer with IDE disks. The effect is that the hard disk actually spins
down, saving you battery power, and shutting off the loudest component of
most computers.
http://noflushd.sourceforge.net
Simon.
--
[ "CATS. CATS ARE NICE." - Death, "Sourcery" ]
Black Cat Networks. http://www.blackcatnetworks.co.uk/
* Re: spindown [was Re: 2.4.6-pre2, pre3 VM Behavior]
2001-06-18 14:22 ` Daniel Phillips
@ 2001-06-19 4:35 ` Mike Galbraith
2001-06-20 1:50 ` [RFC] Early flush (was: spindown) Daniel Phillips
2001-06-20 4:39 ` Richard Gooch
2 siblings, 0 replies; 52+ messages in thread
From: Mike Galbraith @ 2001-06-19 4:35 UTC (permalink / raw)
To: Daniel Phillips
Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
thunder7, Linux-Kernel
On Mon, 18 Jun 2001, Daniel Phillips wrote:
> On Sunday 17 June 2001 12:05, Mike Galbraith wrote:
> > It _juuust_ so happens that I was tinkering... what do you think of
> > something like the below? (and boy do I ever wonder what a certain
> > box doing slrn stuff thinks of it.. hint hint;)
>
> It's too subtle for me ;-) (Not shy about saying that because this part of
> the kernel is probably subtle for everyone.)
No subtlety (hammer), it just draws a line that doesn't move around
in unpredictable ways. For example, nr_free_buffer_pages() adds in
free pages to the line it draws. You may have a large volume of dirty
data, decide it would be prudent to flush, then someone frees a nice
chunk of memory... (send morse code messages via malloc/free?:)
Anyway it's crude, but it seems to have gotten results from the slrn
load. I received logs for ac15 and ac15+patch. ac15 took 265 seconds
to do the job whereas with the patch it took 227 seconds. I haven't
pored over the logs yet, but there seems to be throughput to be had.
If anyone is interested in the logs, they're much smaller than expected
-rw-r--r-- 1 mikeg users 11993 Jun 19 05:58 ac15_mike.log
-rw-r--r-- 1 mikeg users 13015 Jun 19 05:58 ac15_org.log
> The question I'm tackling right now is how the system behaves when the load
> goes away, or doesn't get heavy. Your patch doesn't measure the load
> directly - it may attempt to predict it as a function of memory pressure, but
> that's a little more loosely coupled than what I had in mind.
It doesn't attempt to predict, it reacts to the existing situation.
> I'm now in the midst of hatching a patch. [1] The first thing I had to do is
> go explore the block driver code, yum yum. I found that it already computes
> the statistic I'm interested in, namely queued_sectors, which is used to pace
> the IO on block devices. It's a little crude - we really want this to be
> per-queue and have one queue per "spindle" - but even in its current form
> it's workable.
>
> The idea is that when queued_sectors drops below some threshold we have
> 'unused disk bandwidth' so it would be nice to do something useful with it:
(that's much more subtle/clever:)
> 1) Do an early 'sync_old_buffers'
> 2) Do some preemptive pageout
>
> The benefit of (1) is that it lets disks go idle a few seconds earlier, and
> (2) should improve the system's latency in response to load surges. There
> are drawbacks too, which have been pointed out to me privately, but they tend
> to be pretty minor, for example: on a flash disk you'd do a few extra writes
> and wear it out ever-so-slightly sooner. All the same, such special devices
> can be dealt with easily once we progress a little further in improving the
> kernel's 'per spindle' intelligence.
>
> Now how to implement this. I considered putting a (newly minted)
> wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded
> transition, and that's fine except it doesn't do the whole job: we also need
> to have the early flush for any write to a disk file while the disks are
> lightly loaded, i.e., there is no convenient loaded-to-unloaded transition to
> trigger it. The missing trigger could be inserted into __mark_dirty, but
> that would penalize the loaded state (a little, but that's still too much).
> Furthermore, it's probably desirable to maintain a small delay between the
> dirty and the flush. So what I'll try first is just running kflush's timer
> faster, and make its reschedule period vary with disk load, i.e., when there
> are fewer queued_sectors, kflush looks at the dirty buffer list more often.
>
> The rest of what has to happen in kflush is pretty straightforward. It just
> uses queued_sectors to determine how far to walk the dirty buffer list, which
> is maintained in time-since-dirtied order. If queued_sectors is below some
> threshold the entire list is flushed. Note that we want to change the sense
> of b_flushtime to b_timedirtied. It's more efficient to do it this way
> anyway.
>
> I haven't done anything about preemptive pageout yet, but similar ideas apply.
Preemptive pageout could be as simple as walking the dirty list looking
for swap pages and writing them out. With the fair aging change that's
already in, there will be some. If the fair aging change to background
aging works out, there will be more (don't want too many more though;). The
only problem I can see with that simple method is that once written, the
page lands on the inactive_clean list. That list is short and does get
consumed.. might turn a fake pageout into a real one unintentionally.
> [1] This is an experiment, do not worry, it will not show up in your tree any
> time soon. IOW, constructive criticism appreciated, flames copied to
> /dev/null.
Look forward to seeing it.
-Mike
* Re: spindown
2001-06-18 20:21 ` spindown Simon Huggins
@ 2001-06-19 10:46 ` Pavel Machek
2001-06-20 16:52 ` spindown Daniel Phillips
2001-06-21 16:07 ` spindown Jamie Lokier
0 siblings, 2 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-19 10:46 UTC (permalink / raw)
To: Rik van Riel, Daniel Phillips, Linux-Kernel
Hi!
> > > Roger> It does if you are running on a laptop. Then you do not want
> > > Roger> the pages go out all the time. Disk has gone to sleep, needs
> > > Roger> to start to write a few pages, stays idle for a while, goes to
> > > Roger> sleep, a few more pages, ...
> > > That could be handled by a metric which says if the disk is spun
> > > down, wait until there is more memory pressure before writing. But
> > > if the disk is spinning, we don't care, you should start writing out
> > > buffers at some low rate to keep the pressure from rising too
> > > rapidly.
> > Notice that write is not free (in terms of power) even if disk is
> > spinning. Seeks (etc) also take some power. And think about
> > flashcards. It certainly is cheaper than spinning the disk up but still not
> > free.
>
> Isn't this why noflushd exists or is this an evil thing that shouldn't
> ever be used and will eventually eat my disks for breakfast?
It would eat your flash for breakfast. You know, flash memories have
no spinning parts, so there's nothing to spin down.
Pavel
--
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org
* [RFC] Early flush (was: spindown)
2001-06-18 14:22 ` Daniel Phillips
2001-06-19 4:35 ` Mike Galbraith
@ 2001-06-20 1:50 ` Daniel Phillips
2001-06-20 20:58 ` Tom Sightler
2001-06-20 4:39 ` Richard Gooch
2 siblings, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 1:50 UTC (permalink / raw)
To: Mike Galbraith
Cc: Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
thunder7, Linux-Kernel
I never realized how much I didn't like the good old 5 second delay between
saving an edit and actually getting it written to disk until it went away.
Now the question is: did I lose any performance in doing that? What I wrote
in the previous email turned out to be pretty accurate, so I'll just quote it
to keep it together with the patch:
> I'm now in the midst of hatching a patch. [1] The first thing I had to do
> is go explore the block driver code, yum yum. I found that it already
> computes the statistic I'm interested in, namely queued_sectors, which is
> used to pace the IO on block devices. It's a little crude - we really want
> this to be per-queue and have one queue per "spindle" - but even in its
> current form it's workable.
>
> The idea is that when queued_sectors drops below some threshold we have
> 'unused disk bandwidth' so it would be nice to do something useful with it:
>
> 1) Do an early 'sync_old_buffers'
> 2) Do some preemptive pageout
>
> The benefit of (1) is that it lets disks go idle a few seconds earlier, and
> (2) should improve the system's latency in response to load surges. There
> are drawbacks too, which have been pointed out to me privately, but they
> tend to be pretty minor, for example: on a flash disk you'd do a few extra
> writes and wear it out ever-so-slightly sooner. All the same, such special
> devices can be dealt with easily once we progress a little further in improving
> the kernel's 'per spindle' intelligence.
>
> Now how to implement this. I considered putting a (newly minted)
> wakeup_kflush in blk_finished_io, conditional on a loaded-to-unloaded
> transition, and that's fine except it doesn't do the whole job: we also
> need to have the early flush for any write to a disk file while the disks
> are lightly loaded, i.e., there is no convenient loaded-to-unloaded
> transition to trigger it. The missing trigger could be inserted into
> __mark_dirty, but that would penalize the loaded state (a little, but
> that's still too much). Furthermore, it's probably desirable to maintain a
> small delay between the dirty and the flush. So what I'll try first is
> just running kflush's timer faster, and make its reschedule period vary
> with disk load, i.e., when there are fewer queued_sectors, kflush looks at
> the dirty buffer list more often.
>
> The rest of what has to happen in kflush is pretty straightforward. It
> just uses queued_sectors to determine how far to walk the dirty buffer
> list, which is maintained in time-since-dirtied order. If queued_sectors
> is below some threshold the entire list is flushed. Note that we want to
> change the sense of b_flushtime to b_timedirtied. It's more efficient to
> do it this way anyway.
>
> I haven't done anything about preemptive pageout yet, but similar ideas
> apply.
>
> [1] This is an experiment, do not worry, it will not show up in your tree
> any time soon. IOW, constructive criticism appreciated, flames copied to
> /dev/null.
I originally intended to implement a sliding flush delay based on disk load.
This turned out to be a lot of work for a hard-to-discern benefit. So the
current approach has just two delays: .1 second and whatever the bdflush
delay is set to. If there is any non-flush disk traffic the longer delay is
used. This is crude but effective... I think. I hope that somebody will run
this through some benchmarks to see if I lost any performance. According to
my calculations, I did not. I tested this mainly in UML, and also ran it
briefly on my laptop. The interactive feel of the change is immediately
obvious, and for me at least, a big improvement.
The patch is against 2.4.5. To apply:
cd /your/source/tree
patch <this/patch -p0
--- ../uml.2.4.5.clean/fs/buffer.c Sat May 26 02:57:46 2001
+++ ./fs/buffer.c Wed Jun 20 01:55:21 2001
@@ -1076,7 +1076,7 @@
static __inline__ void __mark_dirty(struct buffer_head *bh)
{
- bh->b_flushtime = jiffies + bdf_prm.b_un.age_buffer;
+ bh->b_dirtytime = jiffies;
refile_buffer(bh);
}
@@ -2524,12 +2524,20 @@
as all dirty buffers lives _only_ in the DIRTY lru list.
As we never browse the LOCKED and CLEAN lru lists they are infact
completly useless. */
-static int flush_dirty_buffers(int check_flushtime)
+static int flush_dirty_buffers (int update)
{
struct buffer_head * bh, *next;
int flushed = 0, i;
+ unsigned queued = atomic_read (&queued_sectors);
+ unsigned long youngest_to_update;
- restart:
+#ifdef DEBUG
+ if (update)
+ printk("kupdate %lu %i\n", jiffies, queued);
+#endif
+
+restart:
+ youngest_to_update = jiffies - (queued? bdf_prm.b_un.age_buffer: 0);
spin_lock(&lru_list_lock);
bh = lru_list[BUF_DIRTY];
if (!bh)
@@ -2544,19 +2552,14 @@
if (buffer_locked(bh))
continue;
- if (check_flushtime) {
- /* The dirty lru list is chronologically ordered so
- if the current bh is not yet timed out,
- then also all the following bhs
- will be too young. */
- if (time_before(jiffies, bh->b_flushtime))
+ if (update) {
+ if (time_before (youngest_to_update, bh->b_dirtytime))
goto out_unlock;
} else {
if (++flushed > bdf_prm.b_un.ndirty)
goto out_unlock;
}
- /* OK, now we are committed to write it out. */
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
ll_rw_block(WRITE, 1, &bh);
@@ -2717,7 +2720,7 @@
int kupdate(void *sem)
{
struct task_struct * tsk = current;
- int interval;
+ int update_when = 0;
tsk->session = 1;
tsk->pgrp = 1;
@@ -2733,11 +2736,11 @@
up((struct semaphore *)sem);
for (;;) {
- /* update interval */
- interval = bdf_prm.b_un.interval;
- if (interval) {
+ unsigned check_interval = HZ/10, update_interval = bdf_prm.b_un.interval;
+
+ if (update_interval) {
tsk->state = TASK_INTERRUPTIBLE;
- schedule_timeout(interval);
+ schedule_timeout(check_interval);
} else {
stop_kupdate:
tsk->state = TASK_STOPPED;
@@ -2756,10 +2759,15 @@
if (stopped)
goto stop_kupdate;
}
+ update_when -= check_interval;
+ if (update_when > 0 && atomic_read(&queued_sectors))
+ continue;
+
#ifdef DEBUG
printk("kupdate() activated...\n");
#endif
sync_old_buffers();
+ update_when = update_interval;
}
}
--- ../uml.2.4.5.clean/include/linux/fs.h Sat May 26 03:01:28 2001
+++ ./include/linux/fs.h Tue Jun 19 15:12:18 2001
@@ -236,7 +236,7 @@
atomic_t b_count; /* users using this block */
kdev_t b_rdev; /* Real device */
unsigned long b_state; /* buffer state bitmap (see above) */
- unsigned long b_flushtime; /* Time when (dirty) buffer should be written */
+ unsigned long b_dirtytime; /* Time buffer became dirty */
struct buffer_head *b_next_free;/* lru/free list linkage */
struct buffer_head *b_prev_free;/* doubly linked list of buffers */
--- ../uml.2.4.5.clean/mm/filemap.c Thu May 31 15:29:06 2001
+++ ./mm/filemap.c Tue Jun 19 15:32:47 2001
@@ -349,7 +349,7 @@
if (buffer_locked(bh) || !buffer_dirty(bh) || !buffer_uptodate(bh))
continue;
- bh->b_flushtime = jiffies;
+ bh->b_dirtytime = jiffies /*- bdf_prm.b_un.age_buffer*/; // needed??
ll_rw_block(WRITE, 1, &bh);
} while ((bh = bh->b_this_page) != head);
return 0;
--- ../uml.2.4.5.clean/mm/highmem.c Sat May 26 02:57:46 2001
+++ ./mm/highmem.c Tue Jun 19 15:33:22 2001
@@ -400,7 +400,7 @@
bh->b_rdev = bh_orig->b_rdev;
bh->b_state = bh_orig->b_state;
#ifdef HIGHMEM_DEBUG
- bh->b_flushtime = jiffies;
+ bh->b_dirtytime = jiffies /*- bdf_prm.b_un.age_buffer*/; // needed??
bh->b_next_free = NULL;
bh->b_prev_free = NULL;
/* bh->b_this_page */
* Re: [RFC] Early flush (was: spindown)
2001-06-18 14:22 ` Daniel Phillips
2001-06-19 4:35 ` Mike Galbraith
2001-06-20 1:50 ` [RFC] Early flush (was: spindown) Daniel Phillips
@ 2001-06-20 4:39 ` Richard Gooch
2001-06-20 14:29 ` Daniel Phillips
2001-06-20 16:12 ` Richard Gooch
2 siblings, 2 replies; 52+ messages in thread
From: Richard Gooch @ 2001-06-20 4:39 UTC (permalink / raw)
To: Daniel Phillips
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
Daniel Phillips writes:
> I never realized how much I didn't like the good old 5 second delay
> between saving an edit and actually getting it written to disk until
> it went away. Now the question is, did I lose any performance in
> doing that. What I wrote in the previous email turned out to be
> pretty accurate, so I'll just quote it
Starting I/O immediately if there is no load sounds nice. However,
what about the other case, when the disc is already spun down (and
hence there's no I/O load either)? I want the system to avoid doing
writes while the disc is spun down. I'm quite happy for the system to
accumulate dirtied pages/buffers, reclaiming clean pages as needed,
until it absolutely has to start writing out (or I call sync(2)).
Right now I hack that by setting bdflush parameters to 5 minutes. But
that's not ideal either.
Regards,
Richard....
Permanent: rgooch@atnf.csiro.au
Current: rgooch@ras.ucalgary.ca
* Re: [RFC] Early flush (was: spindown)
2001-06-20 4:39 ` Richard Gooch
@ 2001-06-20 14:29 ` Daniel Phillips
2001-06-20 16:12 ` Richard Gooch
1 sibling, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 14:29 UTC (permalink / raw)
To: Richard Gooch
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
On Wednesday 20 June 2001 06:39, Richard Gooch wrote:
> Daniel Phillips writes:
> > I never realized how much I didn't like the good old 5 second delay
> > between saving an edit and actually getting it written to disk until
> > it went away. Now the question is, did I lose any performance in
> > doing that. What I wrote in the previous email turned out to be
> > pretty accurate, so I'll just quote it
>
> Starting I/O immediately if there is no load sounds nice. However,
> what about the other case, when the disc is already spun down (and
> hence there's no I/O load either)? I want the system to avoid doing
> writes while the disc is spun down. I'm quite happy for the system to
> accumulate dirtied pages/buffers, reclaiming clean pages as needed,
> until it absolutely has to start writing out (or I call sync(2)).
I'd like that too, but what about sync writes? As things stand now, there is
no option but to spin the disk back up. To get around this we'd have to
change the basic behavior of the block device and that's doable, but it's an
entirely different proposition than the little patch above.
You know about this project no doubt:
http://noflushd.sourceforge.net/
This is really complementary to what I did. Lightweight is not really a good
way to describe it, though; the tarball is almost 10,000 lines long. There is
probably a clever thing to do at the kernel level to shorten that up.
There's one thing I think I can help fix up while I'm working in here, this
complaint:
Reiserfs journaling bypasses the kernel's delayed write mechanisms and
writes straight to disk.
We need to address the reasons why such filesystems have to bypass kupdate.
This touches on how sync and fsync work, updating supers, flushing the inode
cache etc, but with Al Viro's superblock work merged now we could start
thinking about it.
> Right now I hack that by setting bdflush parameters to 5 minutes. But
> that's not ideal either.
Yes, that still works with my patch. The noflushd user space daemon works by
turning off kupdate (set update time to 0).
--
Daniel
* Re: [RFC] Early flush (was: spindown)
2001-06-20 4:39 ` Richard Gooch
2001-06-20 14:29 ` Daniel Phillips
@ 2001-06-20 16:12 ` Richard Gooch
2001-06-22 23:25 ` Daniel Kobras
2001-06-25 11:31 ` Pavel Machek
1 sibling, 2 replies; 52+ messages in thread
From: Richard Gooch @ 2001-06-20 16:12 UTC (permalink / raw)
To: Daniel Phillips
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
Daniel Phillips writes:
> On Wednesday 20 June 2001 06:39, Richard Gooch wrote:
> > Starting I/O immediately if there is no load sounds nice. However,
> > what about the other case, when the disc is already spun down (and
> > hence there's no I/O load either)? I want the system to avoid doing
> > writes while the disc is spun down. I'm quite happy for the system to
> > accumulate dirtied pages/buffers, reclaiming clean pages as needed,
> > until it absolutely has to start writing out (or I call sync(2)).
>
> I'd like that too, but what about sync writes? As things stand now,
> there is no option but to spin the disk back up. To get around this
> we'd have to change the basic behavior of the block device and
> that's doable, but it's an entirely different proposition than the
> little patch above.
I don't care as much about sync writes. They don't seem to happen very
often on my boxes.
> You know about this project no doubt:
>
> http://noflushd.sourceforge.net/
Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
.h files! As you say, not really lightweight. There must be a better
way. Also, I suspect (without having looked at the code) that it
doesn't handle memory pressure well. Things may get nasty when we run
low on free pages.
Regards,
Richard....
Permanent: rgooch@atnf.csiro.au
Current: rgooch@ras.ucalgary.ca
* Re: spindown
2001-06-19 10:46 ` spindown Pavel Machek
@ 2001-06-20 16:52 ` Daniel Phillips
2001-06-20 17:32 ` spindown Rik van Riel
2001-06-21 16:07 ` spindown Jamie Lokier
1 sibling, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 16:52 UTC (permalink / raw)
To: Pavel Machek, Rik van Riel, Linux-Kernel
On Tuesday 19 June 2001 12:46, Pavel Machek wrote:
> > > > Roger> It does if you are running on a laptop. Then you do not want
> > > > Roger> the pages go out all the time. Disk has gone to sleep, needs
> > > > Roger> to start to write a few pages, stays idle for a while, goes to
> > > > Roger> sleep, a few more pages, ...
> > > > That could be handled by a metric which says if the disk is spun
> > > > down, wait until there is more memory pressure before writing. But
> > > > if the disk is spinning, we don't care, you should start writing out
> > > > buffers at some low rate to keep the pressure from rising too
> > > > rapidly.
> > >
> > > Notice that write is not free (in terms of power) even if disk is
> > > spinning. Seeks (etc) also take some power. And think about
> > > flashcards. It certainly is cheaper than spinning the disk up but still not
> > > free.
> >
> > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > ever be used and will eventually eat my disks for breakfast?
>
> It would eat your flash for breakfast. You know, flash memories have
> no spinning parts, so there's nothing to spin down.
Yes, this doesn't make sense for flash, and in fact it doesn't make sense to
have just one set of bdflush parameters for the whole system; it's really a
property of the individual device. So the thing to do is for me to go kibitz
on the IO layer rewrite projects and figure out how to set up the
intelligence per-queue, and have the queues per-device, at which point it's
trivial to do the write^H^H^H^H^H right thing for each kind of device.
BTW, with a nominal 100,000 erases you have to write 10 terabytes to your 100
meg flash disk before you'll see it start to degrade. These devices are set
up to avoid continuous hammering on the same page, and to take failed
pages out of the pool as soon as they fail to erase. Also, the 100,000
figure is nominal - the average number of erases you'll get per page is
considerably higher. The extra few sectors we see with the early flush patch
are just not going to affect the life of your flash to a measurable degree.
--
Daniel
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: spindown
2001-06-20 16:52 ` spindown Daniel Phillips
@ 2001-06-20 17:32 ` Rik van Riel
2001-06-20 18:00 ` spindown Daniel Phillips
0 siblings, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2001-06-20 17:32 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Pavel Machek, Linux-Kernel
On Wed, 20 Jun 2001, Daniel Phillips wrote:
> BTW, with nominal 100,000 erases you have to write 10 terabytes
> to your 100 meg flash disk before you'll see it start to
> degrade.
That assumes you write out full blocks. If you flush after
every byte written you'll hit the limit a lot sooner ;)
Btw, this is also a problem with your patch: when you write
out buffers all the time, your disk will spend more time seeking
all over the place (moving the disk head away from where we are
currently reading!) and you'll end up writing the same block
multiple times ...
regards,
Rik
--
Executive summary of a recent Microsoft press release:
"we are concerned about the GNU General Public License (GPL)"
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com/
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: spindown
2001-06-20 17:32 ` spindown Rik van Riel
@ 2001-06-20 18:00 ` Daniel Phillips
0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 18:00 UTC (permalink / raw)
To: Rik van Riel; +Cc: Pavel Machek, Linux-Kernel
On Wednesday 20 June 2001 19:32, Rik van Riel wrote:
> On Wed, 20 Jun 2001, Daniel Phillips wrote:
> > BTW, with nominal 100,000 erases you have to write 10 terabytes
> > to your 100 meg flash disk before you'll see it start to
> > degrade.
>
> That assumes you write out full blocks. If you flush after
> every byte written you'll hit the limit a lot sooner ;)
Yep, so if you are running on a Yopy, try not to sync after each byte.
> Btw, this is also a problem with your patch, when you write
> out buffers all the time your disk will spend more time seeking
> all over the place (moving the disk head away from where we are
> currently reading!) and you'll end up writing the same block
> multiple times ...
It doesn't work that way: it tacks the flush onto the trailing edge of a
burst of disk activity, or it flushes out an isolated update, say an edit
save, which would have required the same amount of disk activity, just a few
seconds off in the future. Sometimes it does write a few extra sectors when
disk activity is sporadic, but the impact on total throughput is small enough
to be hard to measure reliably. Even so, there is some optimizing that could
be done - the update could be interleaved a little better with the falling
edge of a heavy traffic episode. This would require that the IO rate be
monitored instead of just the queue backlog. I'm interested in tackling that
eventually - it has applications in other areas than just the early update.
--
Daniel
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-20 1:50 ` [RFC] Early flush (was: spindown) Daniel Phillips
@ 2001-06-20 20:58 ` Tom Sightler
2001-06-20 22:09 ` Daniel Phillips
2001-06-24 3:20 ` Anuradha Ratnaweera
0 siblings, 2 replies; 52+ messages in thread
From: Tom Sightler @ 2001-06-20 20:58 UTC (permalink / raw)
To: Daniel Phillips
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
Quoting Daniel Phillips <phillips@bonn-fries.net>:
> I originally intended to implement a sliding flush delay based on disk
> load. This turned out to be a lot of work for a hard-to-discern benefit.
> So the current approach has just two delays: .1 second and whatever the
> bdflush delay is set to. If there is any non-flush disk traffic the
> longer delay is used. This is crude but effective... I think. I hope
> that somebody will run this through some benchmarks to see if I lost any
> performance. According to my calculations, I did not. I tested this
> mainly in UML, and also ran it briefly on my laptop. The interactive
> feel of the change is immediately obvious, and for me at least, a big
> improvement.
Well, since a lot of this discussion seemed to spin off from my original posting
last week about my particular issue with disk flushing, I decided to try your
patch with my simple test/problem that I experience on my laptop.
One note: I ran your patch against 2.4.6-pre3, as that is what currently performs
the best on my laptop. It applied cleanly and compiled without problems.
I used this kernel on my laptop all day for my normal workload, which
consists of a Gnome 1.4 desktop, several Mozilla instances, several
ssh sessions with remote X programs displayed, StarOffice, and VMware (running
Windows 2000 Pro in 128MB). I also performed several compiles throughout the
day. Overall the machine feels slightly more sluggish, I think due to the
following two things:
1. When running a compile, or anything else that produces lots of small disk
writes, you tend to get lots of little pauses for all the little writes to disk.
These seem to be unnoticeable without the patch.
2. Loading programs when write activity is occurring (even light activity like
during the compile) is noticeably slower; actually, any reading from disk is.
I also ran my simple ftp test that produced the symptom I reported earlier. I
transferred a 750MB file via FTP, and with your patch, sure enough, disk writing
started almost immediately, but it still didn't seem to write enough data to
disk to keep up with the transfer, so at approximately the 200MB mark the old
behavior still kicked in as it went into full flush mode, during which
network activity halted, just like before. The big difference with the patch
and without is that the patched kernel never seems to balance out: without the
patch, once the initial burst is done you get a nice stream of data from the
network to disk with the disk staying moderately active. With the patch the
disk varies from barely active to moderate to heavy and back, and during the
heavy periods the network transfer always pauses (although very briefly).
Just my observations, you asked for comments.
Later,
Tom
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-20 20:58 ` Tom Sightler
@ 2001-06-20 22:09 ` Daniel Phillips
2001-06-24 3:20 ` Anuradha Ratnaweera
1 sibling, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-20 22:09 UTC (permalink / raw)
To: Tom Sightler
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
On Wednesday 20 June 2001 22:58, Tom Sightler wrote:
> Quoting Daniel Phillips <phillips@bonn-fries.net>:
> > I originally intended to implement a sliding flush delay based on disk
> > load. This turned out to be a lot of work for a hard-to-discern benefit.
> > So the current approach has just two delays: .1 second and whatever the
> > bdflush delay is set to. If there is any non-flush disk traffic the
> > longer delay is used. This is crude but effective... I think. I hope
> > that somebody will run this through some benchmarks to see if I lost
> > any performance. According to my calculations, I did not. I tested
> > this mainly in UML, and also ran it briefly on my laptop. The
> > interactive feel of the change is immediately obvious, and for me at
> > least, a big improvement.
>
> Well, since a lot of this discussion seemed to spin off from my original
> posting last week about my particular issue with disk flushing I decided to
> try your patch with my simple test/problem that I experience on my laptop.
>
> One note, I ran your patch against 2.4.6-pre3 as that is what currently
> performs the best on my laptop. It seems to apply cleanly and compiled
> without problems.
>
> I used this kernel on my laptop all day for my normal workload, which
> consists of a Gnome 1.4 desktop, several Mozilla instances, several ssh
> sessions with remote X programs displayed, StarOffice, and VMware
> (running Windows 2000 Pro in 128MB). I also performed several compiles
> throughout the day. Overall the machine feels slightly more sluggish, I
> think due to the following two things:
>
> 1. When running a compile, or anything else that produces lots of small
> disk writes, you tend to get lots of little pauses for all the little
> writes to disk. These seem to be unnoticeable without the patch.
OK, this is because the early flush doesn't quit when load picks up again.
Measuring only the io backlog, as I do now, isn't adequate for telling the
difference between load initiated by the flush itself and other load, such as a
cpu-bound process proceeding to read another file, so the flush
doesn't stop flushing when other IO starts happening. This has to be fixed.
In the meantime, you could try this simple tweak: just set the lower bound,
currently 1/10th of a second, a little higher:
- unsigned check_interval = HZ/10, ...
+ unsigned check_interval = HZ/5, ...
This may be enough to bridge the little pauses in the compiler's disk
access pattern so the flush isn't triggered. (This is not by any means a
nice solution.) If you set check_interval to HZ*5, you *should* get exactly
the old behaviour; I'd be very interested to hear if you do.
Also, could you do your compiles with 'time' so you can quantify the results?
> 2. Loading programs when write activity is occurring (even light activity
> like during the compile) is noticeably slower; actually, any reading from
> disk is.
Hmm, let me think why that may be. The loader doesn't actually read the
program into memory; it just maps it and lets the pages fault in as they're
called for. So if readahead isn't perfect (it isn't), the io backlog may drop
to 0 briefly just as kflush decides to sample it, and it initiates a
flush. This flush cleans the whole dirty list out, stealing bandwidth from
the reads.
> I also ran my simple ftp test that produced the symptom I reported earlier.
> I transferred a 750MB file via FTP, and with your patch, sure enough, disk
> writing started almost immediately, but it still didn't seem to write
> enough data to disk to keep up with the transfer, so at approximately the
> 200MB mark the old behavior still kicked in as it went into full flush
> mode, during which network activity halted, just like before. The big
> difference with the patch and without is that the patched kernel never
> seems to balance out: without the patch, once the initial burst is done you
> get a nice stream of data from the network to disk with the disk staying
> moderately active. With the patch the disk varies from barely active to
> moderate to heavy and back, and during the heavy periods the network
> transfer always pauses (although very briefly).
>
> Just my observations, you asked for comments.
Yes, I have to refine this. The inner flush loop has to know how many io
submissions are happening, from which it can subtract its own submissions and
know somebody else is submitting IO, at which point it can fall back to the
good old 5 second buffer age limit. False positives from kflush are handled
as a fringe benefit, and flush_dirty_buffers won't do extra writeout. This
is easy and cheap.
I could get a lot fancier than this and calculate IO load averages, but I'd
only do that after mining out the simple possibilities. I'll probably have
something new for you to try tomorrow, if you're willing. By the way, I'm
not addressing your fundamental problem; that's Rik's job ;-). In fact, I
define success in this effort by the extent to which I don't affect behaviour
under load.
Oh, and I'd better finish configuring my kernel and boot my laptop with this,
i.e., eat my own dogfood ;-)
--
Daniel
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: spindown
2001-06-19 10:46 ` spindown Pavel Machek
2001-06-20 16:52 ` spindown Daniel Phillips
@ 2001-06-21 16:07 ` Jamie Lokier
2001-06-22 22:09 ` spindown Daniel Kobras
2001-06-28 0:27 ` spindown Troy Benjegerdes
1 sibling, 2 replies; 52+ messages in thread
From: Jamie Lokier @ 2001-06-21 16:07 UTC (permalink / raw)
To: Pavel Machek; +Cc: Rik van Riel, Daniel Phillips, Linux-Kernel
Pavel Machek wrote:
> > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > ever be used and will eventually eat my disks for breakfast?
>
> It would eat your flash for breakfast. You know, flash memories have
> no spinning parts, so there's nothing to spin down.
Btw Pavel, does noflushd work with 2.4.4? The noflushd version 2.4 I
tried said it couldn't find some kernel process (kflushd? I don't
remember) and that I should use bdflush. The manual says that's
appropriate for older kernels, but not 2.4.4 surely.
-- Jamie
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: spindown
2001-06-21 16:07 ` spindown Jamie Lokier
@ 2001-06-22 22:09 ` Daniel Kobras
2001-06-28 0:27 ` spindown Troy Benjegerdes
1 sibling, 0 replies; 52+ messages in thread
From: Daniel Kobras @ 2001-06-22 22:09 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Pavel Machek, Linux-Kernel
On Thu, Jun 21, 2001 at 06:07:01PM +0200, Jamie Lokier wrote:
> Pavel Machek wrote:
> > > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > > ever be used and will eventually eat my disks for breakfast?
> >
> > It would eat your flash for breakfast. You know, flash memories have
> > no spinning parts, so there's nothing to spin down.
>
> Btw Pavel, does noflushd work with 2.4.4? The noflushd version 2.4 I
> tried said it couldn't find some kernel process (kflushd? I don't
> remember) and that I should use bdflush. The manual says that's
> appropriate for older kernels, but not 2.4.4 surely.
That's because of my favourite change from the 2.4.3 patch:
- strcpy(tsk->comm, "kupdate");
+ strcpy(tsk->comm, "kupdated");
noflushd 2.4 fixed this issue in the daemon itself, but I had forgotten about
the generic startup script. (Rpms and debs run their customized versions.)
Either the current version from CVS, or
ed /your/init.d/location/noflushd << EOF
%s/kupdate/kupdated/g
w
q
EOF
should get you going.
Regards,
Daniel.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-20 16:12 ` Richard Gooch
@ 2001-06-22 23:25 ` Daniel Kobras
2001-06-23 5:10 ` Daniel Phillips
2001-06-25 11:31 ` Pavel Machek
1 sibling, 1 reply; 52+ messages in thread
From: Daniel Kobras @ 2001-06-22 23:25 UTC (permalink / raw)
To: Richard Gooch
Cc: Daniel Phillips, Mike Galbraith, Rik van Riel, Pavel Machek,
John Stoffel, Roger Larsson, thunder7, Linux-Kernel
On Wed, Jun 20, 2001 at 10:12:38AM -0600, Richard Gooch wrote:
> Daniel Phillips writes:
> > I'd like that too, but what about sync writes? As things stand now,
> > there is no option but to spin the disk back up. To get around this
> > we'd have to change the basic behavior of the block device and
> > that's doable, but it's an entirely different proposition than the
> > little patch above.
>
> I don't care as much about sync writes. They don't seem to happen very
> often on my boxes.
syslog and some editors are the most common users of sync writes. vim, e.g.,
by default keeps fsync()ing its swapfile. By tweaking the configuration of
these apps, this can be prevented fairly easily, though. Changing sync semantics
for this matter, on the other hand, seems pretty awkward to me. I'd expect an
application calling fsync() to have good reason for having its data flushed
to disk _now_, no matter what state the disk happens to be in. If it hasn't,
fix the app, not the kernel.
> > You know about this project no doubt:
> >
> > http://noflushd.sourceforge.net/
>
> Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
> .h files! As you say, not really lightweight. There must be a better
> way.
noflushd would benefit a lot from being able to set bdflush parameters per
device or per disk. So I'm really eager to see what Daniel comes up with.
Currently, we can only turn kupdate either on or off as a whole, which means
that noflushd implements a crude replacement for the benefit of multi-disk
setups. A lot of the cruft stems from there.
> Also, I suspect (without having looked at the code) that it
> doesn't handle memory pressure well. Things may get nasty when we run
> low on free pages.
It doesn't handle memory pressure at all. It doesn't have to. noflushd only
messes with kupdate{,d} but leaves bdflush (formerly known as kflushd) alone.
If memory gets tight, bdflush starts writing out dirty buffers, which makes the
disk spin up, and we're back to normal.
Regards,
Daniel.
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-22 23:25 ` Daniel Kobras
@ 2001-06-23 5:10 ` Daniel Phillips
2001-06-25 11:33 ` Pavel Machek
0 siblings, 1 reply; 52+ messages in thread
From: Daniel Phillips @ 2001-06-23 5:10 UTC (permalink / raw)
To: Daniel Kobras, Richard Gooch, Jens Axboe
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
On Saturday 23 June 2001 01:25, Daniel Kobras wrote:
> On Wed, Jun 20, 2001 at 10:12:38AM -0600, Richard Gooch wrote:
> > Daniel Phillips writes:
> > > I'd like that too, but what about sync writes? As things stand now,
> > > there is no option but to spin the disk back up. To get around this
> > > we'd have to change the basic behavior of the block device and
> > > that's doable, but it's an entirely different proposition than the
> > > little patch above.
> >
> > I don't care as much about sync writes. They don't seem to happen very
> > often on my boxes.
>
> syslog and some editors are the most common users of sync writes. vim,
> e.g., by default keeps fsync()ing its swapfile. By tweaking the
> configuration of these apps, this can be prevented fairly easily, though.
> Changing sync semantics for this matter, on the other hand, seems pretty
> awkward to me. I'd expect an application calling fsync() to have good
> reason for having its data flushed to disk _now_, no matter what state the
> disk happens to be in. If it hasn't, fix the app, not the kernel.
But apps shouldn't have to know about the special requirements of laptops.
I've been playing a little with the idea of creating a special block device
for laptops that goes between the vfs and the real block device, and adds the
behaviour of being able to buffer writes in memory. In all respects it would
seem to the vfs to be a disk. So far this is just a thought experiment.
> > > You know about this project no doubt:
> > >
> > > http://noflushd.sourceforge.net/
> >
> > Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
> > .h files! As you say, not really lightweight. There must be a better
> > way.
>
> noflushd would benefit a lot from being able to set bdflush parameters per
> device or per disk. So I'm really eager to see what Daniel comes up with.
> Currently, we can only turn kupdate either on or off as a whole, which
> means that noflushd implements a crude replacement for the benefit of
> multi-disk setups. A lot of the cruft stems from there.
Yes, another person to talk to about this is Jens Axboe who has been doing
some serious hacking on the block layer. I thought I'd get the early flush
patch working well for one disk before generalizing to N ;-)
> > Also, I suspect (without having looked at the code) that it
> > doesn't handle memory pressure well. Things may get nasty when we run
> > low on free pages.
>
> It doesn't handle memory pressure at all. It doesn't have to. noflushd only
> messes with kupdate{,d} but leaves bdflush (formerly known as kflushd)
> alone. If memory gets tight, bdflush starts writing out dirty buffers,
> which makes the disk spin up, and we're back to normal.
Exactly. And in addition, when bdflush does wake up, I try to get kupdate
out of the way as much as possible, though I've been following the
traditional recipe and having it submit all buffers past a certain age. This
is quite possibly a bad thing to do because it could starve the swapper.
Ouch.
--
Daniel
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-20 20:58 ` Tom Sightler
2001-06-20 22:09 ` Daniel Phillips
@ 2001-06-24 3:20 ` Anuradha Ratnaweera
2001-06-24 11:14 ` Daniel Phillips
2001-06-24 15:06 ` Rik van Riel
1 sibling, 2 replies; 52+ messages in thread
From: Anuradha Ratnaweera @ 2001-06-24 3:20 UTC (permalink / raw)
To: Tom Sightler
Cc: Daniel Phillips, Mike Galbraith, Rik van Riel, Pavel Machek,
John Stoffel, Roger Larsson, thunder7, Linux-Kernel
On Wed, Jun 20, 2001 at 04:58:51PM -0400, Tom Sightler wrote:
>
> 1. When running a compile, or anything else that produces lots of small disk
> writes, you tend to get lots of little pauses for all the little writes to disk.
> These seem to be unnoticeable without the patch.
>
> 2. Loading programs when write activity is occurring (even light activity like
> during the compile) is noticeably slower; actually, any reading from disk is.
>
> I also ran my simple ftp test that produced the symptom I reported earlier. I
> transferred a 750MB file via FTP, and with your patch, sure enough, disk writing
> started almost immediately, but it still didn't seem to write enough data to
> disk to keep up with the transfer, so at approximately the 200MB mark the old
> behavior still kicked in as it went into full flush mode, during which
> network activity halted, just like before.
It is not uncommon to have a large number of tmp files on the disk(s) (Rik also
pointed this out somewhere early in the original thread), and it is sensible to
keep all of them in buffers if RAM is sufficient. Transferring _very_ large
files is not _that_ common, so why shouldn't that case be handled from user
space by calling sync(2)?
Anuradha
--
Debian GNU/Linux (kernel 2.4.6-pre5)
Keep cool, but don't freeze.
-- Hellman's Mayonnaise
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-24 3:20 ` Anuradha Ratnaweera
@ 2001-06-24 11:14 ` Daniel Phillips
2001-06-24 15:06 ` Rik van Riel
1 sibling, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-24 11:14 UTC (permalink / raw)
To: Anuradha Ratnaweera, Tom Sightler
Cc: Mike Galbraith, Rik van Riel, Pavel Machek, John Stoffel,
Roger Larsson, thunder7, Linux-Kernel
On Sunday 24 June 2001 05:20, Anuradha Ratnaweera wrote:
> On Wed, Jun 20, 2001 at 04:58:51PM -0400, Tom Sightler wrote:
> > 1. When running a compile, or anything else that produces lots of small
> > disk writes, you tend to get lots of little pauses for all the little
> > writes to disk. These seem to be unnoticeable without the patch.
> >
> > 2. Loading programs when write activity is occurring (even light
> > activity like during the compile) is noticeably slower; actually, any
> > reading from disk is.
> >
> > I also ran my simple ftp test that produced the symptom I reported
> > earlier. I transferred a 750MB file via FTP, and with your patch, sure
> > enough, disk writing started almost immediately, but it still didn't
> > seem to write enough data to disk to keep up with the transfer, so at
> > approximately the 200MB mark the old behavior still kicked in as it
> > went into full flush mode, during which network activity halted, just
> > like before.
>
> It is not uncommon to have a large number of tmp files on the disk(s) (Rik
> also pointed this out somewhere early in the original thread), and it is
> sensible to keep all of them in buffers if RAM is sufficient. Transferring
> _very_ large files is not _that_ common, so why shouldn't that case be
> handled from user space by calling sync(2)?
The patch you're discussing has been superseded - check my "[RFC] Early
flush: new, improved" post from yesterday. This addresses the problem of
handling tmp files efficiently while still having the early flush.
The latest patch shows no degradation at all for compilation, which uses lots
of temporary files.
--
Daniel
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-24 3:20 ` Anuradha Ratnaweera
2001-06-24 11:14 ` Daniel Phillips
@ 2001-06-24 15:06 ` Rik van Riel
2001-06-24 16:21 ` Daniel Phillips
1 sibling, 1 reply; 52+ messages in thread
From: Rik van Riel @ 2001-06-24 15:06 UTC (permalink / raw)
To: Anuradha Ratnaweera
Cc: Tom Sightler, Daniel Phillips, Mike Galbraith, Pavel Machek,
John Stoffel, Roger Larsson, thunder7, Linux-Kernel
On Sun, 24 Jun 2001, Anuradha Ratnaweera wrote:
> It is not uncommon to have a large number of tmp files on the disk(s)
> (Rik also pointed this out somewhere early in the original thread), and
> it is sensible to keep all of them in buffers if RAM is sufficient.
> Transferring _very_ large files is not _that_ common, so why shouldn't
> that case be handled from user space by calling sync(2)?
Wait a moment.
The only observed bad case I've heard about here is
that of large files being written out.
It should be easy enough to just trigger writeout of
pages of an inode once that inode has more than a
certain amount of dirty pages in RAM ... say, something
like freepages.high ?
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-24 15:06 ` Rik van Riel
@ 2001-06-24 16:21 ` Daniel Phillips
0 siblings, 0 replies; 52+ messages in thread
From: Daniel Phillips @ 2001-06-24 16:21 UTC (permalink / raw)
To: Rik van Riel, Anuradha Ratnaweera
Cc: Tom Sightler <ttsig@tuxyturvy.com>, Mike Galbraith,
Pavel Machek, John Stoffel, Roger Larsson, thunder7,
Linux-Kernel
On Sunday 24 June 2001 17:06, Rik van Riel wrote:
> On Sun, 24 Jun 2001, Anuradha Ratnaweera wrote:
> > It is not uncommon to have a large number of tmp files on the disk(s)
> > (Rik also pointed this out somewhere early in the original thread), and
> > it is sensible to keep all of them in buffers if RAM is sufficient.
> > Transferring _very_ large files is not _that_ common, so why shouldn't
> > that case be handled from user space by calling sync(2)?
>
> Wait a moment.
>
> The only observed bad case I've heard about here is
> that of large files being written out.
But that's not the only advantage of doing the early update:
- Early spindown for laptops
- Improved latency under some conditions
- Improved throughput for some loads
- Improved filesystem safety
> It should be easy enough to just trigger writeout of
> pages of an inode once that inode has more than a
> certain amount of dirty pages in RAM ... say, something
> like freepages.high ?
The inode dirty page list is not sorted by "time dirtied" so you would be
eroding the system's ability to ensure that dirty file buffers never get
older than X.
--
Daniel
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-20 16:12 ` Richard Gooch
2001-06-22 23:25 ` Daniel Kobras
@ 2001-06-25 11:31 ` Pavel Machek
1 sibling, 0 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-25 11:31 UTC (permalink / raw)
To: Richard Gooch
Cc: Daniel Phillips, Mike Galbraith, Rik van Riel, Pavel Machek,
John Stoffel, Roger Larsson, thunder7, Linux-Kernel
Hi!
> > You know about this project no doubt:
> >
> > http://noflushd.sourceforge.net/
>
> Only vaguely. It's huge. Over 2300 lines of C code and >560 lines in
> .h files! As you say, not really lightweight. There must be a better
> way. Also, I suspect (without having looked at the code) that it
> doesn't handle memory pressure well. Things may get nasty when we run
> low on free pages.
Noflushd *is* lightweight. It is complicated because it has to know
about different kernel versions etc. It is "easy stuff". If you add
kernel support, it will only *add* lines to noflushd.
Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: [RFC] Early flush (was: spindown)
2001-06-23 5:10 ` Daniel Phillips
@ 2001-06-25 11:33 ` Pavel Machek
0 siblings, 0 replies; 52+ messages in thread
From: Pavel Machek @ 2001-06-25 11:33 UTC (permalink / raw)
To: Daniel Phillips
Cc: Daniel Kobras, Richard Gooch, Jens Axboe, Mike Galbraith,
Rik van Riel, Pavel Machek, John Stoffel, Roger Larsson,
thunder7, Linux-Kernel
Hi!
> > > > I'd like that too, but what about sync writes? As things stand now,
> > > > there is no option but to spin the disk back up. To get around this
> > > > we'd have to change the basic behavior of the block device and
> > > > that's doable, but it's an entirely different proposition than the
> > > > little patch above.
> > >
> > > I don't care as much about sync writes. They don't seem to happen very
> > > often on my boxes.
> >
> > syslog and some editors are the most common users of sync writes. vim,
> > e.g., by default keeps fsync()ing its swapfile. By tweaking the
> > configuration of these apps, this can be prevented fairly easily, though.
> > Changing sync semantics for this matter, on the other hand, seems pretty
> > awkward to me. I'd expect an application calling fsync() to have good
> > reason for having its data flushed to disk _now_, no matter what state
> > the disk happens to be in. If it hasn't, fix the app, not the kernel.
>
> But apps shouldn't have to know about the special requirements of
> laptops.
If an app does fsync(), it hopefully knows what it is doing. [Random apps
should not really do sync even on normal systems -- it hurts
performance.]
Pavel
--
The best software in life is free (not shareware)! Pavel
GCM d? s-: !g p?:+ au- a--@ w+ v- C++@ UL+++ L++ N++ E++ W--- M- Y- R+
^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: spindown
2001-06-21 16:07 ` spindown Jamie Lokier
2001-06-22 22:09 ` spindown Daniel Kobras
@ 2001-06-28 0:27 ` Troy Benjegerdes
1 sibling, 0 replies; 52+ messages in thread
From: Troy Benjegerdes @ 2001-06-28 0:27 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Pavel Machek, Rik van Riel, Daniel Phillips, Linux-Kernel
On Thu, Jun 21, 2001 at 06:07:01PM +0200, Jamie Lokier wrote:
> Pavel Machek wrote:
> > > Isn't this why noflushd exists or is this an evil thing that shouldn't
> > > ever be used and will eventually eat my disks for breakfast?
> >
> > It would eat your flash for breakfast. You know, flash memories have
> > no spinning parts, so there's nothing to spin down.
>
> Btw Pavel, does noflushd work with 2.4.4? The noflushd version 2.4 I
> tried said it couldn't find some kernel process (kflushd? I don't
> remember) and that I should use bdflush. The manual says that's
> appropriate for older kernels, but not 2.4.4 surely.
Yes, noflushd works with 2.4.x. I'm running it on an ibook with
debian-unstable.
And a word of warning: while running noflushd, make sure you 'sync' a
few times after an 'apt-get dist-upgrade' that upgrades damn near
everything, before doing anything that might crash the kernel. Crashing
with that much unflushed data WILL eat your ext2fs for breakfast.
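Troy's precaution amounts to forcing the writeback that noflushd is deliberately deferring. From a program, the same thing is a single libc call; a minimal sketch (the wrapper name is mine, not from the thread):

```c
#include <unistd.h>

/* Flush every dirty buffer in the system to disk -- the programmatic
 * equivalent of running the `sync` command.  With noflushd keeping the
 * disk spun down by deferring writeback, this is the manual safety net
 * before doing anything that might crash the kernel. */
static int flush_everything(void)
{
    sync();   /* schedules all dirty buffers for writeback */
    return 0; /* sync() returns void and cannot fail */
}
```

Running the `sync` command "a few times", as Troy suggests, was the traditional way to give the kernel time to complete the writeback that sync() only schedules.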
--
Troy Benjegerdes | master of mispeeling | 'da hozer' | hozer@drgw.net
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's
why I draw cartoons. It's my life." -- Charles Schulz
end of thread (newest message: 2001-06-28 0:28 UTC)
Thread overview: 52+ messages
2001-06-13 19:31 2.4.6-pre2, pre3 VM Behavior Tom Sightler
2001-06-13 20:21 ` Rik van Riel
2001-06-14 1:49 ` Tom Sightler
2001-06-14 3:16 ` Rik van Riel
2001-06-14 7:59 ` Laramie Leavitt
2001-06-14 9:24 ` Helge Hafting
2001-06-14 17:38 ` Mark Hahn
2001-06-15 8:27 ` Helge Hafting
2001-06-14 8:47 ` Daniel Phillips
2001-06-14 20:23 ` Roger Larsson
2001-06-15 6:04 ` Mike Galbraith
2001-06-14 20:39 ` John Stoffel
2001-06-14 20:51 ` Rik van Riel
2001-06-14 21:33 ` John Stoffel
2001-06-14 22:23 ` Rik van Riel
2001-06-15 15:23 ` spindown [was Re: 2.4.6-pre2, pre3 VM Behavior] Pavel Machek
2001-06-16 20:50 ` Daniel Phillips
2001-06-16 21:06 ` Rik van Riel
2001-06-16 21:25 ` Rik van Riel
2001-06-16 21:44 ` Daniel Phillips
2001-06-16 21:54 ` Rik van Riel
2001-06-17 10:28 ` Daniel Phillips
2001-06-17 10:05 ` Mike Galbraith
2001-06-17 12:49 ` (lkml)Re: " thunder7
2001-06-17 16:40 ` Mike Galbraith
2001-06-18 14:22 ` Daniel Phillips
2001-06-19 4:35 ` Mike Galbraith
2001-06-20 1:50 ` [RFC] Early flush (was: spindown) Daniel Phillips
2001-06-20 20:58 ` Tom Sightler
2001-06-20 22:09 ` Daniel Phillips
2001-06-24 3:20 ` Anuradha Ratnaweera
2001-06-24 11:14 ` Daniel Phillips
2001-06-24 15:06 ` Rik van Riel
2001-06-24 16:21 ` Daniel Phillips
2001-06-20 4:39 ` Richard Gooch
2001-06-20 14:29 ` Daniel Phillips
2001-06-20 16:12 ` Richard Gooch
2001-06-22 23:25 ` Daniel Kobras
2001-06-23 5:10 ` Daniel Phillips
2001-06-25 11:33 ` Pavel Machek
2001-06-25 11:31 ` Pavel Machek
2001-06-18 20:21 ` spindown Simon Huggins
2001-06-19 10:46 ` spindown Pavel Machek
2001-06-20 16:52 ` spindown Daniel Phillips
2001-06-20 17:32 ` spindown Rik van Riel
2001-06-20 18:00 ` spindown Daniel Phillips
2001-06-21 16:07 ` spindown Jamie Lokier
2001-06-22 22:09 ` spindown Daniel Kobras
2001-06-28 0:27 ` spindown Troy Benjegerdes
2001-06-14 15:10 ` 2.4.6-pre2, pre3 VM Behavior John Stoffel
2001-06-14 18:25 ` Daniel Phillips
2001-06-14 8:30 ` Mike Galbraith