linux-kernel.vger.kernel.org archive mirror
* Re: 2.4.6-pre2, pre3 VM Behavior
       [not found] <Pine.LNX.4.10.10106140024230.980-100000@coffee.psychology.mcmaster.ca>
@ 2001-06-14  2:08 ` Tom Sightler
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Sightler @ 2001-06-14  2:08 UTC (permalink / raw)
  To: Mark Hahn; +Cc: Linux-Kernel

Quoting Mark Hahn <hahn@coffee.psychology.mcmaster.ca>:

> > 1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> > close to wire speed).  At this point Linux has yet to write the first byte to
> > disk.  OK, this might be exaggerated, but very little disk activity has
> > occurred on my laptop.
> 
> right.  it could be that the VM scheduling stuff needs some way to tell
> whether the IO system is idle.  that is, if there's no contention for 
> the disk, it might as well be less lazy about writebacks.

That's exactly the way it seems.

> > 2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> > maybe I should do that" then the hard drive light comes on solid for several
> > seconds.  During this time the ftp transfer drops to about 1/5 of the original
> > speed.
> 
> such a dramatic change could be the result of IDE misconfiguration;
> is it safe to assume you have DMA or UDMA enabled?

Yes, UDMA/33 is enabled and working on the drive (using hdparm -d 0 makes the
problem way worse and my drive performs at about 1/4 the speed).

> > This was much less noticeable on a server with a much faster SCSI hard disk
> > subsystem as it took significantly less time to flush the information to the
> 
> is the SCSI disk actually faster (unlikely, for modern disks), or 
> is the SCSI controller simply busmastering, like DMA/UDMA IDE,
> but wholly unlike PIO-mode IDE?

First, let's be fair: we're comparing a UDMA/33 IDE drive in a one-year-old laptop
(IBM Travelstar, if you're interested) to a true SCSI disk subsystem with
mirrored/striped Ultra160 SCSI disks connected via a 64-bit/66MHz PCI bus, so yes,
the SCSI subsystem is MUCH faster.  Specific numbers:

Laptop with TravelStar IDE HD Max sustained read: 16.5MB/s
Server with Ultra160 SCSI disk array Max sustained read: >100MB/s

That's a big difference.  The Travelstar is probably only 5400RPM and is
optimized for power savings, not speed; the SCSI subsystem has multiple 15,000RPM
drives in a striped/mirrored configuration for speed.

Later,
Tom



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 17:38           ` Mark Hahn
@ 2001-06-15  8:27             ` Helge Hafting
  0 siblings, 0 replies; 19+ messages in thread
From: Helge Hafting @ 2001-06-15  8:27 UTC (permalink / raw)
  To: Mark Hahn, linux-kernel

Mark Hahn wrote:

> > Disk speed is difficult.  I may enable and disable swap on any number of
> ...
> > You may be able to get some useful approximations, but you
> > will probably not be able to get good numbers in all cases.
> 
> a useful approximation would be simply an idle flag.
> for instance, if the disk is idle, then cleaning a few
> inactive-dirty pages would make perfect sense, even in
> the absence of memory pressure.

You can't say "the disk".  One disk is common, but so are
setups with several.  You can say "clean pages if
all disks are idle".  You may then lose some opportunities 
if one disk is idle while an unrelated one is busy.

Saying "clean a page if the disk it goes to is idle" may 
look like the perfect solution, but it is surprisingly
hard.  It doesn't work with two IDE drives on the same
cable - accessing one will delay the other, which might be busy.
The same can happen with SCSI if the bandwidth of the SCSI bus
(or the ISA/PCI/whatever bus) it is connected to is maxed out.

And then there are loop & md devices.  My computer has several
md devices using different partitions on the same two disks,
as well as a few ordinary partitions.  Code to deal correctly
with that in all cases when one disk is busy and the other idle 
is hard.  Probably so complex that it'll be rejected on the
KISS principle alone.

A per-disk "low-priority queue"  in addition to the ordinary
elevator might work well even in the presence of md
devices, as the md devices just pass stuff on to the real
disks.  Basically let the disk driver pick stuff from the low-
priority queue only when the elevator is completely idle.
But this gives another problem - you get the complication
of moving stuff from low to normal priority at times,
such as when the process does fsync() or the pressure
increases.
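
Purely as illustration (invented types and names, nothing resembling the
real 2.4 block layer), the shape would be something like:

#include <stddef.h>

/* Hypothetical per-disk two-level queue; illustrative only. */
struct request {
	struct request *next;
	/* sector, buffer head, etc. omitted */
};

struct disk_queue {
	struct request *normal;		/* ordinary elevator queue */
	struct request *lowprio;	/* background cleaning, idle-time only */
};

/* The driver asks for the next request to issue: normal work always wins,
 * the low-priority queue is only touched when the disk is otherwise idle. */
struct request *next_request(struct disk_queue *q)
{
	return q->normal ? q->normal : q->lowprio;
}

/* fsync() or rising memory pressure: promote everything queued for this
 * disk into the normal queue (a real elevator would merge and sort here). */
void promote_lowprio(struct disk_queue *q)
{
	struct request *rq = q->lowprio;

	q->lowprio = NULL;
	while (rq) {
		struct request *next = rq->next;
		rq->next = q->normal;
		q->normal = rq;
		rq = next;
	}
}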

Helge Hafting


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 20:23         ` Roger Larsson
@ 2001-06-15  6:04           ` Mike Galbraith
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Galbraith @ 2001-06-15  6:04 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Daniel Phillips, Linux-Kernel

On Thu, 14 Jun 2001, Roger Larsson wrote:

> On Thursday 14 June 2001 10:47, Daniel Phillips wrote:
> > On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> > > On Wed, 13 Jun 2001, Tom Sightler wrote:
> > > > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > > > After the initial burst, the system should stabilise,
> > > > > starting the writeout of pages before we run low on
> > > > > memory. How to handle the initial burst is something
> > > > > I haven't figured out yet ... ;)
> > > >
> > > > Well, at least I know that this is expected with the VM, although I do
> > > > still think this is bad behavior.  If my disk is idle why would I wait
> > > > until I have greater than 100MB of data to write before I finally
> > > > start actually moving some data to disk?
> > >
> > > The file _could_ be a temporary file, which gets removed
> > > before we'd get around to writing it to disk. Sure, the
> > > chances of this happening with a single file are close to
> > > zero, but having 100MB from 200 different temp files on a
> > > shell server isn't unreasonable to expect.
> >
> > This still doesn't make sense if the disk bandwidth isn't being used.
> >
>
> It does if you are running on a laptop. Then you do not want the pages to
> go out all the time. The disk has gone to sleep, needs to start up to write a few
> pages, stays idle for a while, goes to sleep, a few more pages, ...

True, you'd never want data trickling to disk on a laptop on battery.
If you have to write, you'd want to write everything dirty at once to
extend the period between disk powerups to the max.

With file IO, there is a high probability that the disk is running
while you are generating files, temp or not (because you generally do
read/write, not ____/write), so that doesn't negate the argument.

Delayed write is definitely nice for temp files or whatever.. until
your dirty data no longer comfortably fits in RAM.  At that point, the
write delay just becomes lost time and wasted disk bandwidth, whether
it's a laptop or not.  The problem is: how do you know the dirty data
is going to become too large for comfort?

One thing which seems to me likely to improve behavior is to have two
goals.  One is the trigger point for starting flushing; the second is
a goal below the start point, so we define a quantity which needs to be
flushed to prevent us from having to flush again so soon.  Stopping as
soon as we drop back under the flush trigger means that we'll hit that
limit again instantly if the box is doing any writing.. bad news for the
laptop and not beneficial to desktop or server boxen.  Another benefit of
having two goals is that it's easy to see if you're making progress or
losing ground so you can modify behavior accordingly.
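
Just to make the two-goal idea concrete (toy fragment, invented thresholds,
not real kernel tunables):

/* Invented thresholds, expressed in dirty pages; illustrative only. */
#define FLUSH_START	2048	/* begin writeback when dirty count exceeds this */
#define FLUSH_STOP	512	/* keep writing until we're back down to this */

static int flushing;

/* Nonzero means the flusher should keep submitting dirty pages. */
int should_flush(unsigned long nr_dirty)
{
	if (nr_dirty > FLUSH_START)
		flushing = 1;	/* crossed the trigger: start working */
	else if (nr_dirty <= FLUSH_STOP)
		flushing = 0;	/* reached the goal: stop, let the disk rest */
	return flushing;
}

Whether nr_dirty is heading toward FLUSH_STOP or creeping back toward
FLUSH_START while flushing is exactly the progress-or-losing-ground signal.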

Rik mentioned that the system definitely needs to be optimized for read,
and just thinking about it without possessing much OS theory, that rings
of golden truth.  Now, what does having much dirt lying around do to
asynchronous readahead?.. it turns it synchronous prematurely and negates
the read optimization.

	-Mike

(That's why I mentioned having tried a clean shortage, to ensure more
headroom for readahead to keep it asynchronous longer.  Not working the
disk hard 'enough' [define that;] must harm performance by turning both
read _and_ write into strictly synchronous operations prematurely)




* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 21:33           ` John Stoffel
@ 2001-06-14 22:23             ` Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2001-06-14 22:23 UTC (permalink / raw)
  To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel

On Thu, 14 Jun 2001, John Stoffel wrote:

> Rik> There's another issue.  If dirty data is written out in small
> Rik> bunches, that means we have to write out the dirty data more
> Rik> often.
>
> What do you consider a small bunch?  32k?  1MB?  1% of buffer space?
> I don't see how delaying writes until the buffer is almost full really
> helps us.  As the buffer fills, the pressure to do writes should
> increase, so that we tend, over time, to empty the buffer.
>
> A buffer is just that, not persistent storage.
>
> And in the case given, we were not seeing slow degradation, we saw
> that the user ran into a wall (or inflection point in the response
> time vs load graph), which was pretty sharp.  We need to handle that
> more gracefully.

No doubt on the fact that we need to handle it gracefully,
but as long as we don't have any answers to any of the
tricky questions you're asking above it'll be kind of hard
to come up with a patch ;))

> Rik> This in turn means extra disk seeks, which can horribly interfere
> Rik> with disk reads.
>
> True, but are we optimizing for reads or for writes here?  Shouldn't
> they really be equally weighted for priority?  And wouldn't the
> Elevator help handle this to a degree?

We definitely need to optimise for reads.

Every time we do a read, we KNOW there's a process waiting
on the data to come in from the disk.

Most of the time we do writes, they'll be asynchronous
delayed IO which is done in the background. The program
which wrote the data has moved on to other things long
since.

> Some areas to think about, at least for me.  And maybe it should be
> read and write pressure, not rate?
>
>  - low write rate, and a low read rate.
>    - Do seeks dominate our IO latency/throughput?

Seeks always dominate IO latency ;)

If you have a program which needs to get data from some file
on disk, it is beneficial for that program if the disk head
is near the data it wants.

Moving the disk head all the way to the other side of the
disk once a second will not slow the program down too much,
but moving the disk head away 30 times a second "because
there is little disk load" might just slow the program
down by a factor of 2 ...

I.e. if the head is on the same track or on the track next
door, we only have rotational latency to account for (say 3ms);
if we're on the other side of the disk we also have to count
in seek time (say 7ms). Giving the program 30 * 7 = 210 ms
extra IO wait time per second just isn't good ;)

> - low write rate, high read rate.
>   - seems like we want to keep writing the buffers, but at a lower
>     rate.

Not at a lower rate, just in larger blocks.  Disk transfer
rate is so ridiculously high nowadays that seek time seems
the only sensible thing to optimise for.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 20:39         ` John Stoffel
  2001-06-14 20:51           ` Rik van Riel
@ 2001-06-14 21:33           ` John Stoffel
  2001-06-14 22:23             ` Rik van Riel
  1 sibling, 1 reply; 19+ messages in thread
From: John Stoffel @ 2001-06-14 21:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: John Stoffel, Roger Larsson, Daniel Phillips, Linux-Kernel


Rik> There's another issue.  If dirty data is written out in small
Rik> bunches, that means we have to write out the dirty data more
Rik> often.

What do you consider a small bunch?  32k?  1MB?  1% of buffer space?
I don't see how delaying writes until the buffer is almost full really
helps us.  As the buffer fills, the pressure to do writes should
increase, so that we tend, over time, to empty the buffer.  

A buffer is just that, not persistent storage.  

And in the case given, we were not seeing slow degradation, we saw
that the user ran into a wall (or inflection point in the response
time vs load graph), which was pretty sharp.  We need to handle that
more gracefully.  

Rik> This in turn means extra disk seeks, which can horribly interfere
Rik> with disk reads.

True, but are we optimizing for reads or for writes here?  Shouldn't
they really be equally weighted for priority?  And wouldn't the
Elevator help handle this to a degree?

Some areas to think about, at least for me.  And maybe it should be
read and write pressure, not rate?  

 - low write rate, and a low read rate.
   - Do seeks dominate our IO latency/throughput?

 - low read rate, higher write rate (ie buffers filling faster than
   they are being written to disk)
   - Do we care as much about reads in this case?  
   - If the write is just a small, high intensity burst, we don't want
     to go ape on writing out buffers to disk, but we do want to raise the
     rate we do so in the background, no?

- low write rate, high read rate.
  - seems like we want to keep writing the buffers, but at a lower
    rate. 

Just some thoughts...

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 20:39         ` John Stoffel
@ 2001-06-14 20:51           ` Rik van Riel
  2001-06-14 21:33           ` John Stoffel
  1 sibling, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2001-06-14 20:51 UTC (permalink / raw)
  To: John Stoffel; +Cc: Roger Larsson, Daniel Phillips, Linux-Kernel

On Thu, 14 Jun 2001, John Stoffel wrote:

> That could be handled by a metric which says: if the disk is spun down,
> wait until there is more memory pressure before writing.  But if the
> disk is spinning, we don't care; you should start writing out buffers
> at some low rate to keep the pressure from rising too rapidly.
>
> The idea of buffers is more to keep from overloading the disk
> subsystem with IO, not to stop IO from happening at all.  And to keep
> it from going from no IO to full-out IO that stalls the system.  It
> should be a nice line: as VM pressure goes up, the buffer-flushing IO
> rate goes up as well.

There's another issue.  If dirty data is written out in
small bunches, that means we have to write out the dirty
data more often.

This in turn means extra disk seeks, which can horribly
interfere with disk reads.

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  8:47       ` Daniel Phillips
  2001-06-14 20:23         ` Roger Larsson
@ 2001-06-14 20:39         ` John Stoffel
  2001-06-14 20:51           ` Rik van Riel
  2001-06-14 21:33           ` John Stoffel
  1 sibling, 2 replies; 19+ messages in thread
From: John Stoffel @ 2001-06-14 20:39 UTC (permalink / raw)
  To: Roger Larsson; +Cc: Daniel Phillips, Linux-Kernel


Roger> It does if you are running on a laptop. Then you do not want
Roger> the pages to go out all the time. The disk has gone to sleep, needs
Roger> to start up to write a few pages, stays idle for a while, goes to
Roger> sleep, a few more pages, ...

That could be handled by a metric which says: if the disk is spun down,
wait until there is more memory pressure before writing.  But if the
disk is spinning, we don't care; you should start writing out buffers
at some low rate to keep the pressure from rising too rapidly.  

The idea of buffers is more to keep from overloading the disk
subsystem with IO, not to stop IO from happening at all.  And to keep
it from going from no IO to full-out IO that stalls the system.  It
should be a nice line: as VM pressure goes up, the buffer-flushing IO
rate goes up as well.  
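
Something as simple as a linear ramp would do as a first cut (illustrative
C, all names and constants invented):

/* Map VM pressure (0-100%) to a background flush batch size.
 * Constants are made up; the point is the smooth ramp, not the values. */
unsigned int pages_to_flush(unsigned int pressure_pct, int disk_spun_down)
{
	const unsigned int max_batch = 256;	/* pages per pass, invented */

	if (disk_spun_down && pressure_pct < 50)
		return 0;	/* let a sleeping laptop disk lie */
	return (max_batch * pressure_pct) / 100;
}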

Overall, I think Rik, Jonathan and the rest of the hard-core VM crew
have been doing a great job with the 2.4.5+ work; it seems like it's
getting better and better all the time, and I really appreciate it.

We're now more into some corner cases and tuning issues.  Hopefully.

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  8:47       ` Daniel Phillips
@ 2001-06-14 20:23         ` Roger Larsson
  2001-06-15  6:04           ` Mike Galbraith
  2001-06-14 20:39         ` John Stoffel
  1 sibling, 1 reply; 19+ messages in thread
From: Roger Larsson @ 2001-06-14 20:23 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linux-Kernel

On Thursday 14 June 2001 10:47, Daniel Phillips wrote:
> On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> > On Wed, 13 Jun 2001, Tom Sightler wrote:
> > > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > > After the initial burst, the system should stabilise,
> > > > starting the writeout of pages before we run low on
> > > > memory. How to handle the initial burst is something
> > > > I haven't figured out yet ... ;)
> > >
> > > Well, at least I know that this is expected with the VM, although I do
> > > still think this is bad behavior.  If my disk is idle why would I wait
> > > until I have greater than 100MB of data to write before I finally
> > > start actually moving some data to disk?
> >
> > The file _could_ be a temporary file, which gets removed
> > before we'd get around to writing it to disk. Sure, the
> > chances of this happening with a single file are close to
> > zero, but having 100MB from 200 different temp files on a
> > shell server isn't unreasonable to expect.
>
> This still doesn't make sense if the disk bandwidth isn't being used.
>

It does if you are running on a laptop. Then you do not want the pages to
go out all the time. The disk has gone to sleep, needs to start up to write a few
pages, stays idle for a while, goes to sleep, a few more pages, ...

/RogerL

-- 
Roger Larsson
Skellefteå
Sweden



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14 15:10       ` John Stoffel
@ 2001-06-14 18:25         ` Daniel Phillips
  0 siblings, 0 replies; 19+ messages in thread
From: Daniel Phillips @ 2001-06-14 18:25 UTC (permalink / raw)
  To: John Stoffel; +Cc: Rik van Riel, Tom Sightler, Linux-Kernel

On Thursday 14 June 2001 17:10, John Stoffel wrote:
> >> The file _could_ be a temporary file, which gets removed before
> >> we'd get around to writing it to disk. Sure, the chances of this
> >> happening with a single file are close to zero, but having 100MB
> >> from 200 different temp files on a shell server isn't unreasonable
> >> to expect.
>
> Daniel> This still doesn't make sense if the disk bandwidth isn't
> Daniel> being used.
>
> And can't you tell that a certain percentage of buffers are owned by a
> single file/process?  It would seem that a simple metric of
>
>        if ##% of the buffer/cache is used by 1 process/file, start
>        writing the file out to disk, even if there is no pressure.
>
> might do the trick to handle this case.

Buffers and file pages are owned by the VFS, not processes per se, so it 
makes accounting harder.  In this case you don't care: it's a file, so in the 
absence of memory pressure and with disk bandwidth available it's better to 
get the data onto disk sooner rather than later.  (This glosses over the 
question of mmap's, by the way.)  It's pretty hard to see why there is any 
benefit at all in delaying, but it's clear there's a benefit in terms of data 
safety and a further benefit in terms of doing what the user expects.

> >> Maybe we should just see if anything in the first few MB of
> >> inactive pages was freeable, limiting the first scan to something
> >> like 1 or maybe even 5 MB maximum (freepages.min?  freepages.high?)
> >> and flushing as soon as we find more unfreeable pages than that ?
>
> Daniel> For file-backed pages what we want is pretty simple: when 1)
> Daniel> disk bandwidth is less than xx% used 2) memory pressure is
> Daniel> moderate, just submit whatever's dirty.  As pressure increases
> Daniel> and bandwidth gets loaded up (including read traffic) leave
> Daniel> things on the inactive list longer to allow more chances for
> Daniel> combining and better clustering decisions.
>
> Would it also be good to say that pressure should increase as the
> buffer.free percentage goes down?

Maybe - right now getblk waits until it runs completely out of buffers of a 
given size before trying to allocate more, which means that sometimes an IO 
will be delayed by the time it takes to complete a page_launder cycle.  Two 
reasons why it may not be worth doing anything about this: 1) we will move 
most of the buffer users into the page cache in due course, and 2) the frequency 
of this kind of IO delay is *probably* pretty low.

> It won't stop you from filling the
> buffer, but it should at least start pushing out pages to disk
> earlier.

--
Daniel


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  9:24         ` Helge Hafting
@ 2001-06-14 17:38           ` Mark Hahn
  2001-06-15  8:27             ` Helge Hafting
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Hahn @ 2001-06-14 17:38 UTC (permalink / raw)
  To: Helge Hafting; +Cc: linux-kernel

> > Would it be possible to maintain a dirty-rate count
> > for the dirty buffers?
> > 
> > For example, it is possible to figure an approximate
> > disk subsystem speed from most of the given information.
> 
> Disk speed is difficult.  I may enable and disable swap on any number of
...
> You may be able to get some useful approximations, but you
> will probably not be able to get good numbers in all cases.

a useful approximation would be simply an idle flag.
for instance, if the disk is idle, then cleaning a few 
inactive-dirty pages would make perfect sense, even in 
the absence of memory pressure.



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  3:16     ` Rik van Riel
  2001-06-14  7:59       ` Laramie Leavitt
  2001-06-14  8:47       ` Daniel Phillips
@ 2001-06-14 15:10       ` John Stoffel
  2001-06-14 18:25         ` Daniel Phillips
  2 siblings, 1 reply; 19+ messages in thread
From: John Stoffel @ 2001-06-14 15:10 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Rik van Riel, Tom Sightler, Linux-Kernel


>> The file _could_ be a temporary file, which gets removed before
>> we'd get around to writing it to disk. Sure, the chances of this
>> happening with a single file are close to zero, but having 100MB
>> from 200 different temp files on a shell server isn't unreasonable
>> to expect.

Daniel> This still doesn't make sense if the disk bandwidth isn't
Daniel> being used.

And can't you tell that a certain percentage of buffers are owned by a
single file/process?  It would seem that a simple metric of

       if ##% of the buffer/cache is used by 1 process/file, start
       writing the file out to disk, even if there is no pressure.

might do the trick to handle this case.  
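
Roughly, in C (the threshold is deliberately a parameter, since the ##%
above is left unspecified; the names are invented for illustration):

/* Illustrative per-file dirty accounting; hypothetical names. */
struct file_account {
	unsigned long dirty_pages;	/* dirty page-cache pages backed by this file */
};

/* threshold_pct stands in for the unspecified ##%. */
int should_start_writeout(const struct file_account *f,
			  unsigned long total_cache_pages,
			  unsigned int threshold_pct)
{
	/* integer form of: f->dirty / total > threshold / 100 */
	return f->dirty_pages * 100 > total_cache_pages * threshold_pct;
}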

>> Maybe we should just see if anything in the first few MB of
>> inactive pages was freeable, limiting the first scan to something
>> like 1 or maybe even 5 MB maximum (freepages.min?  freepages.high?)
>> and flushing as soon as we find more unfreeable pages than that ?

Daniel> For file-backed pages what we want is pretty simple: when 1)
Daniel> disk bandwidth is less than xx% used 2) memory pressure is
Daniel> moderate, just submit whatever's dirty.  As pressure increases
Daniel> and bandwidth gets loaded up (including read traffic) leave
Daniel> things on the inactive list longer to allow more chances for
Daniel> combining and better clustering decisions.

Would it also be good to say that pressure should increase as the
buffer.free percentage goes down?  It won't stop you from filling the
buffer, but it should at least start pushing out pages to disk
earlier.  

John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  7:59       ` Laramie Leavitt
@ 2001-06-14  9:24         ` Helge Hafting
  2001-06-14 17:38           ` Mark Hahn
  0 siblings, 1 reply; 19+ messages in thread
From: Helge Hafting @ 2001-06-14  9:24 UTC (permalink / raw)
  To: lar, linux-kernel

Laramie Leavitt wrote:

> Would it be possible to maintain a dirty-rate count
> for the dirty buffers?
> 
> For example, it is possible to figure an approximate
> disk subsystem speed from most of the given information.

Disk speed is difficult.  I may enable and disable swap on any number of
very different disks and files.  And making it per-device won't help that
much.  The device may have other partitions with varying access patterns,
and sometimes different devices interfere with each other, such
as two IDE drives on the same cable, or several SCSI drives
using up SCSI (or PCI!) bandwidth for file access.

You may be able to get some useful approximations, but you
will probably not be able to get good numbers in all cases.

Helge Hafting


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  3:16     ` Rik van Riel
  2001-06-14  7:59       ` Laramie Leavitt
@ 2001-06-14  8:47       ` Daniel Phillips
  2001-06-14 20:23         ` Roger Larsson
  2001-06-14 20:39         ` John Stoffel
  2001-06-14 15:10       ` John Stoffel
  2 siblings, 2 replies; 19+ messages in thread
From: Daniel Phillips @ 2001-06-14  8:47 UTC (permalink / raw)
  To: Rik van Riel, Tom Sightler; +Cc: Linux-Kernel

On Thursday 14 June 2001 05:16, Rik van Riel wrote:
> On Wed, 13 Jun 2001, Tom Sightler wrote:
> > Quoting Rik van Riel <riel@conectiva.com.br>:
> > > After the initial burst, the system should stabilise,
> > > starting the writeout of pages before we run low on
> > > memory. How to handle the initial burst is something
> > > I haven't figured out yet ... ;)
> >
> > Well, at least I know that this is expected with the VM, although I do
> > still think this is bad behavior.  If my disk is idle why would I wait
> > until I have greater than 100MB of data to write before I finally
> > start actually moving some data to disk?
>
> The file _could_ be a temporary file, which gets removed
> before we'd get around to writing it to disk. Sure, the
> chances of this happening with a single file are close to
> zero, but having 100MB from 200 different temp files on a
> shell server isn't unreasonable to expect.

This still doesn't make sense if the disk bandwidth isn't being used.

> Maybe we should just see if anything in the first few MB
> of inactive pages was freeable, limiting the first scan to
> something like 1 or maybe even 5 MB maximum (freepages.min?
> freepages.high?) and flushing as soon as we find more unfreeable
> pages than that ?

There are two cases, file-backed and swap-backed pages.

For file-backed pages what we want is pretty simple: when 1) disk bandwidth 
is less than xx% used and 2) memory pressure is moderate, just submit whatever's 
dirty.  As pressure increases and bandwidth gets loaded up (including read 
traffic), leave things on the inactive list longer to allow more chances for 
combining and better clustering decisions.
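
As a sketch of that decision (the xx% stays a parameter; the enum and names
are invented for illustration, this is not the real page_launder logic):

enum pressure { PRESSURE_LOW, PRESSURE_MODERATE, PRESSURE_HIGH };

/* bandwidth_pct: current disk utilisation, 0-100.
 * max_bandwidth_pct: stands in for the unspecified xx%. */
int submit_dirty_now(unsigned int bandwidth_pct,
		     unsigned int max_bandwidth_pct,
		     enum pressure p)
{
	if (p == PRESSURE_HIGH)
		return 0;	/* hold back; let reclaim batch and cluster the writes */
	return bandwidth_pct < max_bandwidth_pct;
}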

There is no such obvious answer for swap-backed pages; the main difference is 
what should happen under low-to-moderate pressure.  On a server we probably 
want to pre-write as many inactive/dirty pages to swap as possible in order 
to respond better to surges, even when pressure is low.  We don't want this 
behaviour on a laptop, otherwise the disk would never spin down.  There's a 
configuration parameter in there somewhere.

--
Daniel


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-13 20:21 ` Rik van Riel
  2001-06-14  1:49   ` Tom Sightler
@ 2001-06-14  8:30   ` Mike Galbraith
  1 sibling, 0 replies; 19+ messages in thread
From: Mike Galbraith @ 2001-06-14  8:30 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Tom Sightler, Linux-Kernel

On Wed, 13 Jun 2001, Rik van Riel wrote:

> On Wed, 13 Jun 2001, Tom Sightler wrote:
>
> > 1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> > close to wire speed).  At this point Linux has yet to write the first byte to
> > disk.  OK, this might be exaggerated, but very little disk activity has
> > occurred on my laptop.
> >
> > 2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> > maybe I should do that" then the hard drive light comes on solid for several
> > seconds.  During this time the ftp transfer drops to about 1/5 of the original
> > speed.
> >
> > 3.  After the initial burst of data is written things seem much more reasonable,
> > and data streams to the disk almost continually while the rest of the transfer
> > completes at near full speed again.
> >
> > Basically, it seems the kernel buffers all of the incoming file up to nearly
> > available memory before it begins to panic and starts flushing the file to disk.
> >  It seems it should start to lazy write somewhat earlier.
> > Perhaps some of this is tuneable from userland and I just don't
> > know how.
>
> Actually, it already does the lazy write earlier.
>
> The page reclaim code scans up to 1/4th of the inactive_dirty
> pages on the first loop, where it does NOT write things to
> disk.

I've done some experiments with a _clean_ shortage.  Requiring that a
portion of inactive pages be pre-cleaned improves response as you start
reclaiming.  Even though you may have enough inactive pages total, you
know that laundering is needed before things get heavy.  This gets the
dirty pages moving a little sooner.  As you're reclaiming pages, writes
trickle out whether your dirty list is short or long.  (and if I'd been
able to make that idea work a little better, you'd have seen my mess;)

	-Mike



* RE: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  3:16     ` Rik van Riel
@ 2001-06-14  7:59       ` Laramie Leavitt
  2001-06-14  9:24         ` Helge Hafting
  2001-06-14  8:47       ` Daniel Phillips
  2001-06-14 15:10       ` John Stoffel
  2 siblings, 1 reply; 19+ messages in thread
From: Laramie Leavitt @ 2001-06-14  7:59 UTC (permalink / raw)
  To: Linux-Kernel

On Behalf Of Rik van Riel
> On Wed, 13 Jun 2001, Tom Sightler wrote:
> > Quoting Rik van Riel <riel@conectiva.com.br>:
> > 
> > > After the initial burst, the system should stabilise,
> > > starting the writeout of pages before we run low on
> > > memory. How to handle the initial burst is something
> > > I haven't figured out yet ... ;)
> > 
> > Well, at least I know that this is expected with the VM, although I do
> > still think this is bad behavior.  If my disk is idle why would I wait
> > until I have greater than 100MB of data to write before I finally
> > start actually moving some data to disk?
> 
> The file _could_ be a temporary file, which gets removed
> before we'd get around to writing it to disk. Sure, the
> chances of this happening with a single file are close to
> zero, but having 100MB from 200 different temp files on a
> shell server isn't unreasonable to expect.
> 
> > > This is due to this smarter handling of the flushing of
> > > dirty pages and due to a more subtle bug where the system
> > > ended up doing synchronous IO on too many pages, whereas
> > > now it only does synchronous IO on _1_ page per scan ;)
> > 
> > And this is definitely a noticeable fix, thanks for your continued
> > work.  I know it's hard to get everything balanced out right, and I
> > only wrote this email to describe some behavior I was seeing and make
> > sure it was expected in the current VM.  You've let me know that it
> > is, and it's really minor compared to problems some of the earlier
> > kernels had.
> 
> I'll be sure to keep this problem in mind. I really want
> to fix it, I just haven't figured out how yet  ;)
> 
> Maybe we should just see if anything in the first few MB
> of inactive pages was freeable, limiting the first scan to
> something like 1 or maybe even 5 MB maximum (freepages.min?
> freepages.high?) and flushing as soon as we find more unfreeable
> pages than that ?
> 

Would it be possible to maintain a dirty-rate count 
for the dirty buffers?

For example, it is possible to figure an approximate
disk subsystem speed from most of the given information.
If it is possible to know the rate at which new buffers
are being dirtied then we could compare that to the available
memory and the disk speed to calculate some maintainable
rate at which buffers need to be expired.  The rates would
have to maintain some historical data to account for
bursty data...

It may be possible to use a very similar mechanism to do
both.  I.e. not actually calculate the rate from the hardware,
but use a similar counter for the expiry rate of buffers.

I don't know how difficult the accounting would be,
but it seems possible to make it self-tuning.

This is a little different than just keeping a list
of dirty buffers and free buffers because you have
the rate information which tells you how long you
have until all the buffers expire.
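
For example, something as simple as a decayed per-second counter (all names
and the 7/8 weighting are invented, just to show the shape):

/* Smoothed estimate of how fast buffers are being dirtied; illustrative only. */
struct dirty_rate {
	unsigned long rate;	/* pages/sec, exponentially decayed */
	unsigned long dirtied;	/* pages dirtied since the last sample */
};

/* Sample once per second, e.g. from a kupdate-style timer. */
void sample_dirty_rate(struct dirty_rate *dr)
{
	/* 7/8 old + 1/8 new keeps enough history to ride out bursts */
	dr->rate = (dr->rate * 7 + dr->dirtied) / 8;
	dr->dirtied = 0;
}

/* Target background expiry rate: keep up with the dirtying rate, but never
 * ask for more than the estimated writeout rate the disks can deliver
 * (beyond that point the writer has to be throttled anyway). */
unsigned long expiry_rate(const struct dirty_rate *dr, unsigned long writeout_rate)
{
	return dr->rate < writeout_rate ? dr->rate : writeout_rate;
}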

Laramie.


* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-14  1:49   ` Tom Sightler
@ 2001-06-14  3:16     ` Rik van Riel
  2001-06-14  7:59       ` Laramie Leavitt
                         ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Rik van Riel @ 2001-06-14  3:16 UTC (permalink / raw)
  To: Tom Sightler; +Cc: Linux-Kernel

On Wed, 13 Jun 2001, Tom Sightler wrote:
> Quoting Rik van Riel <riel@conectiva.com.br>:
> 
> > After the initial burst, the system should stabilise,
> > starting the writeout of pages before we run low on
> > memory. How to handle the initial burst is something
> > I haven't figured out yet ... ;)
> 
> Well, at least I know that this is expected with the VM, although I do
> still think this is bad behavior.  If my disk is idle why would I wait
> until I have greater than 100MB of data to write before I finally
> start actually moving some data to disk?

The file _could_ be a temporary file, which gets removed
before we'd get around to writing it to disk. Sure, the
chances of this happening with a single file are close to
zero, but having 100MB from 200 different temp files on a
shell server isn't unreasonable to expect.

> > This is due to this smarter handling of the flushing of
> > dirty pages and due to a more subtle bug where the system
> > ended up doing synchronous IO on too many pages, whereas
> > now it only does synchronous IO on _1_ page per scan ;)
> 
> And this is definitely a noticeable fix, thanks for your continued
> work.  I know it's hard to get everything balanced out right, and I
> only wrote this email to describe some behavior I was seeing and make
> sure it was expected in the current VM.  You've let me know that it
> is, and it's really minor compared to problems some of the earlier
> kernels had.

I'll be sure to keep this problem in mind. I really want
to fix it, I just haven't figured out how yet  ;)

Maybe we should just see if anything in the first few MB
of inactive pages was freeable, limiting the first scan to
something like 1 or maybe even 5 MB maximum (freepages.min?
freepages.high?) and flushing as soon as we find more unfreeable
pages than that ?

Maybe another solution with bdflush tuning ?

I'll send a patch as soon as I figure this one out...

regards,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-13 20:21 ` Rik van Riel
@ 2001-06-14  1:49   ` Tom Sightler
  2001-06-14  3:16     ` Rik van Riel
  2001-06-14  8:30   ` Mike Galbraith
  1 sibling, 1 reply; 19+ messages in thread
From: Tom Sightler @ 2001-06-14  1:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Tom Sightler, Linux-Kernel

Quoting Rik van Riel <riel@conectiva.com.br>:

> After the initial burst, the system should stabilise,
> starting the writeout of pages before we run low on
> memory. How to handle the initial burst is something
> I haven't figured out yet ... ;)

Well, at least I know that this is expected with the VM, although I do still
think this is bad behavior.  If my disk is idle why would I wait until I have
greater than 100MB of data to write before I finally start actually moving some
data to disk?
 
> This is due to this smarter handling of the flushing of
> dirty pages and due to a more subtle bug where the system
> ended up doing synchronous IO on too many pages, whereas
> now it only does synchronous IO on _1_ page per scan ;)

And this is definitely a noticeable fix, thanks for your continued work.  I know
it's hard to get everything balanced out right, and I only wrote this email to
describe some behavior I was seeing and make sure it was expected in the current
VM.  You've let me know that it is, and it's really minor compared to problems
some of the earlier kernels had.

Later,
Tom



* Re: 2.4.6-pre2, pre3 VM Behavior
  2001-06-13 19:31 Tom Sightler
@ 2001-06-13 20:21 ` Rik van Riel
  2001-06-14  1:49   ` Tom Sightler
  2001-06-14  8:30   ` Mike Galbraith
  0 siblings, 2 replies; 19+ messages in thread
From: Rik van Riel @ 2001-06-13 20:21 UTC (permalink / raw)
  To: Tom Sightler; +Cc: Linux-Kernel

On Wed, 13 Jun 2001, Tom Sightler wrote:

> 1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
> close to wire speed).  At this point Linux has yet to write the first byte to
> disk.  OK, this might be exaggerated, but very little disk activity has
> occurred on my laptop.
>
> 2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
> maybe I should do that" then the hard drive light comes on solid for several
> seconds.  During this time the ftp transfer drops to about 1/5 of the original
> speed.
>
> 3.  After the initial burst of data is written things seem much more reasonable,
> and data streams to the disk almost continually while the rest of the transfer
> completes at near full speed again.
>
> Basically, it seems the kernel buffers all of the incoming file up to nearly
> available memory before it begins to panic and starts flushing the file to disk.
>  It seems it should start to lazy write somewhat earlier.
> Perhaps some of this is tuneable from userland and I just don't
> know how.

Actually, it already does the lazy write earlier.

The page reclaim code scans up to 1/4th of the inactive_dirty
pages on the first loop, where it does NOT write things to
disk.

On the second loop, we start asynchronous writeout of data
to disk and scan up to 1/2 of the inactive_dirty pages,
trying to find clean pages to free.

Only when there simply are no clean pages do we resort to
synchronous IO, and the system will wait for pages to be
cleaned.
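
In heavily simplified outline (this is not the actual page_launder() code,
and scan_inactive_dirty() here is just a stand-in for the real scanner):

/* Stand-in for the real scanner: look at up to max_scan inactive_dirty
 * pages, optionally starting async writeout, optionally waiting on IO. */
extern int scan_inactive_dirty(int max_scan, int start_writeout, int wait_on_io);

int reclaim(int target, int nr_inactive_dirty)
{
	int freed;

	/* Loop 1: scan up to 1/4 of inactive_dirty, free clean pages only. */
	freed = scan_inactive_dirty(nr_inactive_dirty / 4, 0, 0);
	if (freed >= target)
		return freed;

	/* Loop 2: scan up to 1/2, start asynchronous writeout of dirty pages. */
	freed += scan_inactive_dirty(nr_inactive_dirty / 2, 1, 0);
	if (freed >= target)
		return freed;

	/* Last resort: no clean pages left, do synchronous IO and wait. */
	return freed + scan_inactive_dirty(nr_inactive_dirty, 1, 1);
}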

After the initial burst, the system should stabilise,
starting the writeout of pages before we run low on
memory. How to handle the initial burst is something
I haven't figured out yet ... ;)

> Anyway, things are still much better; with older kernels things
> would almost seem locked up during those 10-15 seconds, but now
> my apps stay fairly responsive (I can still type in AbiWord,
> browse in Mozilla, etc).

This is due to this smarter handling of the flushing of
dirty pages and due to a more subtle bug where the system
ended up doing synchronous IO on too many pages, whereas
now it only does synchronous IO on _1_ page per scan ;)

regards,

Rik
--
Linux MM bugzilla: http://linux-mm.org/bugzilla.shtml

Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/



* 2.4.6-pre2, pre3 VM Behavior
@ 2001-06-13 19:31 Tom Sightler
  2001-06-13 20:21 ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Sightler @ 2001-06-13 19:31 UTC (permalink / raw)
  To: Linux-Kernel

Hi All,

I have been using the 2.4.x kernels since the 2.4.0-test days on my Dell 5000e
laptop with 320MB of RAM and have experienced first-hand many of the problems
other users have reported with the VM system in 2.4.  Most of these problems
have been only minor annoyances, and I have continued testing kernels as the 2.4
series has continued, mostly without noticing much change.

With 2.4.6-pre2 and -pre3 I can say that I have seen a marked improvement on my
machine, especially in interactive response, for my day-to-day workstation use.
However, I do have one observation that seems rather strange, or at least wrong.

I, on occasion, have the need to transfer relatively large files (750MB-1GB)
from our larger Linux servers to my machine.  I usually use ftp to transfer
these files and this is where I notice the following:

1.  Transfer of the first 100-150MB is very fast (9.8MB/sec via 100Mb Ethernet,
close to wire speed).  At this point Linux has yet to write the first byte to
disk.  OK, this might be exaggerated, but very little disk activity has
occurred on my laptop.

2.  Suddenly it's as if Linux says, "Damn, I've got a lot of data to flush,
maybe I should do that" then the hard drive light comes on solid for several
seconds.  During this time the ftp transfer drops to about 1/5 of the original
speed.

3.  After the initial burst of data is written things seem much more reasonable,
and data streams to the disk almost continually while the rest of the transfer
completes at near full speed again.

Basically, it seems the kernel buffers all of the incoming file up to nearly
all available memory before it begins to panic and starts flushing the file to
disk.  It seems it should start to lazy write somewhat earlier.  Perhaps some of
this is tuneable from userland and I just don't know how.

This was much less noticeable on a server with a much faster SCSI hard disk
subsystem as it took significantly less time to flush the information to the
disk once it finally started, but laptop hard drives are traditionally poor
performers, and at 15MB/s it takes 10-15 seconds before things settle out, just
from transferring a file.

Anyway, things are still much better; with older kernels things would almost
seem locked up during those 10-15 seconds, but now my apps stay fairly responsive
(I can still type in AbiWord, browse in Mozilla, etc).

Later,
Tom

