linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: broken VM in 2.4.10-pre9
@ 2001-09-16 15:19 Ricardo Galli
  2001-09-16 15:23 ` Michael Rothwell
  0 siblings, 1 reply; 133+ messages in thread
From: Ricardo Galli @ 2001-09-16 15:19 UTC (permalink / raw)
  To: linux-kernel

> So whether Linux uses swap or not is a 100% meaningless indicator of
> "goodness". The only thing that matters is how well the job gets done,
> ie was it reasonably responsive, and did the big untars finish quickly..

I am running 2.4.9 on a PII with 448MB RAM. After listening to a couple of
hours of MP3s from /dev/cdrom with KDE running, more than 70MB went to
swap, about 300MB sat in cache, and KDE takes about 15-20 seconds just to
log out and show the greeting console.

Obviously, all apps went to disk to leave room for caching mp3 files that
are read only once. Although logging out is not a very frequent operation...

Regards,


--ricardo



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 15:19 broken VM in 2.4.10-pre9 Ricardo Galli
@ 2001-09-16 15:23 ` Michael Rothwell
  2001-09-16 16:33   ` Rik van Riel
  0 siblings, 1 reply; 133+ messages in thread
From: Michael Rothwell @ 2001-09-16 15:23 UTC (permalink / raw)
  To: Ricardo Galli; +Cc: linux-kernel

Is there a way to tell the VM to prune its cache? Or a way to limit the
amount of cache it uses?



On 16 Sep 2001 17:19:43 +0200, Ricardo Galli wrote:
> > So whether Linux uses swap or not is a 100% meaningless indicator of
> > "goodness". The only thing that matters is how well the job gets done,
> > ie was it reasonably responsive, and did the big untars finish quickly..
> 
> I am running 2.4.9 on a PII with 448MB RAM. After listening to a couple of
> hours of MP3s from /dev/cdrom with KDE running, more than 70MB went to
> swap, about 300MB sat in cache, and KDE takes about 15-20 seconds just to
> log out and show the greeting console.
> 
> Obviously, all apps went to disk to leave room for caching mp3 files that
> are read only once. Although logging out is not a very frequent operation...
> 
> Regards,
> 
> 
> --ricardo
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 15:23 ` Michael Rothwell
@ 2001-09-16 16:33   ` Rik van Riel
  2001-09-16 16:50     ` Andreas Steinmetz
                       ` (4 more replies)
  0 siblings, 5 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-16 16:33 UTC (permalink / raw)
  To: Michael Rothwell; +Cc: Ricardo Galli, linux-kernel

On 16 Sep 2001, Michael Rothwell wrote:

> Is there a way to tell the VM to prune its cache? Or a way to limit
> the amount of cache it uses?

Not yet, I'll make a quick hack for this when I get back next
week. It's pretty obvious now that the 2.4 kernel cannot get
enough information to select the right pages to evict from
memory.

For 2.5 I'm making a VM subsystem with reverse mappings, the
first iterations are giving very sweet performance so I will
continue with this project regardless of what other kernel
hackers might say ;)

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33   ` Rik van Riel
@ 2001-09-16 16:50     ` Andreas Steinmetz
  2001-09-16 17:12       ` Ricardo Galli
  2001-09-16 17:06     ` Ricardo Galli
                       ` (3 subsequent siblings)
  4 siblings, 1 reply; 133+ messages in thread
From: Andreas Steinmetz @ 2001-09-16 16:50 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, Ricardo Galli, Michael Rothwell


On 16-Sep-2001 Rik van Riel wrote:
> On 16 Sep 2001, Michael Rothwell wrote:
> 
>> Is there a way to tell the VM to prune its cache? Or a way to limit
>> the amount of cache it uses?
> 
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.
> 
In my experience you should try running aide
(ftp://ftp.cs.tut.fi/pub/src/gnu/aide-0.7.tar.gz) for tests. This is a case of
a single process doing a file system consistency check and stopping all other
processes cold through swapout caused by heavy caching. While aide runs, the
system just becomes unusable.


Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33   ` Rik van Riel
  2001-09-16 16:50     ` Andreas Steinmetz
@ 2001-09-16 17:06     ` Ricardo Galli
  2001-09-16 17:18       ` Jeremy Zawodny
  2001-09-16 18:45       ` Stephan von Krawczynski
  2001-09-16 18:16     ` broken VM in 2.4.10-pre9 Stephan von Krawczynski
                       ` (2 subsequent siblings)
  4 siblings, 2 replies; 133+ messages in thread
From: Ricardo Galli @ 2001-09-16 17:06 UTC (permalink / raw)
  To: linux-kernel

On Sun, 16 Sep 2001, Rik van Riel wrote:
>
> > Is there a way to tell the VM to prune its cache? Or a way to limit
> > the amount of cache it uses?
>
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.

....

On Sun, 16 Sep 2001, Jeremy Zawodny wrote:
>
> Agreed. It'd be great if there were an option to say "Don't swap out
> memory that was allocated by these programs. If you run out of disk
> buffers, toss the oldest ones and start re-using them."

Simpler, though (for cases like listening to mp3s and backups): cache pages
that were accessed only "once"(*) several seconds ago should be discarded
first. It only requires checking an access counter and a "last
accessed" epoch field of the page.

(*) Or by the same process/process group in a very short period, i.e. the
last-access timestamp should be updated only if the previous access was a
few seconds ago.
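A minimal userspace sketch of that check (the struct, field and function
names are invented for illustration; 2.4's struct page carries no such
counters):

```c
#include <time.h>

/* Hypothetical per-page bookkeeping; names are illustrative only. */
struct page_info {
	unsigned int access_count;  /* how many times the page was touched */
	time_t last_access;         /* epoch of the most recent touch */
};

/* Discard first: pages touched only once, several seconds ago. */
int is_use_once_stale(const struct page_info *p, time_t now, time_t threshold)
{
	return p->access_count <= 1 && (now - p->last_access) >= threshold;
}
```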


--ricardo


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:50     ` Andreas Steinmetz
@ 2001-09-16 17:12       ` Ricardo Galli
  0 siblings, 0 replies; 133+ messages in thread
From: Ricardo Galli @ 2001-09-16 17:12 UTC (permalink / raw)
  To: linux-kernel

On Sun, 16 Sep 2001, Andreas Steinmetz wrote:

> More easy though (for cases of listening mp3's and backups): cache pages
> that were accesed only "once"(*) several seconds ago must be discarded
> first. It only implies a check against an access counter and a "last
> accesed" epoch fields of the page.
>
>
> (*) Or by the same process/process group in a very short period, i.e. the
> last-access timestamp should be updated only if the previous access was
  ^^^^^^^^^^^^^^^^^^^^

Sorry, I wanted to say "access counter".

--ricardo



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 17:06     ` Ricardo Galli
@ 2001-09-16 17:18       ` Jeremy Zawodny
  2001-09-16 18:45       ` Stephan von Krawczynski
  1 sibling, 0 replies; 133+ messages in thread
From: Jeremy Zawodny @ 2001-09-16 17:18 UTC (permalink / raw)
  To: Ricardo Galli; +Cc: linux-kernel

On Sun, Sep 16, 2001 at 07:06:45PM +0200, Ricardo Galli wrote:
> On Sun, 16 Sep 2001, Rik van Riel wrote:
> >
> > > Is there a way to tell the VM to prune its cache? Or a way to limit
> > > the amount of cache it uses?
> >
> > Not yet, I'll make a quick hack for this when I get back next
> > week. It's pretty obvious now that the 2.4 kernel cannot get
> > enough information to select the right pages to evict from
> > memory.
> 
> ....
> 
> On Sun, 16 Sep 2001, Jeremy Zawodny wrote:
> >
> > Agreed. It'd be great if there were an option to say "Don't swap out
> > memory that was allocated by these programs. If you run out of disk
> > buffers, toss the oldest ones and start re-using them."
> 
> Simpler, though (for cases like listening to mp3s and backups): cache
> pages that were accessed only "once"(*) several seconds ago should be
> discarded first. It only requires checking an access counter
> and a "last accessed" epoch field of the page.

Yeah, something along those lines would be great.  It would keep a big
(13 GB) drive to drive file copy from causing a large (400MB) and
relatively active process from having its memory swapped out (and then
back in 20 seconds later).

Imagine watching that for 45 minutes while a backup that used to take
5-8 minutes runs.

Jeremy
-- 
Jeremy D. Zawodny     |  Perl, Web, MySQL, Linux Magazine, Yahoo!
<Jeremy@Zawodny.com>  |  http://jeremy.zawodny.com/

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33   ` Rik van Riel
  2001-09-16 16:50     ` Andreas Steinmetz
  2001-09-16 17:06     ` Ricardo Galli
@ 2001-09-16 18:16     ` Stephan von Krawczynski
  2001-09-16 19:43     ` Linus Torvalds
  2001-09-17  8:06     ` Eric W. Biederman
  4 siblings, 0 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 18:16 UTC (permalink / raw)
  To: Ricardo Galli; +Cc: linux-kernel

On Sun, 16 Sep 2001 19:06:45 +0200 (MET) Ricardo Galli <gallir@m3d.uib.es>
wrote:

> On Sun, 16 Sep 2001, Jeremy Zawodny wrote:
> >
> > Agreed. It'd be great if there were an option to say "Don't swap out
> > memory that was allocated by these programs. If you run out of disk
> > buffers, toss the oldest ones and start re-using them."
> 
> Simpler, though (for cases like listening to mp3s and backups): cache pages
> that were accessed only "once"(*) several seconds ago should be discarded
> first. It only requires checking an access counter and a "last
> accessed" epoch field of the page.

Well, I guess this is everybody's first idea about the problem: keep an
initial timestamp so you know how _old_ an allocation really is, and keep an
access counter. OK. The first is easy, but how do you implement the access
counter? If that were solved, the problem would be solved.
You could do really nice aging with access to that kind of information.
Any ideas?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 17:06     ` Ricardo Galli
  2001-09-16 17:18       ` Jeremy Zawodny
@ 2001-09-16 18:45       ` Stephan von Krawczynski
  2001-09-21  3:16         ` Bill Davidsen
                           ` (2 more replies)
  1 sibling, 3 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 18:45 UTC (permalink / raw)
  To: gallir; +Cc: linux-kernel

On Sun, 16 Sep 2001 20:16:57 +0200 Stephan von Krawczynski <skraw@ithnet.com>
wrote:

> On Sun, 16 Sep 2001 19:06:45 +0200 (MET) Ricardo Galli <gallir@m3d.uib.es>
> wrote:
> 
> > On Sun, 16 Sep 2001, Jeremy Zawodny wrote:
> > >
> > > Agreed. It'd be great if there were an option to say "Don't swap out
> > > memory that was allocated by these programs. If you run out of disk
> > > buffers, toss the oldest ones and start re-using them."
> > 
> > Simpler, though (for cases like listening to mp3s and backups): cache pages
> > that were accessed only "once"(*) several seconds ago should be discarded
> > first. It only requires checking an access counter and a "last
> > accessed" epoch field of the page.
> 
> Well, I guess this is everybody's first idea about the problem: make an
> initial timestamp for knowing how _old_ an allocation really is,

Thinking about it again, I guess I would prefer a FIFO list of allocated
pages. This would let you "know" a page's age simply from its position in the
list. You wouldn't need a timestamp then, and, even better, it works equally
well for systems under high and low VM load, because you are not dealing with
absolute time comparisons, but relative ones.
That sounds pretty good to me.
The problem of tracking page accesses is still not solved, but if you had an
idea for that, you could manipulate the alloc list simply by moving accessed
pages towards the end (by one or several positions) of the list, which
effectively "youngers" them. So you'd get around adding new members to any
structs, using simple (and fast) list operations.
Comments?
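A minimal userspace sketch of such a FIFO aging list (illustrative names,
not kernel code): position encodes relative age, the head is the oldest
page, and an access re-queues the page at the tail.

```c
#include <stddef.h>

/* Toy page with intrusive list links; "id" stands in for real page state. */
struct fpage {
	struct fpage *prev, *next;
	int id;
};

struct flist { struct fpage *head, *tail; };  /* head = oldest, tail = youngest */

static void list_unlink(struct flist *l, struct fpage *p)
{
	if (p->prev) p->prev->next = p->next; else l->head = p->next;
	if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
	p->prev = p->next = NULL;
}

/* New allocations and freshly accessed pages go to the tail ("youngest"). */
static void list_append(struct flist *l, struct fpage *p)
{
	p->prev = l->tail;
	p->next = NULL;
	if (l->tail) l->tail->next = p; else l->head = p;
	l->tail = p;
}

/* On access, re-queue the page at the tail, effectively "youngering" it. */
static void page_touch(struct flist *l, struct fpage *p)
{
	list_unlink(l, p);
	list_append(l, p);
}

/* Eviction always takes the head, i.e. the relatively oldest page. */
static struct fpage *evict_oldest(struct flist *l)
{
	struct fpage *p = l->head;
	if (p) list_unlink(l, p);
	return p;
}
```

No timestamps anywhere: only list operations, exactly as proposed above.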

Regards,
Stephan



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33   ` Rik van Riel
                       ` (2 preceding siblings ...)
  2001-09-16 18:16     ` broken VM in 2.4.10-pre9 Stephan von Krawczynski
@ 2001-09-16 19:43     ` Linus Torvalds
  2001-09-16 19:57       ` Rik van Riel
                         ` (4 more replies)
  2001-09-17  8:06     ` Eric W. Biederman
  4 siblings, 5 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16 19:43 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.33L.0109161330000.9536-100000@imladris.rielhome.conectiva>,
Rik van Riel  <riel@conectiva.com.br> wrote:
>On 16 Sep 2001, Michael Rothwell wrote:
>
>> Is there a way to tell the VM to prune its cache? Or a way to limit
>> the amount of cache it uses?
>
>Not yet, I'll make a quick hack for this when I get back next
>week. It's pretty obvious now that the 2.4 kernel cannot get
>enough information to select the right pages to evict from
>memory.

Don't be stupid.

The described behaviour has nothing to do with limiting the cache or
anything else "cannot get enough information", except for the fact that
the kernel obviously cannot know what will happen in the future.

The kernel _correctly_ swapped out tons of pages that weren't touched in
a long long time. That's what you want to happen - the fact that they
then all became active on logout is sad.

The fact that the "use-once" logic didn't kick in is the problem. It's
hard to tell _why_ it didn't kick in, possibly because the MP3 player
read small chunks of the pages (touching them multiple times). 

THAT is worth looking into. But blathering about "reverse mappings will
help this" is just incredibly stupid. You seem to think that they are a
panacea for all problems, ranging from MP3 playback to world peace and
re-building the WTC.

		Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:43     ` Linus Torvalds
@ 2001-09-16 19:57       ` Rik van Riel
  2001-09-16 20:17       ` Rik van Riel
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-16 19:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sun, 16 Sep 2001, Linus Torvalds wrote:

> The described behaviour has nothing to do with limiting the cache or
> anything else "cannot get enough information", except for the fact that
> the kernel obviously cannot know what will happen in the future.
>
> The kernel _correctly_ swapped out tons of pages that weren't touched in
> a long long time. That's what you want to happen - the fact that they
> then all became active on logout is sad.

The problem is that a too-large cache reliably makes the
system unsuitable for interactive use. In that case it's
probably worth making eviction of pages from that cache
more likely than eviction of pages from user processes,
while still giving the truly busy cache pages a chance to
stay resident.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:43     ` Linus Torvalds
  2001-09-16 19:57       ` Rik van Riel
@ 2001-09-16 20:17       ` Rik van Riel
  2001-09-16 20:29       ` Andreas Steinmetz
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-16 20:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sun, 16 Sep 2001, Linus Torvalds wrote:

> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times).

It's probably because it used mmap(); all mp3 players seem
to do that. If they also used MADV_SEQUENTIAL, I guess it'd
be easy to also do drop-behind on them, though...

> THAT is worth looking into. But blathering about "reverse mappings
> will help this" is just incredibly stupid. You seem to think that they
> are a panacea for all problems,

Absolutely not; what reverse mappings give us is an easier
framework for getting the other decisions right. By implementing
_just_ the reverse mappings and leaving the other stuff the
same I've already found my desktop to be more usable. This
seems to be the effect of the fact that reverse mappings
allow us to get page aging right, because we see all referenced
bits on a page. If you think we can do this without reverse
mappings, I only have to point at Linux 1.2, 2.0, 2.2 and 2.4
as a suggestion to the contrary. If it were possible, surely we
would have succeeded in 7 years of trying?

Add to that the fact that reverse mappings make it trivial to
do things like defragmenting memory a bit to make sure fork()
succeeds or sparc64 users can allocate page tables, or keeping
the page tables mapped until the page is cleaned
in page_launder() (reducing soft page faults), or doing a physical
page scan to deal with gross imbalances between memory zones, and
we'll have something which, IMHO, is worth experimenting with.

Sure, reverse mappings also have disadvantages, like one extra
pointer in struct page, or as much as 2 pointers per mapping
for shared pages, and a slight complication of the locking, but
I'm not convinced these disadvantages are so severe that we should
continue the VM the same way we have failed to make it work right
for the last 7 years.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:43     ` Linus Torvalds
  2001-09-16 19:57       ` Rik van Riel
  2001-09-16 20:17       ` Rik van Riel
@ 2001-09-16 20:29       ` Andreas Steinmetz
  2001-09-16 21:28         ` Linus Torvalds
  2001-09-17  0:37       ` Daniel Phillips
  2001-09-21  3:10       ` Bill Davidsen
  4 siblings, 1 reply; 133+ messages in thread
From: Andreas Steinmetz @ 2001-09-16 20:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times). 
> 
Then you should keep an eye on mmap(). aide uses it. And it causes a real
problem. The basic logic is:

fd = open(file, O_RDONLY);
buf = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
/* do md5sum of mapped file */
munmap(buf, file_size);
close(fd);

No matter what you call the thing above (not my code, anyway): I strongly feel
that the use-once logic isn't a great idea. What if lots of pages get accessed
twice? Where do you set the limit?
How about adding a swapout cost factor? This would prevent swapping until
pressure is really high, without any fixed limits. Reusing a clean page costs
microseconds, whereas a swapout followed by a swapin is going to cost
milliseconds. That's a factor of at least 1000 which should be applied in
page selection.
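A toy sketch of such cost-weighted victim selection (the constants and
names are invented for illustration, not measured or taken from the
kernel):

```c
/* Rough relative reclaim costs: dropping a clean cache page is nearly
 * free, writing back a dirty one costs an I/O, and swapping out an
 * anonymous page likely costs a swapout plus a later swapin. */
enum page_kind { PAGE_CLEAN_CACHE, PAGE_DIRTY_CACHE, PAGE_ANON };

static unsigned long evict_cost(enum page_kind kind)
{
	switch (kind) {
	case PAGE_CLEAN_CACHE: return 1;    /* just drop it */
	case PAGE_DIRTY_CACHE: return 500;  /* one writeback */
	case PAGE_ANON:        return 1000; /* swapout + likely swapin */
	}
	return 1000;
}

/* Prefer the candidate that is cheaper to reclaim. */
static enum page_kind pick_victim(enum page_kind a, enum page_kind b)
{
	return evict_cost(a) <= evict_cost(b) ? a : b;
}
```

With weights like these, anonymous pages only get swapped once the cheap
cache pages are exhausted, which is the effect argued for above.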


Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 20:29       ` Andreas Steinmetz
@ 2001-09-16 21:28         ` Linus Torvalds
  2001-09-16 22:47           ` Alex Bligh - linux-kernel
  2001-09-16 22:59           ` Stephan von Krawczynski
  0 siblings, 2 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16 21:28 UTC (permalink / raw)
  To: Andreas Steinmetz; +Cc: linux-kernel


On Sun, 16 Sep 2001, Andreas Steinmetz wrote:
>
> > The fact that the "use-once" logic didn't kick in is the problem. It's
> > hard to tell _why_ it didn't kick in, possibly because the MP3 player
> > read small chunks of the pages (touching them multiple times).
> >
> Then you should keep an eye on mmap(). aide uses it. And it causes a real
> problem. The basic logic is:
>
> fd = open(file, O_RDONLY);
> buf = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
> /* do md5sum of mapped file */
> munmap(buf, file_size);
> close(fd);

Okey-dokey.

I actually started looking at the current Linux page referenced logic, and
it just looks incredibly broken. There's no logic to it, and it's obvious
how some of it has grown over time without people really understanding or
caring about the referenced bit.

It looks very much like part of the VM was done with only "page->age", and
another part was done with the reference bit. So some users will totally
ignore the information that other users use and update. It's not pretty.

> No matter how you call the thing above (not my code, anyway): I strongly feel
> that the use once logic isn't a great idea. What if lots of pages get accessed
> twice? Where to set the limit?

Actually, the once-used approach _should_ work fine for mmap'ed pages too,
but the fact is that the code didn't even try, partly because the mmap
code was the code that used page->age and didn't care about the referenced
bit at all (except it _did_ care about the referenced bit in the page
tables: just not the bit in "struct page". And it's the latter bit that
actually ends up being the best once-used logic).

> How about adding a swapout cost factor? This would prevent swapping until
> pressure is really high without any fixed limits. Calculate clean page reuse in
> microseconds whereas swapout followed by swapin is going to be milliseconds.
> That's a factor of at least 1000 which needs to be applied in page selection.

Well, the thing is, swap-out is often cheaper than read-in, and just
dropping a page is often the cheapest of all. And all of these things are
a bit intertwined.

I actually have a _sane_ generic "used-once" approach that works with
mmap'ed memory and with other kinds too, and right now it doesn't bother
with "page->age" _at_all_. Instead, the aging is done by moving things
from one list to another, which actually seems to be better, but who
knows.

And that automatically gets used-once right - any pages are always added
to the inactive lists, and get bumped up to active only after they are
physically referenced the second time. This is actually incredibly trivial
to do without any aging at all:

	void mark_page_accessed(struct page *page)
	{
	        if (!PageActive(page) && PageReferenced(page)) {
	                activate_page(page);
	                ClearPageReferenced(page);
	                return;
	        }

	        /* Mark the page referenced, AFTER checking for previous usage..  */
	        SetPageReferenced(page);
	}

and the other important part that we got (completely) wrong wrt the
use-once logic is the fact that when we scan the inactive lists and find a
page that is marked "referenced", we should NOT move it to the active list
(that defeats the whole point of use-once); we should instead just
clear the reference bit and move it to the head of the right inactive
list.

So it actually looks like the use-once logic only worked under some very
specific circumstances, not in general.

Anybody willing to test the simple used-once cleanups? No guarantees, but
at least they make sense (some of the old interactions certainly do not).

(The new code is a simple state machine:

 - touch non-referenced page: set the reference bit

 - touch already referenced page: move it to next list "upwards" (ie the
   active list)

 - age a non-referenced page on a list: move to "next" list downwards (ie
   free if already inactive, move to inactive if currently active)

 - age a referenced page on a list: clear ref bit and move to beginning of
   same list.

which works fine for mmap pages too. I left the age updates, because the
page age may well make sense within the active list).
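The four transitions above can be modelled as a tiny userspace toy (list
positions, the clean/dirty split and all locking are omitted; the names
are illustrative, not the kernel's):

```c
/* Toy model of the state machine: a page lives on one of three "lists"
 * and carries a referenced bit. */
enum plist { LIST_FREE, LIST_INACTIVE, LIST_ACTIVE };

struct tpage {
	enum plist list;
	int referenced;
};

/* Touching a page: the first touch sets the bit, the second touch
 * promotes the page to the active list and clears the bit. */
static void touch(struct tpage *p)
{
	if (p->referenced && p->list != LIST_ACTIVE) {
		p->list = LIST_ACTIVE;
		p->referenced = 0;
	} else {
		p->referenced = 1;
	}
}

/* Aging a page: an unreferenced page moves one list "downwards";
 * a referenced page just loses the bit (and stays on its list). */
static void age(struct tpage *p)
{
	if (p->referenced) {
		p->referenced = 0;
		return;
	}
	if (p->list == LIST_ACTIVE)
		p->list = LIST_INACTIVE;
	else if (p->list == LIST_INACTIVE)
		p->list = LIST_FREE;
}
```

A page read exactly once never leaves the inactive list and ages straight
out to free, which is the use-once behaviour being aimed for.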

I'll make a 2.4.10pre10.

		Linus


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 22:59           ` Stephan von Krawczynski
@ 2001-09-16 22:14             ` Linus Torvalds
  2001-09-16 23:29               ` Stephan von Krawczynski
  2001-09-17 15:35             ` Stephan von Krawczynski
  1 sibling, 1 reply; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16 22:14 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Andreas Steinmetz, linux-kernel


On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:

> > [...]
> > Anybody willing to test the simple used-once cleanups? No guarantees, but
> > at least they make sense (some of the old interactions certainly do not).
>
> Very willing. Just send it to me, please.

It's there as 2.4.10pre10, on ftp.kernel.org under "testing" now.

However, note that it hasn't gotten any "tweaking", ie there's none of the
small changes that aging differences usually tend to need. I'm hoping
that's ok, as the new behaviour shouldn't be that different from the old
behaviour in most cases, and that the biggest differences _should_ be just
proper once-use things.

But it would be interesting to hear which loads show markedly worse/better
behaviour. If any.

> >  - age a referenced page on a list: clear ref bit and move to beginning of
> >    same list.
>
> Are you sure about the _beginning_? You are aging out _all_ non-ref
> pages in the next step?

Well, it depends on what your definition of "is" is..

Or rather, what the "beginning" is. The way things work now, is that all
pages are added to the "beginning", and the aging is done from the end,
moving pages at the end to other lists (or, in the case of a referenced
page, back to the beginning).

You could, of course, define the list to be done the other way around. It
won't make any actual behavioural difference, unless there are bugs due to
confusion about which end is "new" and which is "old".

Which there might well be, of course.

		Linus


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 21:28         ` Linus Torvalds
@ 2001-09-16 22:47           ` Alex Bligh - linux-kernel
  2001-09-16 22:55             ` Linus Torvalds
  2001-09-16 22:59           ` Stephan von Krawczynski
  1 sibling, 1 reply; 133+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-16 22:47 UTC (permalink / raw)
  To: Linus Torvalds, Andreas Steinmetz; +Cc: linux-kernel, Alex Bligh - linux-kernel

 



--On Sunday, 16 September, 2001 2:28 PM -0700 Linus Torvalds 
<torvalds@transmeta.com> wrote:

>  - age a non-referenced page on a list: move to "next" list downwards (ie
>    free if already inactive, move to inactive if currently active)

Do you still make the distinction between Inactive Clean
and Inactive Dirty (& just move to appropriate list)?

Effectively this is just a 'binary' aging function (OK position
on the list matters too). Others on the list have observed
page->age performs in a binary manner anyhow with exponential
aging.

How do you balance between Inactive Clean before Inactive Dirty
and avoid evicting many (infrequently used) code pages at
the expense of many (historic, even less frequently used) dirty
data pages? Or don't we care?

--
Alex Bligh

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 22:47           ` Alex Bligh - linux-kernel
@ 2001-09-16 22:55             ` Linus Torvalds
  0 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16 22:55 UTC (permalink / raw)
  To: Alex Bligh - linux-kernel; +Cc: Andreas Steinmetz, linux-kernel


On Sun, 16 Sep 2001, Alex Bligh - linux-kernel wrote:
>
> >  - age a non-referenced page on a list: move to "next" list downwards (ie
> >    free if already inactive, move to inactive if currently active)
>
> Do you still make the distinction between Inactive Clean
> and Inactive Dirty (& just move to appropriate list)?

That part of the code doesn't.

> Effectively this is just a 'binary' aging function (OK position
> on the list matters too). Others on the list have observed
> page->age performs in a binary manner anyhow with exponential
> aging.

Right. I'm not saying that this is anything _exciting_ - I'm just saying
that the old code did not have any clear behaviour at all. It would
inappropriately raise a page from the inactive lists to the active list
even if the code that actually _touched_ the page had decided that the
page was not active.

And the behaviour of the referenced bit once on the active list was
unclear. I personally think that clearing the reference bit when moving to
the active list is the right thing to do (so that it is marked
"unimportant" on the active list and needs a _third_ access to be marked
important), but this is an example of one of the tweaks/decisions we
should clearly make instead of leaving the behaviour undefined (which it
was before).

> How do you balance between Inactive Clean before Inactive Dirty
> and avoid evicting many (infrequently used) code pages at
> the expense of many (historic, even less frequently used) dirty
> data pages? Or don't we care?

We probably _do_ care. I suspect that if there are balancing problems,
they could easily be in the reclaim_page() vs page_launder() balance (ie
aging of inactive_clean vs inactive_dirty).

		Linus


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 21:28         ` Linus Torvalds
  2001-09-16 22:47           ` Alex Bligh - linux-kernel
@ 2001-09-16 22:59           ` Stephan von Krawczynski
  2001-09-16 22:14             ` Linus Torvalds
  2001-09-17 15:35             ` Stephan von Krawczynski
  1 sibling, 2 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 22:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Steinmetz, linux-kernel

> [...]
> Anybody willing to test the simple used-once cleanups? No guarantees, but
> at least they make sense (some of the old interactions certainly do not).

Very willing. Just send it to me, please.

> (The new code is a simple state machine:
>
>  - touch non-referenced page: set the reference bit
>
>  - touch already referenced page: move it to next list "upwards" (ie the
>    active list)
>
>  - age a non-referenced page on a list: move to "next" list downwards (ie
>    free if already inactive, move to inactive if currently active)
>
>  - age a referenced page on a list: clear ref bit and move to beginning of
>    same list.

Are you sure about the _beginning_? You are aging out _all_ non-ref
pages in the next step?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 22:14             ` Linus Torvalds
@ 2001-09-16 23:29               ` Stephan von Krawczynski
  0 siblings, 0 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 23:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Stephan von Krawczynski, Andreas Steinmetz, linux-kernel

>>> (The new code is a simple state machine:
>>>
>>> - touch non-referenced page: set the reference bit
>>>
>>> - touch already referenced page: move it to next list "upwards" (ie the
>>>   active list)
>>>
>>> - age a non-referenced page on a list: move to "next" list downwards (ie
>>>   free if already inactive, move to inactive if currently active)
>>>
>>> - age a referenced page on a list: clear ref bit and move to beginning of
>>>   same list.

> > Are you sure about the _beginning_? You are aging out _all_ non-ref
> > pages in the next step?
>
> Well, it depends on what your definition of "is" is..
>
> Or rather, what the "beginning" is. The way things work now, is that all
> pages are added to the "beginning", and the aging is done from the end,
> moving pages at the end to other lists (or, in the case of a referenced
> page, back to the beginning).
                                                                      
Wait a minute: if you age a page on the active list by clearing its
ref bit and moving it to the beginning of the list, and your next aging
cycle starts from the end, that reads like you have to walk the whole
list to find all the ageable pages to move "downwards". In fact I
expected your aging to walk the list from the beginning, where "most"
of the non-ref entries should be, so you could stop the age walk on
hitting the first ref page in the list.
That was my guess for how aging finds possible free pages _fast_.
Am I wrong, or is this idea generally not usable?
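The scan suggested above can be sketched in a few lines of user-space C
(an illustrative model, not kernel code; names are mine):

```c
#include <stddef.h>

/* If unreferenced pages cluster at the head of the list, a scan can
 * start there and stop at the first referenced page it meets, instead
 * of walking the whole list from the tail. */
struct node {
    int referenced;
    struct node *next;
};

/* Count ageable (unreferenced) pages reachable from the head before
 * the first referenced page terminates the walk. */
static int count_ageable_from_head(const struct node *head)
{
    int n = 0;
    for (const struct node *p = head; p && !p->referenced; p = p->next)
        n++;
    return n;
}
```

The walk is O(ageable pages found) rather than O(list length), which is
the "find free pages _fast_" property being asked about.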
                                                                      
Regards,                                                              
Stephan                                                               
                                                                      
                                                                      

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:43     ` Linus Torvalds
                         ` (2 preceding siblings ...)
  2001-09-16 20:29       ` Andreas Steinmetz
@ 2001-09-17  0:37       ` Daniel Phillips
  2001-09-17  1:07         ` Linus Torvalds
  2001-09-21  3:10       ` Bill Davidsen
  4 siblings, 1 reply; 133+ messages in thread
From: Daniel Phillips @ 2001-09-17  0:37 UTC (permalink / raw)
  To: Linus Torvalds, linux-kernel

On September 16, 2001 09:43 pm, Linus Torvalds wrote:
> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times). 

Can we confirm that the mp3 player is making subpage accesses? (strace)

The 'partially read/written' state isn't handled properly now.  The 
transition to the 'used-once' state should only occur if the transfer ends at 
the exact end of the page.  Right now it always takes place after the *first* 
transfer on the page which is correct only for full-page transfers.

It's still best to start all pages unreferenced, because otherwise we don't 
have a means of distinguishing between the first and subsequent page cache 
lookups.  The check_used_once logic should set the page referenced if the IO 
transfer ends in the interior of the page or unreferenced if it ends at the 
end of the page.

This is straightforward to fix; I'll have a tested patch by Tuesday if nobody 
beats me to it.  I don't think this is the whole problem though, it's just 
exposing a balancing problem.  Even if I did go and randomly access a huge 
file so that all cache pages have high age (the effect we're simulating by 
accident here) I still shouldn't be able to drive all my swap pages out of 
memory.

--
Daniel
  

  

  

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  0:37       ` Daniel Phillips
@ 2001-09-17  1:07         ` Linus Torvalds
  2001-09-17  2:23           ` Daniel Phillips
                             ` (2 more replies)
  0 siblings, 3 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17  1:07 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: linux-kernel


On Mon, 17 Sep 2001, Daniel Phillips wrote:
>
> Can we confirm that the mp3 player is making subpage accesses? (strace)

People claim that they do mmap's, which the old code definitely didn't
handle correctly at all.

I'm not 100% sure that the 2.4.10-pre10 aging is right for anonymous pages
either, and the page-referenced handling at COW time looks suspiciously
broken, for example. It's not something we have ever gotten right, I think
- if the old pre-C-O-W page was marked accessed, we should mark that page
referenced before we break the COW. Otherwise we'll move over to a new
page without crediting the source.

> The 'partially read/written' state isn't handled properly now.  The
> transition to the 'used-once' state should only occur if the transfer ends at
> the exact end of the page.  Right now it always takes place after the *first*
> transfer on the page which is correct only for full-page transfers.

No, it's not as easy as you make it sound.

The problem is that partial accesses are real, and they should be counted
as such - except when they are _linear_ partial accesses, in which case
they should not be counted at all except for the first one.

Having some "if transfer ends at end of page" logic would minimally get
the end-of-file case wrong, for example, never mind the case of a reader
that is seeking around in the file. The EOF case could be worked around
with yet another hack, but I suspect that the real fix is to try to fix
applications that do bad things.

> It's still best to start all pages unreferenced, because otherwise we don't
> have a means of distinguishing between the first and subsequent page cache
> lookups.  The check_used_once logic should set the page referenced if the IO
> transfer ends in the interior of the page or unreferenced if it ends at the
> end of the page.

See how 2.4.10-pre10 doesn't have any use_once hackery at all, but instead
has a clear path on references:

 prefetching: non-referenced page on inactive list
 after 1st reference: referenced page on inactive list
 after 2nd reference: non-referenced page on active list
 after 3rd and subsequent accesses: referenced page on active list

while the "age down" logic is the exact reverse of the above. Logical and
easy to implement, and gives four distinct "stages" for all pages (along
with the LRU ordering within each list, of course).
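The four stages and their reverse can be modelled as a tiny user-space
state machine (names here are illustrative, not the kernel's; list
position within each stage is not modelled):

```c
/* Two lists plus a per-page reference bit give four distinct stages. */
enum which_list { INACTIVE, ACTIVE };

struct page {
    enum which_list list;
    int referenced;   /* models PG_referenced */
    int freed;        /* page has been reclaimed */
};

/* A reference moves the page one stage "upwards". */
static void touch(struct page *p)
{
    if (!p->referenced) {
        p->referenced = 1;          /* 1st (or 3rd) reference: set ref bit */
    } else if (p->list == INACTIVE) {
        p->list = ACTIVE;           /* 2nd reference: promote, clear bit */
        p->referenced = 0;
    }                               /* referenced + active: already at top */
}

/* Aging is the exact reverse: one stage "downwards". */
static void age(struct page *p)
{
    if (p->referenced)
        p->referenced = 0;          /* clear ref bit, stay on same list */
    else if (p->list == ACTIVE)
        p->list = INACTIVE;         /* demote to inactive */
    else
        p->freed = 1;               /* inactive + unreferenced: free */
}
```

Walking a freshly prefetched page up through touch() and back down
through age() visits the same four stages in opposite order.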

Now, the above _is_ different from what we used to do. For one thing, it's
logical. But it might be different enough that the heuristics we have for
aging may need some tuning again. "Logical" is not enough..

There's also a few issues that I don't like right now wrt reference
handling, notably:

 - COW issue mentioned above. Probably trivially fixed by something like

	diff -u --recursive --new-file pre10/linux/mm/memory.c linux/mm/memory.c
	--- pre10/linux/mm/memory.c     Sun Sep 16 18:01:48 2001
	+++ linux/mm/memory.c   Sun Sep 16 18:00:59 2001
	@@ -955,6 +955,8 @@
	        if (pte_same(*page_table, pte)) {
	                if (PageReserved(old_page))
	                        ++mm->rss;
	+               if (pte_young(pte))
	+                       mark_page_accessed(old_page);
	                break_cow(vma, new_page, address, page_table);

	                /* Free the old page.. */

   which looks right (it basically saves off the referenced bit for the
   old page table entry in the physical page reference count).

 - truly anonymous pages (ie before they've been added to the swap cache)
   are not necessarily going to behave as nicely as other pages. They
   magically appear after VM scanning as a "1st reference", and I have a
   reasonably good argument that says that they'll have been aged up and
   down roughly the same number of times, which makes this more-or-less
   correct. But it's still a theoretical argument, nothing more.

   This could reasonably easily be fixed by adding these anonymous pages
   to the LRU lists anyway (with a bogus "page->mapping" which causes them
   to be re-mapped as _real_ swap cache pages when they need writeout),
   but that's a bit too subtle for my taste. If anybody wants to look into
   this, I'd love to know if it makes a difference in behaviour, though..

 - I don't like the lack of aging in 'reclaim_page()'. It will walk the
   whole LRU list if required, which kind of defeats the purpose of having
   reference bits and LRU on that list. The code _claims_ that it almost
   always succeeds with the first page, but I don't see why it would. I
   think that comment assumed that the inactive_clean list cannot have any
   referenced pages, but that's never been true.

There are probably other issues too, these are the ones I was wondering
about when I walked over the use of the PG_referenced bit..

		Linus


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  1:07         ` Linus Torvalds
@ 2001-09-17  2:23           ` Daniel Phillips
  2001-09-17  5:11           ` Jan Harkes
  2001-09-17 12:26           ` Rik van Riel
  2 siblings, 0 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-17  2:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On September 17, 2001 03:07 am, Linus Torvalds wrote:
> > The 'partially read/written' state isn't handled properly now.  The
> > transition to the 'used-once' state should only occur if the transfer ends at
> > the exact end of the page.  Right now it always takes place after the *first*
> > transfer on the page which is correct only for full-page transfers.
> 
> No, it's not as easy as you make it sound.
> 
> The problem is that partial accesses are real, and they should be counted
> as such - except when they are _linear_ partial accesses, in which case
> they should not be counted at all except for the first one.

Yes, in a really fancy VM manager we'd analyze the access patterns to get a 
reliable determination of what is serial access and what is not, and we'd do 
things like retroactively lowering the priority of pages we didn't initially 
know much about but were later able to determine were part of a serial access.

But we can pick the low-hanging fruit by just looking at where the most 
recent access lands.  This relies on the fact that most serial transfers 
procede forward.  If we get it wrong, a few pages end up referenced when they 
should not be, but so what?  Also, if a page ends up unreferenced when it 
should be then it still has a good chance to be rescued.

> Having some "if transfer ends at end of page" logic would minimally get
> the enf-of-file case wrong,

Right, the condition should be "transfer ends exactly at the lower of the end 
of page or end of file".
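That condition can be sketched as a small predicate (a hypothetical
helper, not the actual check_used_once code; names are illustrative):

```c
#define PAGE_SIZE 4096UL

/* After an IO transfer touching this page, the page should be marked
 * referenced unless the transfer ends exactly at the lower of
 * end-of-page and end-of-file -- the signature of a forward, likely
 * use-once sequential access. */
static int transfer_sets_referenced(unsigned long page_offset, /* file offset of page start */
                                    unsigned long xfer_end,    /* file offset one past the last byte */
                                    unsigned long file_size)
{
    unsigned long page_end = page_offset + PAGE_SIZE;
    unsigned long limit = page_end < file_size ? page_end : file_size;

    return xfer_end != limit;   /* interior end: a real partial access */
}
```

A reader stopping mid-page leaves the page referenced; a reader that
streams through the page (or finishes the file) leaves it unreferenced.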

> for example, never mind the case of a reader
> that is seeking around in the file. The EOF case could be worked around
> with yet another hack, but I suspect that the real fix is to try to fix
> applications that do bad things.

I'd say this one isn't a hack, it's just a matter of finishing the job.
Sorry, I should have tested this a week ago but I got a little
distracted, if you know what I mean.

Seeking around in a file will be handled OK.  We'll tend to drop those 
pages that are fully accessed but not reaccessed soon and retain pages that
are partially accessed.  So long as the aging mechanism doesn't drop the
ball, such pages will just live a little longer in cache, not forever.

What we're missing is a way for the swap cache to 'push back' at the page
cache.  Right now, the little bit of extra pressure I accidentally created
by ignoring the subpage transfers is pushing all anonymous pages out of
memory.  That's way too fragile.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  1:07         ` Linus Torvalds
  2001-09-17  2:23           ` Daniel Phillips
@ 2001-09-17  5:11           ` Jan Harkes
  2001-09-17 12:33             ` Daniel Phillips
  2001-09-17 15:38             ` Linus Torvalds
  2001-09-17 12:26           ` Rik van Riel
  2 siblings, 2 replies; 133+ messages in thread
From: Jan Harkes @ 2001-09-17  5:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

On Sun, Sep 16, 2001 at 06:07:34PM -0700, Linus Torvalds wrote:
> See how 2.4.10-pre10 doesn't have any use_once hackery at all, but instead
> has a clear path on references:
> 
>  prefetching: non-referenced page on inactive list
>  after 1st reference: refrenced page on inactive list
>  after 2nd reference: non-referenced page on active list
>  after 3rd and subsequent accesses: referenced page on active list

So it ends up using a 'used_thrice' hack. Yeah, that does solve some of
the used once problems ;)

>  - COW issue mentioned above. Probably trivially fixed by something like

The COW is triggered by a pagefault, so the page will be accessed and
the hardware bits (both accessed and dirty) should get set automatically.

>  - truly anonymous pages (ie before they've been added to the swap cache)
>    are not necessarily going to behave as nicely as other pages. They

I just found a simple example that none of the 2.4.x kernels really like
that much. Create a program that malloc's the available free memory
minus 5-10MB, memset's it to page the memory in as anonymous pages and
then goes to sleep. Then run something like a kernel compile. Even if
there is enough memory left to catch the allocation spikes and avoid
swapping, the system will be heavily paging with the small amount of
"aged memory" that is left over to work with.
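The core of the test program described above is just this (a sketch in
my own words, not Jan's actual code):

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a buffer and touch every byte so all of its pages are
 * faulted in as resident anonymous memory. The caller sizes it at
 * MemFree minus 5-10MB, then sleeps while a kernel compile runs. */
static char *touch_anon(size_t bytes)
{
    char *buf = malloc(bytes);

    if (!buf)
        return NULL;
    memset(buf, 1, bytes);  /* fault every page in as an anonymous page */
    return buf;
}
```

The full experiment then simply holds the buffer and sleeps (e.g. with
pause()) while the compile generates page-cache pressure alongside the
pinned anonymous pages.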

>    but that's a bit too subtle for my taste. If anybody wants to look into
>    this, I'd love to know if it makes a difference in behaviour, though..

pre10 right after booting,
    MemTotal:       127104 kB
    MemFree:         41844 kB
    Active:          11632 kB
    Inact_dirty:     19148 kB
    Inact_clean:         0 kB
    Inact_target:     1004 kB

pre9 with Rik's reverse mapping & delayed swap allocation and my local hacks,
    MemTotal:       126976 kB
    MemFree:         41244 kB
    Active:          80032 kB
    Inact_dirty:         0 kB
    Inact_clean:         0 kB
    Inact_target:      984 kB

Inactive target is interesting, because it is directly related to the
amount of memory pressure we've seen (memory_pressure >> 6). Also as
we're still far from running low on free memory, nothing was pushed into
the inactive lists (yes, there is no used_once, or used_thrice stuff at
all). While pre10 has about 50 MB that is 'lost' to anonymous pages
which don't get aged until we start swapping things out.

Differences are definitely noticeable, but I'm almost sure that is
mostly related to the fact that we have all potentially pageable or
swappable memory on the lists.

>  - I don't like the lack of aging in 'reclaim_page()'. It will walk the
>    whole LRU list if required, which kind of defeats the purpose of having
>    reference bits and LRU on that list. The code _claims_ that it almost
>    always succeeds with the first page, but I don't see why it would. I
>    think that comment assumed that the inactive_clean list cannot have any
>    referenced pages, but that's never been true.

As far as I can understand the _original_ design on which the current VM
is based, aging only occurs to pages on the active 'ring', the inactive
lists are basically LRU-ordered victim caches. Pages are unmapped before
they go to the inactive_dirty list and buffers are flushed before they
can go to inactive_clean.

Of course both the used_once changes and -pre10 sort of flushed these
designs down the toilet by putting mapped pages on the inactive_dirty
list and turning the active list into an LRU.

Jan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33   ` Rik van Riel
                       ` (3 preceding siblings ...)
  2001-09-16 19:43     ` Linus Torvalds
@ 2001-09-17  8:06     ` Eric W. Biederman
  2001-09-17 12:12       ` Rik van Riel
  4 siblings, 1 reply; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-17  8:06 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 16 Sep 2001, Michael Rothwell wrote:
> 
> > Is there a way to tell the VM to prune its cache? Or a way to limit
> > the amount of cache it uses?
> 
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.

Hmm.  Perhaps, or perhaps it is just using the information poorly.
There is an alternative approach that gives better aging information.

An address_space can be allocated per mm_struct, and all of the
anonymous pages can be allocated to that address_space.  The
address_space can then have an array, or better a tree, of extents that
list which indexes correspond to which swap pages, with some
pages not being backed.

Getting the allocation of indices correct so that merging will work
is a little trickier than now, as is the case of a private writeable
mapping of a file.  But in a lot of other ways the logic becomes
simpler.
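A minimal sketch of that structure, under the stated assumptions (all
names are hypothetical; a real version would use a tree rather than a
flat array, and slot 0 here means "not swap-backed"):

```c
/* Per-mm anonymous address_space: a sorted set of extents mapping
 * page indexes to backing swap slots. Indexes covered by no extent
 * are anonymous pages with no swap backing yet. */
struct anon_extent {
    unsigned long index;    /* first page index covered */
    unsigned long count;    /* number of pages in the extent */
    unsigned long swp_slot; /* first backing swap slot */
};

struct anon_space {
    struct anon_extent ext[16]; /* toy fixed array; really a tree */
    int nr;
};

/* Return the swap slot backing page `index`, or 0 if unbacked. */
static unsigned long anon_lookup(const struct anon_space *as,
                                 unsigned long index)
{
    for (int i = 0; i < as->nr; i++) {
        const struct anon_extent *e = &as->ext[i];

        if (index >= e->index && index < e->index + e->count)
            return e->swp_slot + (index - e->index);
    }
    return 0;
}
```

Contiguous indexes map to contiguous swap slots within an extent, which
is what makes merging and clustered swapout natural in this scheme.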
 
> For 2.5 I'm making a VM subsystem with reverse mappings, the
> first iterations are giving very sweet performance so I will
> continue with this project regardless of what other kernel
> hackers might say ;)

Do you have any arguments for the reverse mappings or just for some of
the other side effects that go along with them?

Eric

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  8:06     ` Eric W. Biederman
@ 2001-09-17 12:12       ` Rik van Riel
  2001-09-17 15:45         ` Eric W. Biederman
  0 siblings, 1 reply; 133+ messages in thread
From: Rik van Riel @ 2001-09-17 12:12 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

On 17 Sep 2001, Eric W. Biederman wrote:

> There is an alternative approach to have better aging information.

[snip incomplete description of data structure]

What you didn't explain is how your idea is related to
aging.

> > For 2.5 I'm making a VM subsystem with reverse mappings, the
> > first iterations are giving very sweet performance so I will
> > continue with this project regardless of what other kernel
> > hackers might say ;)
>
> Do you have any arguments for the reverse mappings or just for some of
> the other side effects that go along with them?

Mainly for the side effects, but until somebody comes
up with another idea to achieve all the side effects I'm
not giving up on reverse mappings. If you can achieve
all the good stuff in another way, show it.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  1:07         ` Linus Torvalds
  2001-09-17  2:23           ` Daniel Phillips
  2001-09-17  5:11           ` Jan Harkes
@ 2001-09-17 12:26           ` Rik van Riel
  2001-09-17 15:42             ` Linus Torvalds
  2001-09-17 17:33             ` Linus Torvalds
  2 siblings, 2 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-17 12:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

On Sun, 16 Sep 2001, Linus Torvalds wrote:

>  - truly anonymous pages (ie before they've been added to the swap cache)
>    are not necessarily going to behave as nicely as other pages. They
>    magically appear after VM scanning as a "1st reference", and I have a
>    reasonably good argument that says that they'll have been aged up and
>    down roughly the same number of times, which makes this more-or-less
>    correct. But it's still a theoretical argument, nothing more.

This nicely points out the problem with page aging which Linux
has always had. Pages which are referenced all the time by the
processes using them STILL get aged down all the time.

I suspect that the biggest impact the reverse mapping patch
has right now seems to be caused by fixing this behaviour and
just aging up a page when it is referenced and down when it is
not.

>  - I don't like the lack of aging in 'reclaim_page()'. It will walk the
>    whole LRU list if required, which kind of defeats the purpose of having
>    reference bits and LRU on that list. The code _claims_ that it almost
>    always succeeds with the first page, but I don't see why it would. I
>    think that comment assumed that the inactive_clean list cannot have any
>    referenced pages, but that's never been true.

This depends on whether we do reactivation in __find_page_nolock()
or if we leave the page alone and wait for kswapd to do that for
us.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  5:11           ` Jan Harkes
@ 2001-09-17 12:33             ` Daniel Phillips
  2001-09-17 12:41               ` Rik van Riel
  2001-09-17 16:14               ` Jan Harkes
  2001-09-17 15:38             ` Linus Torvalds
  1 sibling, 2 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-17 12:33 UTC (permalink / raw)
  To: Jan Harkes, Linus Torvalds; +Cc: linux-kernel

On September 17, 2001 07:11 am, Jan Harkes wrote:
> As far as I can understand the _original_ design on which the current VM
> is based, aging only occurs to pages on the active 'ring', the inactive
> lists are basically LRU-ordered victim caches. Pages are unmapped before
> they go to the inactive_dirty list and buffers are flushed before they
> can go to inactive_clean.
>
> Of course both the used_once changes and -pre10 sort of flushed these
> designs down the toilet by putting mapped pages on the inactive_dirty
> list and turning the active list into an LRU.

The active list is *supposed* to approximate an LRU.  The inactive lists
are not LRUs but queues, and have always been.

The inactive queues have always had both mapped and unmapped pages on
them. The reason for unmapping a swap cache page when putting it
on the inactive queue is to give it some time to be rescued, since we
otherwise have no information about its short-term activity because
we have no way of accessing the hardware dirty bit given the physical
page on the lru.  A second reason for unmapping it is, we don't have
any choice.  The point where we place it on the inactive queue is the
last point where we're able to find its userspace page table entry.

<paid advertisement>
We'd be able to avoid unmapping swap cache pages with Rik's rmap
patch because we can easily check the hardware referenced bit before
finally evicting the page.  Plus, and I hope I'm interpreting this
correctly, we can allocate the swap slot and perform swap clustering
at that time, greatly simplifying the swapout code.
</paid advertisement> ;-)

Drifting a little further offtopic.  As far as I can tell, there's no 
fundamental reason why we cannot make the current strategy work as 
well as Rik's rmaps probably will, with some more blood, sweat and
code study.  On the other hand, Matt Dillon, the reigning champion of
virtual memory management, was quite firm in stating that we should
drop the current virtually scanning strategy in favor of 100%
physical scanning as BSD uses, relying on reverse mapping.

   http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
   (Matt Dillon holds forth on the design of BSD's memory manager)

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:33             ` Daniel Phillips
@ 2001-09-17 12:41               ` Rik van Riel
  2001-09-17 14:49                 ` Daniel Phillips
  2001-09-17 16:14               ` Jan Harkes
  1 sibling, 1 reply; 133+ messages in thread
From: Rik van Riel @ 2001-09-17 12:41 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Jan Harkes, Linus Torvalds, linux-kernel

On Mon, 17 Sep 2001, Daniel Phillips wrote:

> Drifting a little further offtopic.  As far as I can tell, there's no
> fundamental reason why we cannot make the current strategy work as
> well as Rik's rmaps probably will, with some more blood, sweat and
> code study.

I don't see any possibility to get that to work without
reverse mapping. Of course, that could be me overlooking
some possibility, but I'm not holding my breath waiting
for somebody to invent this other possibility.

> On the other hand, Matt Dillon, the reigning champion of
> virtual memory management, was quite firm in stating that we should
> drop the current virtually scanning strategy in favor of 100%
> physical scanning as BSD uses, relying on reverse mapping.
>
>    http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
>    (Matt Dillon holds forth on the design of BSD's memory manager)

His claims are backed up by FreeBSD's VM performance,
so I'm inclined to believe them. If you think you can
come up with something better, I'll believe you when
you show it...

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:41               ` Rik van Riel
@ 2001-09-17 14:49                 ` Daniel Phillips
  0 siblings, 0 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-17 14:49 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Jan Harkes, Linus Torvalds, linux-kernel

On September 17, 2001 02:41 pm, Rik van Riel wrote:
> > On the other hand, Matt Dillon, the reigning champion of
> > virtual memory managment, was quite firm in stating that we should
> > drop the current virtually scanning strategy in favor of 100%
> > physical scanning as BSD uses, relying on reverse mapping.
> >
> >    http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
> >    (Matt Dillon holds forth on the design of BSD's memory manager)
> 
> His claims are backed up by FreeBSD's VM performance,
> so I'm inclined to believe them. If you think you can
> come up with something better, I'll believe you when
> you show it...

Rik, read the post, I'm supporting you.  Please don't be so paranoid ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 22:59           ` Stephan von Krawczynski
  2001-09-16 22:14             ` Linus Torvalds
@ 2001-09-17 15:35             ` Stephan von Krawczynski
  2001-09-17 15:51               ` Linus Torvalds
  2001-09-17 16:34               ` Stephan von Krawczynski
  1 sibling, 2 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-17 15:35 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: ast, linux-kernel

On Sun, 16 Sep 2001 15:14:22 -0700 (PDT) Linus Torvalds
<torvalds@transmeta.com> wrote:

> > Very willing. Just send it to me, please.
> 
> It's there as 2.4.10pre10, on ftp.kernel.org under "testing" now.
> 
> However, note that it hasn't gotten any "tweaking", ie there's none of the
> small changes that aging differences usually tend to need. I'm hoping
> that's ok, as the new behaviour shouldn't be that different from the old
> behaviour in most cases, and that the biggest differences _should_ be just
> proper once-use things.
> 
> But it would be interesting to hear which loads show markedly worse/better
> behaviour. If any.

Hello,

I tried my usual test setup today with 2.4.10-pre10 and experienced the
following:

- cpu load goes pretty high (11-12 according to xosview) during several
occasions, up to the point where you cannot even move the mouse. Compared to a
once-tested ac-version it is not _that_ nice. I have some problems cat'ing
/proc/meminfo, too. It sometimes takes pretty long (minutes).

- the meminfo shows me a great difference to former versions in the balancing of
inact_dirty and active. This pre10 tends to have a _lot_ more inact_dirty pages
than active (compared to pre9 and before) in my test. I guess this is intended
by the (used-once) patch. So take this as a hint that your work performs as
expected.

- of course the alloc problems themselves stayed the same.

Regards,
Stephan



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17  5:11           ` Jan Harkes
  2001-09-17 12:33             ` Daniel Phillips
@ 2001-09-17 15:38             ` Linus Torvalds
  1 sibling, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 15:38 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Daniel Phillips, linux-kernel


On Mon, 17 Sep 2001, Jan Harkes wrote:
>
> >  - COW issue mentioned above. Probably trivially fixed by something like
>
> The COW is triggered by a pagefault, so the page will be accessed and
> the hardware bits (both accessed and dirty) should get set automatically.

No.

The point is that yes, the bits are set in the _page_table_, but we've
never set them on the physical page.

And the COW fault will switch the page table entry to a new page, so if we
don't set the referenced bit on the physical page at that time, we _never_
will.

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:26           ` Rik van Riel
@ 2001-09-17 15:42             ` Linus Torvalds
  2001-09-18 12:04               ` Rik van Riel
  2001-09-17 17:33             ` Linus Torvalds
  1 sibling, 1 reply; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 15:42 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel


On Mon, 17 Sep 2001, Rik van Riel wrote:
>
> >  - I don't like the lack of aging in 'reclaim_page()'. It will walk the
> >    whole LRU list if required, which kind of defeats the purpose of having
> >    reference bits and LRU on that list. The code _claims_ that it almost
> >    always succeeds with the first page, but I don't see why it would. I
> >    think that comment assumed that the inactive_clean list cannot have any
> >    referenced pages, but that's never been true.
>
> This depends on whether we do reactivation in __find_page_nolock()
> or if we leave the page alone and wait for kswapd to do that for
> us.

We should not do _anything_ in __find_page_nolock().

It's positively wrong to touch any aging information there - if you do,
you are guaranteed to not get read-ahead right (ie a page that gets
read-ahead first will behave differently than a page that got read
directly, which just cannot be right).

The aging has to be done at a higher level (ie when you actually _use_
it, not when you search the hash queues).

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:12       ` Rik van Riel
@ 2001-09-17 15:45         ` Eric W. Biederman
  0 siblings, 0 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-17 15:45 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 17 Sep 2001, Eric W. Biederman wrote:
> 
> > There is an alternative approach to have better aging information.
> 
> [snip incomplete description of data structure]
> 
> What you didn't explain is how your idea is related to
> aging.

Sorry, I thought you had been staring at the problem long enough to
see.  In any case the problem with the current code is that you can't
put all pages in the swap cache immediately, because you don't want to
allocate the swap space just yet.  And without being in the swap cache,
aging isn't especially effective.

By using something like a shared memory segment behind every anonymous
page, you can put the page in the swap cache before you allocate swap
for it (because it has a persistent identity).  Further, since you no
longer need counts for every swap page, you can deallocate swap space
from pages simply by walking through the ``indirect pages'' and
removing the reference to swap space.

> > > For 2.5 I'm making a VM subsystem with reverse mappings, the
> > > first iterations are giving very sweet performance so I will
> > > continue with this project regardless of what other kernel
> > > hackers might say ;)
> >
> > Do you have any arguments for the reverse mappings or just for some of
> > the other side effects that go along with them?
> 
> Mainly for the side effects, but until somebody comes
> up with another idea to achieve all the side effects I'm
> not giving up on reverse mappings. If you can achieve
> all the good stuff in another way, show it.

I think I can; I haven't had time to implement it.  Given the way Alan
and some of the others were talking, I thought my idea had long ago been
thought of and put on the plate for 2.5.  If it really is a new idea
under the sun I'll look at implementing it as soon as I have a hole
in my schedule.

Eric


* Re: broken VM in 2.4.10-pre9
  2001-09-17 15:35             ` Stephan von Krawczynski
@ 2001-09-17 15:51               ` Linus Torvalds
  2001-09-17 16:34               ` Stephan von Krawczynski
  1 sibling, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 15:51 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: ast, linux-kernel


On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
>
> - CPU load goes pretty high (11-12 according to xosview) during several
> occasions, up to the point where you cannot even move the mouse. Compared to
> a once-tested ac-version it is not _that_ nice. I have some problems cat'ing
> /proc/meminfo, too. It sometimes takes pretty long (minutes).

It's not really CPU load - the load average in Linux (and some other UNIXes
too) also accounts for disk wait.

> - the meminfo shows a great difference from former versions in the balancing
> of inact_dirty and active. This pre10 tends to have a _lot_ more inact_dirty
> pages than active (compared to pre9 and before) in my test. I guess this is
> intended by the (use-once) patch, so take it as a hint that your work
> performs as expected.

No, I think they are related, and bad. I suspect it just means that pages
really do not get elevated to the active list, and it's probably _too_
unwilling to activate pages. That's bad too - it means that the inactive
list is the one solely responsible for working set changes, and the VM
won't bother with any other pages. Which also leads to bad results..

That's always the downside with having multiple lists of any kind - if the
balance between the lists is bad, performance will be bad. Historically,
the active list was the big one, and the other ones mostly didn't matter,
which made the balancing issue much less noticeable.

[ This is also the very same problem we used to have with buffer cache
  pages vs mapped pages vs other caches ]

The fix may be to just make the inactive lists not do aging at all.

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:33             ` Daniel Phillips
  2001-09-17 12:41               ` Rik van Riel
@ 2001-09-17 16:14               ` Jan Harkes
  2001-09-17 16:34                 ` Linus Torvalds
  1 sibling, 1 reply; 133+ messages in thread
From: Jan Harkes @ 2001-09-17 16:14 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

On Mon, Sep 17, 2001 at 02:33:12PM +0200, Daniel Phillips wrote:
> The inactive queues have always had both mapped and unmapped pages on
> them. The reason for unmapping a swap cache page when putting it

So the following code in refill_inactive_scan only exists in my
imagination?

	    if (page_count(page) <= (page->buffers ? 2 : 1)) {
		    deactivate_page_nolock(page);
		    page_active = 0;
	    } else {
		    page_active = 1;
	    }

We only move pages to the inactive list when they have one reference
from the page cache and one from buffers. Since all mapped pte's also
keep a reference, this means that there cannot be any pte's that point
to this page by the time we decide to deactivate the page.

> any choice.  The point where we place it on the inactive queue is the
> last point where we're able to find its userspace page table entry.

And that is because we only move it after all pte's have been unmapped.

Jan



* Re: broken VM in 2.4.10-pre9
  2001-09-17 15:35             ` Stephan von Krawczynski
  2001-09-17 15:51               ` Linus Torvalds
@ 2001-09-17 16:34               ` Stephan von Krawczynski
  2001-09-17 16:46                 ` Linus Torvalds
  2001-09-17 17:20                 ` Stephan von Krawczynski
  1 sibling, 2 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-17 16:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, ast

On Mon, 17 Sep 2001 08:51:54 -0700 (PDT) Linus Torvalds
<torvalds@transmeta.com> wrote:

> 
> On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
> >
> > - CPU load goes pretty high (11-12 according to xosview) during several
> > occasions, up to the point where you cannot even move the mouse. Compared
> > to a once-tested ac-version it is not _that_ nice. I have some problems
> > cat'ing /proc/meminfo, too. It sometimes takes pretty long (minutes).
> 
> It's not really CPU load - the load average in Linux (and some other UNIXes
> too) also accounts for disk wait.

Well, what I meant was: compared to the _same_ situation and test bed, the load
seems "pretty high". ac versions are somewhat lower in this setup.

> > - the meminfo shows a great difference from former versions in the
> > balancing of inact_dirty and active. This pre10 tends to have a _lot_ more
> > inact_dirty pages than active (compared to pre9 and before) in my test. I
> > guess this is intended by the (use-once) patch, so take it as a hint that
> > your work performs as expected.
> 
> No, I think they are related, and bad. I suspect it just means that pages
> really do not get elevated to the active list, and it's probably _too_
> unwilling to activate pages. That's bad too - it means that the inactive
> list is the one solely responsible for working set changes, and the VM
> won't bother with any other pages. Which also leads to bad results..

Hm, remember my setup: I read a lot from CD, write it to disk, and read a lot
from nfs and write it to disk. Basically both are read-once/write-once setups,
so the pages are touched once (or at worst twice), so I see a good chance none
of them ever make it to the active list, according to your state explanation
from previous posts. And that's what I see (I guess). If I do a CD compare
(read disk, read CD and compare) I see lots of pages walk over to active. And
that again matches what you described before. I think it does work as you
said.
Anyway I cannot "feel" a difference in performance (maybe even worse than
before), but it _looks_ cleaner. How about taking it as a first step in the
cleanup direction? :-)

Regards, Stephan



* Re: broken VM in 2.4.10-pre9
  2001-09-17 16:14               ` Jan Harkes
@ 2001-09-17 16:34                 ` Linus Torvalds
  0 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 16:34 UTC (permalink / raw)
  To: Jan Harkes; +Cc: Daniel Phillips, linux-kernel


On Mon, 17 Sep 2001, Jan Harkes wrote:

> On Mon, Sep 17, 2001 at 02:33:12PM +0200, Daniel Phillips wrote:
> > The inactive queues have always had both mapped and unmapped pages on
> > them. The reason for unmapping a swap cache page when putting it
>
> So the following code in refill_inactive_scan only exists in my
> imagination?
>
> 	    if (page_count(page) <= (page->buffers ? 2 : 1)) {
> 		    deactivate_page_nolock(page);

No, but I agree with Daniel that it's wrong.

The reason it exists there is because the current inactive_clean list
scanning doesn't have any pressure into VM scanning, so if we'd let mapped
pages on the inactive queue, then reclaim_page() would be unhappy about
them.

That can be solved several ways:
 - like we do now. Hackish and wrong, but kind-of-works.
 - make reclaim_page() have the ability to do vm scanning pressure (ie if
   it starts noticing that there are too many mapped pages on the reclaim
   list, it should cause VM scan)
 - physical maps

Actually, now that I look at it, the lack of de-activation actually hurts
page_launder() - which doesn't get to launder pages that are still mapped
(even though getting rid of buffers from them would almost certainly be
good under memory pressure).

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 16:34               ` Stephan von Krawczynski
@ 2001-09-17 16:46                 ` Linus Torvalds
  2001-09-17 17:20                 ` Stephan von Krawczynski
  1 sibling, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 16:46 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, ast


On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
> >
> > No, I think they are related, and bad. I suspect it just means that pages
> > really do not get elevated to the active list, and it's probably _too_
> > unwilling to activate pages. That's bad too - it means that the inactive
> > list is the one solely responsible for working set changes, and the VM
> > won't bother with any other pages. Which also leads to bad results..
>
> Hm, remember my setup: I read a lot from CD, write it to disk, and read a lot
> from nfs and write it to disk. Basically both are read-once/write-once
> setups, so the pages are touched once (or at worst twice), so I see a good
> chance none of them ever make it to the active list, according to your
> state explanation from previous posts.

Right. That part is fine.

The problematic part is that I suspect that _because_ there's a lot of
inactive pages, the VM layer won't even try to age the active ones.
Which will result in the inactive pages being re-circulated reasonably
quickly..

Hmm. Although maybe that's the right behaviour, considering that you don't
actually _want_ to cache them. It leaves your _truly_ active set
untouched.

> Anyway I cannot "feel" a difference in performance (maybe even worse than
> before), but it _looks_ cleaner. How about taking it as a first step in the
> cleanup direction? :-)

"Looks cleaner" is very important for me for maintenance reasons - having
behaviour that you cannot explain tends to result in more and more ad-hoc
hacks over time, and it just tends to get worse and worse.

However, at the same time I'd really like to hear about improved
behaviour, not just "feels the same". And certainly not "(maybe even
worse.."

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 16:34               ` Stephan von Krawczynski
  2001-09-17 16:46                 ` Linus Torvalds
@ 2001-09-17 17:20                 ` Stephan von Krawczynski
  2001-09-17 17:37                   ` Linus Torvalds
  1 sibling, 1 reply; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-17 17:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel, ast

On Mon, 17 Sep 2001 09:46:28 -0700 (PDT) Linus Torvalds
<torvalds@transmeta.com> wrote:

> "Looks cleaner" is very important for me for maintenance reasons - having
> behaviour that you cannot explain tends to result in more and more ad-hoc
> hacks over time, and it just tends to get worse and worse.

Agreed.

> However, at the same time I'd really like to hear about improved
> behaviour, not just "feels the same". And certainly not "(maybe even
> worse.."

Hm, sorry for that. But that's what I see. Maybe the problem is now in a
different area.

> The problematic part is that I suspect that _because_ there's a lot of
> inactive pages, the VM layer won't even try to age the active ones.
> Which will result in the inactive pages being re-circulated reasonably
> quickly..

Do you think this re-circulation is _fast_ in current code? Maybe performance
loss comes from this point?

BTW: I tried Andrea's brand new patch and have to admit it has a _big_
performance gain, though I understand you dislike the design very much. 

Regards,
Stephan


* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:26           ` Rik van Riel
  2001-09-17 15:42             ` Linus Torvalds
@ 2001-09-17 17:33             ` Linus Torvalds
  2001-09-17 18:07               ` Linus Torvalds
  2001-09-18 12:09               ` Rik van Riel
  1 sibling, 2 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 17:33 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel


On Mon, 17 Sep 2001, Rik van Riel wrote:
>
> >  - truly anonymous pages (ie before they've been added to the swap cache)
> >    are not necessarily going to behave as nicely as other pages. They
> >    magically appear after VM scanning as a "1st reference", and I have a
> >    reasonably good argument that says that they'll have been aged up and
> >    down roughly the same number of times, which makes this more-or-less
> >    correct. But it's still a theoretical argument, nothing more.
>
> This nicely points out the problem with page aging which Linux
> has always had. Pages which are referenced all the time by the
> processes using them STILL get aged down all the time.
>
> I suspect that the biggest impact the reverse mapping patch
> has right now seems to be caused by fixing this behaviour and
> just aging up a page when it is referenced and down when it is
> not.

Well, here's a 10-line patch to make the anonymous pages get on the LRU
queues, and thus get aged along with all the others.

NOTE NOTE NOTE! This is _literally_ a 15-minute hack, and I expect that
there are paths where I forget to remove the page from the LRU queue
(which should result in a nice big oops in __free_pages_ok()).

Also, I didn't look into shm handling - it _looks_ like shm will remove
the page from the LRU list and re-insert it, which will lose all list
information, of course.

But the point being that keeping anonymous pages on the LRU list shouldn't
be all that hard. Even if I missed something on this first try.

		Linus

------
diff -u --recursive --new-file penguin/linux/mm/filemap.c linux/mm/filemap.c
--- penguin/linux/mm/filemap.c	Mon Sep 17 09:22:57 2001
+++ linux/mm/filemap.c	Mon Sep 17 09:15:45 2001
@@ -489,7 +489,6 @@
 	page->index = index;
 	add_page_to_inode_queue(mapping, page);
 	add_page_to_hash_queue(page, page_hash(mapping, index));
-	lru_cache_add(page);
 	spin_unlock(&pagecache_lock);
 }

diff -u --recursive --new-file penguin/linux/mm/memory.c linux/mm/memory.c
--- penguin/linux/mm/memory.c	Mon Sep 17 09:23:55 2001
+++ linux/mm/memory.c	Mon Sep 17 10:15:57 2001
@@ -958,6 +958,7 @@
 		if (pte_young(pte))
 			mark_page_accessed(old_page);
 		break_cow(vma, new_page, address, page_table);
+		lru_cache_add(new_page);

 		/* Free the old page.. */
 		new_page = old_page;
@@ -1198,6 +1199,7 @@
 		mm->rss++;
 		flush_page_to_ram(page);
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+		lru_cache_add(page);
 	}

 	set_pte(page_table, entry);
diff -u --recursive --new-file penguin/linux/mm/shmem.c linux/mm/shmem.c
--- penguin/linux/mm/shmem.c	Mon Sep 17 09:22:57 2001
+++ linux/mm/shmem.c	Mon Sep 17 09:17:12 2001
@@ -356,6 +356,7 @@
 		flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_referenced) | (1 << PG_arch_1));
 		page->flags = flags | (1 << PG_dirty);
 		add_to_page_cache_locked(page, mapping, idx);
+		lru_cache_add(page);
 		info->swapped--;
 		spin_unlock (&info->lock);
 	} else {
diff -u --recursive --new-file penguin/linux/mm/swap.c linux/mm/swap.c
--- penguin/linux/mm/swap.c	Wed Aug  8 15:17:26 2001
+++ linux/mm/swap.c	Mon Sep 17 09:50:33 2001
@@ -153,8 +153,6 @@
 void lru_cache_add(struct page * page)
 {
 	spin_lock(&pagemap_lru_lock);
-	if (!PageLocked(page))
-		BUG();
 	add_page_to_inactive_dirty_list(page);
 	page->age = 0;
 	spin_unlock(&pagemap_lru_lock);
@@ -176,7 +174,7 @@
 	} else if (PageInactiveClean(page)) {
 		del_page_from_inactive_clean_list(page);
 	} else {
-		printk("VM: __lru_cache_del, found unknown page ?!\n");
+//		printk("VM: __lru_cache_del, found unknown page ?!\n");
 	}
 	DEBUG_ADD_PAGE
 }
@@ -187,8 +185,6 @@
  */
 void lru_cache_del(struct page * page)
 {
-	if (!PageLocked(page))
-		BUG();
 	spin_lock(&pagemap_lru_lock);
 	__lru_cache_del(page);
 	spin_unlock(&pagemap_lru_lock);
diff -u --recursive --new-file penguin/linux/mm/swap_state.c linux/mm/swap_state.c
--- penguin/linux/mm/swap_state.c	Mon Sep 17 09:22:57 2001
+++ linux/mm/swap_state.c	Mon Sep 17 09:42:20 2001
@@ -147,6 +147,10 @@
  */
 void free_page_and_swap_cache(struct page *page)
 {
+	if (page_count(page) == 1 && !page->mapping) {
+		lru_cache_del(page);
+	}
+
 	/*
 	 * If we are the only user, then try to free up the swap cache.
 	 *



* Re: broken VM in 2.4.10-pre9
  2001-09-17 17:20                 ` Stephan von Krawczynski
@ 2001-09-17 17:37                   ` Linus Torvalds
  0 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 17:37 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel, ast


On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
>
> > However, at the same time I'd really like to hear about improved
> > behaviour, not just "feels the same". And certainly not "(maybe even
> > worse.."
>
> Hm, sorry for that. But that's what I see. Maybe the problem is now in a
> different area.

Heh. I wasn't blaming you. The code obviously leaves something to be
desired, still.

> BTW: I tried Andrea's brand new patch and have to admit it has a _big_
> performance gain, though I understand you dislike the design very much.

I only dislike one aspect of it, not the whole patch. Andrea has spent a
lot of time doing tuning, which is hugely important for real-world
workloads.  I also suspect from previous patches that he increases
read-ahead aggressively etc.

I'll take a look,

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 17:33             ` Linus Torvalds
@ 2001-09-17 18:07               ` Linus Torvalds
  2001-09-18 12:09               ` Rik van Riel
  1 sibling, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-17 18:07 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel


On Mon, 17 Sep 2001, Linus Torvalds wrote:
>
> NOTE NOTE NOTE! This is _literally_ a 15-minute hack, and I expect that
> there are paths where I forget to remove the page from the LRU queue
> (which should result in a nice big oops in __free_pages_ok()).

Actually, the most common failure mode seems to be that we have plenty of
inactive pages (all the anonymous pages that we added to the LRU list and
thus to the statistics). And because we have tons of these pages, the VM
scanning is never even started, because do_try_to_free_pages() thinks
that it can just launder them.

Which means that we'll never get rid of them. Oops.

So it's easy adding anonymous pages to the LRU lists per se, but it
obviously needs some more work to make the scanners be aware of the fact
that they are there...

(I suspect that the easiest way to make them be aware of the anonymous
pages is to have a bogus address space associated with the anonymous
pages, with no actual hashing going on. And then make that address space
have a "writepage()" function that turns an anonymous pages into a swap
cache page. But I was hoping to get off more easily ;).

		Linus



* Re: broken VM in 2.4.10-pre9
  2001-09-17 15:42             ` Linus Torvalds
@ 2001-09-18 12:04               ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-18 12:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

On Mon, 17 Sep 2001, Linus Torvalds wrote:

> We should not do _anything_ in __find_page_nolock().

> The aging has to be done at a higher level (ie when you actually _use_
> it, not when you search the hash queues).

Absolutely agreed. In fact, I already did this last week
in the -still not published- new version of the reverse
mapping patch ;)

(now if I only could get that thing SMP safe in an efficient
way)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: broken VM in 2.4.10-pre9
  2001-09-17 17:33             ` Linus Torvalds
  2001-09-17 18:07               ` Linus Torvalds
@ 2001-09-18 12:09               ` Rik van Riel
  1 sibling, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-18 12:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

On Mon, 17 Sep 2001, Linus Torvalds wrote:

> Well, here's a 10-line patch to make the anonymous pages get on the LRU
> queues, and thus get aged along with all the others.

The problem is that they will still get aged DOWN all
the time, even if they are accessed continuously by
the process which owns the page....

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:43     ` Linus Torvalds
                         ` (3 preceding siblings ...)
  2001-09-17  0:37       ` Daniel Phillips
@ 2001-09-21  3:10       ` Bill Davidsen
  4 siblings, 0 replies; 133+ messages in thread
From: Bill Davidsen @ 2001-09-21  3:10 UTC (permalink / raw)
  To: linux-kernel

On Sun, 16 Sep 2001, Linus Torvalds wrote:

> In article <Pine.LNX.4.33L.0109161330000.9536-100000@imladris.rielhome.conectiva>,
> Rik van Riel  <riel@conectiva.com.br> wrote:
> >On 16 Sep 2001, Michael Rothwell wrote:
> >
> >> Is there a way to tell the VM to prune its cache? Or a way to limit
> >> the amount of cache it uses?
> >
> >Not yet, I'll make a quick hack for this when I get back next
> >week. It's pretty obvious now that the 2.4 kernel cannot get
> >enough information to select the right pages to evict from
> >memory.
> 
> Don't be stupid.
> 
> The described behaviour has nothing to do with limiting the cache or
> anything else "cannot get enough information", except for the fact that
> the kernel obviously cannot know what will happen in the future.

I think that's very harsh, because while the kernel can't predict the
future, in many cases the sysadmin can, and some of the tools used to act
on that information are now gone due to "enhancement." Most particularly,
the free pages are now not settable, leaving the admin to diddle with
*unrelated* things trying to get correct function, instead of setting the
free required and letting the kernel do a balance between buffers, cache,
etc. So if I know I have an application which will need 12MB suddenly to
maintain good response, I have lost my tool to just tell the system that
much free is needed. And honestly the fact that the kernel makes good
overall choices pales when the worst case is so blatantly bad.
 
> The kernel _correctly_ swapped out tons of pages that weren't touched in
> a long long time. That's what you want to happen - the fact that they
> then all became active on logout is sad.

It reflects poor decisions in the kernel. To balance program and i/o pages
the kernel should track the i/o rate while increasing the cache used. When
the i/o rate stops getting better, the kernel should assume that the
program is not reusing the data pages at this time. Obviously this needs
hysteresis to keep the program vs. data ratio from changing too fast after
some good initial setting, but having a file copy or CD rip push programs
out of memory shows that the kernel is not making optimal use of the
information it has.
 
> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times). 
> 
> THAT is worth looking into. But blathering about "reverse mappings will
> help this" is just incredibly stupid. You seem to think that they are a
> panacea for all problems, ranging from MP3 playback to world peace and
> re-building the WTC.

Sorry, I think the problem is that the existing logic is just not working.
When you trade a small gain in overall performance for a really bad worst
case you are balancing a gain which is measured rather than felt with a
loss which is instantly painful.

Please rethink, the use-once is elegant, but it just doesn't work, and
until the kernel makes some effort to avoid paging out text for data when
it doesn't help performance you will have these ugly pauses. I will note
that we were doing just this type of balance of space in 1968 in GECOS (as
in the arcane GECOS password field).

Hopefully you will find this criticism constructive...

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: broken VM in 2.4.10-pre9
  2001-09-16 18:45       ` Stephan von Krawczynski
@ 2001-09-21  3:16         ` Bill Davidsen
  2001-09-21 10:21         ` Stephan von Krawczynski
  2001-09-21 10:43         ` Stephan von Krawczynski
  2 siblings, 0 replies; 133+ messages in thread
From: Bill Davidsen @ 2001-09-21  3:16 UTC (permalink / raw)
  To: linux-kernel

On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:


> Thinking again about it, I guess I would prefer a FIFO-list of allocated pages.
> This would allow one to "know" the age simply by its position in the list. You
> wouldn't need a timestamp then, and even better it works equally well for
> systems with high vm load and low, because you do not deal with absolute time
> comparisons, but relative.
> That sounds pretty good to me.

The problem is that when many things affect the optimal ratio of text,
data, buffer and free space, a solution which doesn't measure all the
important factors will produce sub-optimal results. Your proposal is
simple and elegant, but I think it's too simple to produce good results.
See my reply to Linus' comments.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.



* Re: broken VM in 2.4.10-pre9
  2001-09-16 18:45       ` Stephan von Krawczynski
  2001-09-21  3:16         ` Bill Davidsen
@ 2001-09-21 10:21         ` Stephan von Krawczynski
  2001-09-21 14:08           ` Bill Davidsen
  2001-09-21 10:43         ` Stephan von Krawczynski
  2 siblings, 1 reply; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-21 10:21 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Thu, 20 Sep 2001 23:16:55 -0400 (EDT) Bill Davidsen <davidsen@tmr.com>
wrote:

> On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:
> 
> 
> > Thinking again about it, I guess I would prefer a FIFO-list of allocated
> > pages. This would allow one to "know" the age simply by its position in
> > the list. You wouldn't need a timestamp then, and even better it works
> > equally well for systems with high vm load and low, because you do not
> > deal with absolute time comparisons, but relative.
> > That sounds pretty good to me.
> 
> The problem is that when many things affect the optimal ratio of text,
> data, buffer and free space, a solution which doesn't measure all the
> important factors will produce sub-optimal results. Your proposal is
> simple and elegant, but I think it's too simple to produce good results.
> See my reply to Linus' comments.

Actually I did not really propose a method of valuing the several pros and
cons in aging itself, but a very basic idea of how this could be done without
fiddling around with page->members (like page->age), which always implies you
have to walk down a whole list to get the full picture in case of urgent need
for freeable pages.
If you age something by re-arranging its position in a list you have the
drawback of list-locking, but the gain of quickly finding the best freeable
pages by simply using the first ones in that list. Even better, you can add
whatever criteria you like to this aging, e.g. you could rearrange to let
consecutive pages be freed together and so on; all would be pretty easy to
achieve, and the page struct becomes even smaller.
The more I think about it the better it sounds.
Your opinion?

Regards,
Stephan



* Re: broken VM in 2.4.10-pre9
  2001-09-16 18:45       ` Stephan von Krawczynski
  2001-09-21  3:16         ` Bill Davidsen
  2001-09-21 10:21         ` Stephan von Krawczynski
@ 2001-09-21 10:43         ` Stephan von Krawczynski
  2001-09-21 12:13           ` Rik van Riel
                             ` (3 more replies)
  2 siblings, 4 replies; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-21 10:43 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: linux-kernel

On Thu, 20 Sep 2001 23:16:55 -0400 (EDT) Bill Davidsen <davidsen@tmr.com>
wrote:

> On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:
> 
> 
> > Thinking again about it, I guess I would prefer a FIFO-list of allocated
> > pages. This would allow to "know" the age simply by its position in the
> > list. You wouldn't need a timestamp then, and even better it works equally
> > well for systems with high vm load and low, because you do not deal with
> > absolute time comparisons, but relative.
> > That sounds pretty good for me.
> 
> The problem is that when many things affect the optimal ratio of text,
> data, buffer and free space, a solution which doesn't measure all the
> important factors will produce sub-optimal results. Your proposal is
> simple and elegant, but I think it's too simple to produce good results.
> See my reply to Linus' comments.

Sorry to follow up to the same post again, but I just read across another
thread where people discuss heavily whether aging up by 3 and down by 1, or
vice versa, is better or worse. The real problem behind this is that they are
trying to bring some order to the pages by age. Unfortunately this cannot
really work out well, because you will _always_ end up with a few or a lot of
pages sharing the same age, which does not help you much in a situation where
you need to know which is the _best_ one to be dropped next. In this tie
situation you have nothing to rely on but the age and some rough guesses (or
even worse: performance issues, _not_ to walk the whole tree to find the best
fitting page). This comes _solely_ from the fact that the current function
gives you no strictly increasing order in page->age (is this correct English
for the mathematical term?), so it must be considered _bad_. A list, on the
other hand, is always ordered and therefore does not have this problem.
Shit, if only I were able to implement that. Can anybody help me to prove my
point?

Regards,
Stephan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:43         ` Stephan von Krawczynski
@ 2001-09-21 12:13           ` Rik van Riel
  2001-09-21 12:55           ` Stephan von Krawczynski
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-21 12:13 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: Bill Davidsen, linux-kernel

On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:

> Shit, if only I were able to implement that. Can anybody help me to
> prove my point?

Trying to implement your idea would probably pose a nice
counter-argument. Without measuring which pages are in
heavy use, how are you going to evict the right pages?

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:43         ` Stephan von Krawczynski
  2001-09-21 12:13           ` Rik van Riel
@ 2001-09-21 12:55           ` Stephan von Krawczynski
  2001-09-21 13:01             ` Rik van Riel
  2001-09-22 11:01           ` Daniel Phillips
  2001-09-24  9:36           ` Linux VM design VDA
  3 siblings, 1 reply; 133+ messages in thread
From: Stephan von Krawczynski @ 2001-09-21 12:55 UTC (permalink / raw)
  To: Rik van Riel; +Cc: davidsen, linux-kernel

On Fri, 21 Sep 2001 09:13:07 -0300 (BRST) Rik van Riel <riel@conectiva.com.br>
wrote:

> On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:
> 
> > Shit, if only I were able to implement that. Can anybody help me to
> > prove my point?
> 
> Trying to implement your idea would probably pose a nice
> counter-argument. Without measuring which pages are in
> heavy use, how are you going to evict the right pages?

Hi Rik,

The really beautiful thing about it is that you can divide it completely into
two parts:
1) basic list handling: you obviously need the list itself and some atomic
functions to queue/dequeue/requeue entries, possibly as well as
get_next_freeable() for simplicity. The rest of the VM only uses this to work.
2) the management "plugins", where you can do virtually any check of heavy use
or aging or buddy-finding or whatever comes to your mind and requeue
accordingly. You may do that on every alloc (surely not nice), or on page
hits, or on a low-mem condition (like page_launder), or in an independent
process (somewhat like kswapd) - whatever you tend to believe is the best
performing way - feel free to find the killer plugin :-).

BUT (and that's the really good point): (2) is completely independent in
structure and processing from the basic mem-handling, because the only
interaction is requeuing. That means as a first step (only experimental, of
course) you could just fill the list with addtail and shorten it on demand of
free pages with remhead (hope my short terms are understandable). This
implements a _very_ simple aging based only on the age of the allocation and
nothing else (FIFO). You can spend any amount of time and brain on refining
the strategy without ever touching the vm basics, _and_, because of the simple
and clean interface between (1) and (2), you have no chance to screw things up
(unless a buggy implementation drops entries). No obvious need for patches or
weird workarounds.

Your opinion?

Regards,
Stephan



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 12:55           ` Stephan von Krawczynski
@ 2001-09-21 13:01             ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-21 13:01 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: davidsen, linux-kernel

On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:

> The really beautiful thing about it is that you can divide it
> completely in two parts:

> Your opinion?

I'll believe it when I see it. Your idea is still very
abstract and I haven't seen you even start to talk about
the data structures used in the implementation.

(except for "list" showing up at random places in the text)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:21         ` Stephan von Krawczynski
@ 2001-09-21 14:08           ` Bill Davidsen
  2001-09-21 14:23             ` Rik van Riel
  0 siblings, 1 reply; 133+ messages in thread
From: Bill Davidsen @ 2001-09-21 14:08 UTC (permalink / raw)
  To: Stephan von Krawczynski; +Cc: linux-kernel

On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:

> On Thu, 20 Sep 2001 23:16:55 -0400 (EDT) Bill Davidsen <davidsen@tmr.com>
> wrote:
> 
> > On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:
	[... snip ...]
> > The problem is that when many things affect the optimal ratio of text,
> > data, buffer and free space, a solution which doesn't measure all the
> > important factors will produce sub-optimal results. Your proposal is
> > simple and elegant, but I think it's too simple to produce good results.
> > See my reply to Linus' comments.
> 
> Actually I did not really propose a method of weighing the several pros and
> cons in aging itself, but a very basic idea of how this could be done without
> fiddling around with page->members (like page->age), which always implies you
> have to walk down a whole list to get the full picture in case of urgent need
> for freeable pages.
> If you age something by re-arranging its position in a list, you have the
> drawback of list-locking, but the gain of finding the best freeable pages
> quickly by simply taking the first ones in that list. Even better, you can add
> whatever criteria you like to this aging, e.g. you could rearrange to let
> consecutive pages be freed together and so on; all of this would be pretty
> easy to achieve, and the page struct becomes even smaller.
> The more I think about it the better it sounds.
> Your opinion?

The list is an okay way to determine rank within a class, but I still
think there is a need for some balance between text, program data,
pages loaded via i/o, and perhaps more. My disquiet with the new
implementation is based on a desire to avoid swapping program data to make
room for i/o data (using those terms loosely, for identification).

I would also like to have time to investigate what happens if the pages
associated with a program load are handled in larger blocks, meta-pages
perhaps, which would at least cause many to be loaded at once on a page
fault, rather than faulting them in one at a time. I have to look at the
code again in my spare time; my last serious visit was 2.2.15 or so,
looking to improve SMP performance.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 14:08           ` Bill Davidsen
@ 2001-09-21 14:23             ` Rik van Riel
  2001-09-23 13:13               ` Eric W. Biederman
  0 siblings, 1 reply; 133+ messages in thread
From: Rik van Riel @ 2001-09-21 14:23 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Stephan von Krawczynski, linux-kernel

On Fri, 21 Sep 2001, Bill Davidsen wrote:

> The list is an okay way to determine rank within a class, but I still
> think that there is a need for some balance between text, program data,
> pages loaded via i/o, perhaps more. My disquiet with the new
> implementation is based on a desire to avoid swapping program data to make
> room for i/o data (using those terms in a loose way for identification).

Preference for evicting one kind of cache is indeed a bad
thing. It might work for 90% of the workloads, but you can
be sure it breaks horribly for the other 10%.

I'm currently busy tweaking the old 2.4 VM (in the -ac kernels)
to try and get optimal performance from that one, without giving
preference to one kind of cache ... except in the situation where
the amount of cache is excessive.

> I would also like to have time to investigate what happens if the pages
> associated with a program load are handled in larger blocks, meta-pages
> perhaps, which would at least cause many to be loaded at once on a page
> fault, rather than faulting them in one at a time.

This is an interesting thing, too. Something to look into for
2.5 and if it turns out simple enough we may even want to
backport it to 2.4.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:43         ` Stephan von Krawczynski
  2001-09-21 12:13           ` Rik van Riel
  2001-09-21 12:55           ` Stephan von Krawczynski
@ 2001-09-22 11:01           ` Daniel Phillips
  2001-09-22 20:05             ` Rik van Riel
  2001-09-24  9:36           ` Linux VM design VDA
  3 siblings, 1 reply; 133+ messages in thread
From: Daniel Phillips @ 2001-09-22 11:01 UTC (permalink / raw)
  To: Stephan von Krawczynski, Bill Davidsen; +Cc: linux-kernel

On September 21, 2001 12:43 pm, Stephan von Krawczynski wrote:
> Sorry to follow up to the same post again, but I just read across another
> thread where people discuss heavily whether aging up by 3 and down by 1, or
> vice versa, is better or worse. The real problem behind this is that they are
> trying to bring some order to the pages by age. Unfortunately this cannot
> really work out well, because you will _always_ end up with a few or a lot of
> pages sharing the same age, which does not help you much in a situation where
> you need to know which is the _best_ one to be dropped next. In this tie
> situation you have nothing to rely on but the age and some rough guesses (or
> even worse: performance issues, _not_ to walk the whole tree to find the best
> fitting page). This comes _solely_ from the fact that the current function
> gives you no strictly increasing order in page->age (is this correct English
> for the mathematical term?), so it must be considered _bad_. A list, on the
> other hand, is always ordered and therefore does not have this problem.
> Shit, if only I were able to implement that. Can anybody help me to prove my
> point?

You got your wish.  Andrea's mm patch introduced into the main tree at
2.4.10-pre11 uses a standard LRU:

+               if (PageTestandClearReferenced(page)) {
+                       if (PageInactive(page)) {
+                               del_page_from_inactive_list(page);
+                               add_page_to_active_list(page);
+                       } else if (PageActive(page)) {
+                               list_del(entry);
+                               list_add(entry, &active_list);

<musings>
There are arguments about whether page aging can be superior to standard LRU,
and I personally believe it can be, but there's no question that ordinary LRU
is a lot easier to implement correctly and will perform a lot better than
incorrectly implemented/untuned page aging.

The arguments in support of aging over LRU that I'm aware of are:

  - incrementing an age is more efficient than resetting several LRU list links
  - also captures some frequency-of-use information
  - it can be implemented in hardware (not that that matters much)
  - allows more scope for tuning/balancing (and also rope to hang oneself)

The big problem with aging is that unless it's entirely correctly balanced it's
just not going to work very well.  To balance it well requires knowing a lot
about rates of list scanning and so on.  Matt Dillon perfected this art in BSD,
but we never did, being preoccupied with things like just getting the mm
scanners to activate when required, and sorting out our special complexities
like zones and highmem buffers.  Probably another few months of working on it
would let us get past the remaining structural problems and actually start
tuning it, but we've already made people wait way too long for a stable 2.4.
A more robust strategy makes a lot of sense right now.  We can still play with
stronger magic in 2.5, and of course Rik's aging strategy will continue to be
developed in Alan's tree while Andrea's is still going through the test of
fire.
</musings>
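For readers following along, the up-by-N/down-by-M aging being debated works
roughly like this (a simplified userspace sketch with illustrative constants;
the real kernel code is more involved and tuned differently):

```c
/* Page aging in miniature: a scanner periodically visits each page;
 * referenced pages age up, idle pages age down, and a page that hits
 * age 0 is deactivated (made a reclaim candidate).  The constants are
 * made up for illustration -- their balance is exactly what the
 * thread is arguing about. */
#include <assert.h>

#define AGE_START 2
#define AGE_UP    3	/* one debated choice: age up by 3 ...  */
#define AGE_DOWN  1	/* ... and down by 1 per aging pass     */
#define AGE_MAX   20

struct apage {
	int age;
	int referenced;		/* stand-in for the hardware reference bit */
	int active;
};

static void age_page_up(struct apage *p)
{
	p->age += AGE_UP;
	if (p->age > AGE_MAX)
		p->age = AGE_MAX;
}

/* one pass of the background scanner over a page */
static void age_page_scan(struct apage *p)
{
	if (p->referenced) {
		p->referenced = 0;
		age_page_up(p);
	} else if (p->age > AGE_DOWN) {
		p->age -= AGE_DOWN;
	} else {
		p->age = 0;
		p->active = 0;	/* deactivated: candidate for eviction */
	}
}
```

Note how ties arise naturally here: every untouched page loses the same
AGE_DOWN per pass, so whole groups reach the same age together, which is the
ordering problem the list-based alternative avoids.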

I'll keep reading Andrea's code and maybe I'll be able to shed some more light
on the algorithms he's using, since he doesn't seem to be in a big hurry to
do that himself.  (Hi Andrea ;-)

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-22 11:01           ` Daniel Phillips
@ 2001-09-22 20:05             ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-22 20:05 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Stephan von Krawczynski, Bill Davidsen, linux-kernel

On Sat, 22 Sep 2001, Daniel Phillips wrote:

> I'll keep reading Andrea's code and maybe I'll be able to shed some
> more light on the algorithms he's using, since he doesn't seem to be
> in a big hurry to do that himself.  (Hi Andrea ;-)

Heh, this'll probably lead to the same maintenance nightmare we
had in 2.2 (undocumented code and nobody even agreeing on exactly
what the code does, let alone what it's supposed to do).

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 14:23             ` Rik van Riel
@ 2001-09-23 13:13               ` Eric W. Biederman
  2001-09-23 13:27                 ` Rik van Riel
  0 siblings, 1 reply; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-23 13:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Bill Davidsen, Stephan von Krawczynski, linux-kernel

Rik van Riel <riel@conectiva.com.br> writes:

> > I would also like to have time to investigate what happens if the pages
> > associated with a program load are handled in larger blocks, meta-pages
> > perhaps, which would at least cause many to be loaded at once on a page
> > fault, rather than faulting them in one at a time.
> 
> This is an interesting thing, too. Something to look into for
> 2.5 and if it turns out simple enough we may even want to
> backport it to 2.4.

filemap_nopage already does all of this except put the page in the
page table.


Eric

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-23 13:13               ` Eric W. Biederman
@ 2001-09-23 13:27                 ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-23 13:27 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: Bill Davidsen, Stephan von Krawczynski, linux-kernel

On 23 Sep 2001, Eric W. Biederman wrote:
> Rik van Riel <riel@conectiva.com.br> writes:

> > > I would also like to have time to investigate what happens if the pages
> > > associated with a program load are handled in larger blocks, meta-pages
> > > perhaps, which would at least cause many to be loaded at once on a page
> > > fault, rather than faulting them in one at a time.
> >
> > This is an interesting thing, too. Something to look into for
> > 2.5 and if it turns out simple enough we may even want to
> > backport it to 2.4.
>
> filemap_nopage already does all of this except put the page in the
> page table.

Exactly, there are two things we need to fix:

1) set up the page tables in a clustered way
2) make filemap_nopage() aware of sequential IO and
   teach it to do asynchronous readahead ... maybe even
   with drop-behind at the VMA level?

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Linux VM design
  2001-09-21 10:43         ` Stephan von Krawczynski
                             ` (2 preceding siblings ...)
  2001-09-22 11:01           ` Daniel Phillips
@ 2001-09-24  9:36           ` VDA
  2001-09-24 11:06             ` Dave Jones
                               ` (5 more replies)
  3 siblings, 6 replies; 133+ messages in thread
From: VDA @ 2001-09-24  9:36 UTC (permalink / raw)
  To: Andrea Arcangeli, Rik van Riel, Alexander Viro, Daniel Phillips
  Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3776 bytes --]

Hi VM folks,

I'd like to understand the Linux VM but there's not much in
Documentation/vm/* on the subject. I understand that with the current
frantic development pace it is hard to maintain such docs.

However, with only a handful of people really understanding how the VM
works, we risk ending up in a situation where nobody will know how to fix
it (didn't Andrea's recent VM rewrite just replace a large part of a hardly
maintainable, not-quite-right VM in 2.4.10?)

When I have a stalled problem to solve, I sometimes catch an
unsuspecting victim and start explaining what I am trying to do and
how I'm doing it. Often, in the middle of my explanation, I realize
myself what I did wrong. There is an old teacher's joke:
"My pupils are dumb! I explained them this theme once, then twice, I
finally myself understood it, and they still don't".
        ^^^^^^^^^^^^^^^^^^^^
Since we reached some kind of stability with 2.4, if
Andrea, Rik and whoever else considers himself a VM geek
would tell us not-so-clever lkml readers how the VM works and put it in
vm-2.4andrea, vm-2.4rik or whatever in Doc/vm/*,
I will be unbelievably happy. Matt Dillon's post belongs there too.

I have an example of how I would describe the VM if I knew anything about it.
I am putting it in the zip attachment just to reduce the number of
people laughing at how stupid I am :-). Most lkml readers won't open
it, I hope :-).

If the VM geeks disagree with each other on some VM inner workings,
they can describe their views in separate files, giving readers the
ability to compare the VM designs. Maybe these files will evolve into
VM FAQs.

Saturday, September 22, 2001, 2:01:02 PM,
Daniel Phillips <phillips@bonn-fries.net> wrote:
DP> The arguments in support of aging over LRU that I'm aware of are:

DP>   - incrementing an age is more efficient than resetting several LRU list links
DP>   - also captures some frequency-of-use information

Of what use can this info be? If one page is accessed 100 times/second
and the other once in 10 seconds, they both have to stay in RAM.
The VM should take 'time since last access' into account when deciding
which page to swap out, not how often the page was referenced.

DP>   - it can be implemented in hardware (not that that matters much)
DP>   - allows more scope for tuning/balancing (and also rope to hang oneself)

DP> The big problem with aging is that unless it's entirely correctly balanced its
DP> just not going to work very well.  To balance it well requires knowing a lot
DP> about rates of list scanning and so on.  Matt Dillon perfected this art in BSD,
DP> but we never did, being preoccupied with things like just getting the mm
DP> scanners to activate when required, and sorting out our special complexities
DP> like zones and highmem buffers.  Probably another few months of working on it
DP> would let us get past the remaining structural problems and actually start
DP> tuning it, but we've already made people wait way too long for a stable 2.4.
DP> A more robust strategy makes a lot of sense right now.  We can still play with
DP> stronger magic in 2.5, and of course Rik's aging strategy will continue to be
DP> developed in Alan's tree while Andrea's is still going through the test of
DP> fire.
DP> </musings>

DP> I'll keep reading Andrea's code and maybe I'll be able to shed some more light
DP> on the algorithms he's using, since he doesn't seem to be in a big hurry to
DP> do that himself.  (Hi Andrea ;-)

DP> --
DP> Daniel




-- 
Best regards, VDA
mailto:VDA@port.imtp.ilyichevsk.odessa.ua

[-- Attachment #2: Vm-dumb.zip --]
[-- Type: application/x-zip-compressed, Size: 2006 bytes --]

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24  9:36           ` Linux VM design VDA
@ 2001-09-24 11:06             ` Dave Jones
  2001-09-24 12:15               ` Kirill Ratkin
  2001-09-24 13:29             ` Rik van Riel
                               ` (4 subsequent siblings)
  5 siblings, 1 reply; 133+ messages in thread
From: Dave Jones @ 2001-09-24 11:06 UTC (permalink / raw)
  To: VDA; +Cc: Linux Kernel Mailing List

On Mon, 24 Sep 2001, VDA wrote:

> I'd like to understand Linux VM but there's not much in
> Documentation/vm/* on the subject. I understand that with current
> frantic development pace it is hard to maintain such docs.

In case you're not aware of it, http://linux-mm.org/wiki/moin.cgi
is starting to fill out with documentation/ideas/etc on VM strategies
past, present and future.

regards,

Dave.

-- 
| Dave Jones.        http://www.suse.de/~davej
| SuSE Labs


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 11:06             ` Dave Jones
@ 2001-09-24 12:15               ` Kirill Ratkin
  0 siblings, 0 replies; 133+ messages in thread
From: Kirill Ratkin @ 2001-09-24 12:15 UTC (permalink / raw)
  To: Dave Jones, VDA; +Cc: Linux Kernel Mailing List


--- Dave Jones <davej@suse.de> wrote:
> On Mon, 24 Sep 2001, VDA wrote:
> 
> > I'd like to understand Linux VM but there's not much in
> > Documentation/vm/* on the subject. I understand that with current
> > frantic development pace it is hard to maintain such docs.
> 
> In case you're not aware of it, http://linux-mm.org/wiki/moin.cgi
> is starting to fill out with documentation/ideas/etc on VM strategies
> past, present and future.
> 
> regards,
> 
> Dave.
> 
> -- 
> | Dave Jones.        http://www.suse.de/~davej
> | SuSE Labs
> 

And here:
http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html



__________________________________________________
Do You Yahoo!?
Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24  9:36           ` Linux VM design VDA
  2001-09-24 11:06             ` Dave Jones
@ 2001-09-24 13:29             ` Rik van Riel
  2001-09-24 14:05               ` VDA
  2001-09-24 18:37             ` Daniel Phillips
                               ` (3 subsequent siblings)
  5 siblings, 1 reply; 133+ messages in thread
From: Rik van Riel @ 2001-09-24 13:29 UTC (permalink / raw)
  To: VDA; +Cc: Andrea Arcangeli, Alexander Viro, Daniel Phillips, linux-kernel

On Mon, 24 Sep 2001, VDA wrote:

> I'd like to understand Linux VM but there's not much in
> Documentation/vm/* on the subject.

http://linux-mm.org/ has some stuff and I wrote a freenix
paper on the subject as well http://www.surriel.com/lectures/.

> Since we reached some kind of stability with 2.4, maybe
> Andrea, Rik and whoever else is considering himself VM geek
> would tell us not-so-clever lkml readers how VM works and put it in
> vm-2.4andrea, vm-2.4rik or whatever in Doc/vm/*,
> I will be unbelievably happy. Matt Dillon's post belongs there too.

http://linux-mm.org/

The only thing missing is an explanation of Andrea's
VM, but knowing Andrea's enthusiasm at documentation
I wouldn't really count on that any time soon ;)

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 13:29             ` Rik van Riel
@ 2001-09-24 14:05               ` VDA
  2001-09-24 14:37                 ` Rik van Riel
  2001-09-24 14:42                 ` Rik van Riel
  0 siblings, 2 replies; 133+ messages in thread
From: VDA @ 2001-09-24 14:05 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

Hello Rik,
Monday, September 24, 2001, 4:29:46 PM, you wrote:

>> Since we reached some kind of stability with 2.4, maybe
>> Andrea, Rik and whoever else is considering himself VM geek
>> would tell us not-so-clever lkml readers how VM works and put it in
>> vm-2.4andrea, vm-2.4rik or whatever in Doc/vm/*,
>> I will be unbelievably happy. Matt Dillon's post belongs there too.

RvR> http://linux-mm.org/

I was there today. Good. Can this stuff be placed as
Doc/vm/vm-2.4rik
to prevent it from becoming outdated in 2-3 months?
Linus?

Also I'd like to be enlightened why this:

>Virtual Memory Management Policy
>--------------------------------
>The basic principle of the Linux VM system is page aging. We've seen
>that refill_inactive_scan() is invoked periodically to try to
>deactivate pages, and that it ages pages down as it does so,
>deactivating them when their age reaches 0. We've also seen that
>swap_out() will age referenced page frames up while scanning process
>memory maps. This is the fundamental mechanism for VM resource
>balancing in Linux: pages are aged down at a more-or-less steady rate,
>and deactivated when they become sufficiently old; but processes can
>keep pages "young" by referencing them frequently.

is better than plain simple LRU?

We definitely need a VM FAQ to have these questions answered once per VM
design, not once per week :-)

RvR> The only thing missing is an explanation of Andrea's
RvR> VM, but knowing Andrea's enthusiasm at documentation
RvR> I wouldn't really count on that any time soon ;)

:-)

-- 
Best regards, VDA
mailto:VDA@port.imtp.ilyichevsk.odessa.ua



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 14:05               ` VDA
@ 2001-09-24 14:37                 ` Rik van Riel
  2001-09-24 14:42                 ` Rik van Riel
  1 sibling, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-24 14:37 UTC (permalink / raw)
  To: VDA; +Cc: linux-kernel

On Mon, 24 Sep 2001, VDA wrote:

> RvR> http://linux-mm.org/
>
> I was there today. Good. Can this stuff be placed as
> Doc/vm/vm-2.4rik
> to prevent it from becoming outdated in 2-3 months?

Putting documents in the kernel tree has never worked
as a means of keeping them up to date.

Unless, of course, you're volunteering to keep them
up to date ;)

> Also I'd like to be enlightened why this:
>
> >Virtual Memory Management Policy
> >--------------------------------
> >The basic principle of the Linux VM system is page aging.

> is better than plain simple LRU?
>
> We definitely need a VM FAQ to have these questions answered once per VM
> design, not once per week :-)



> RvR> The only thing missing is an explanation of Andrea's
> RvR> VM, but knowing Andrea's enthusiasm at documentation
> RvR> I wouldn't really count on that any time soon ;)
>
> :-)
>
> --
> Best regards, VDA
> mailto:VDA@port.imtp.ilyichevsk.odessa.ua
>
>

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 14:05               ` VDA
  2001-09-24 14:37                 ` Rik van Riel
@ 2001-09-24 14:42                 ` Rik van Riel
  1 sibling, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-24 14:42 UTC (permalink / raw)
  To: VDA; +Cc: linux-kernel

[grrrrr, the dog was sitting against my arm and I pressed the
wrong key ;)]

On Mon, 24 Sep 2001, VDA wrote:

> >Virtual Memory Management Policy
> >--------------------------------
> >The basic principle of the Linux VM system is page aging.

> is better than plain simple LRU?

All research I've seen indicates that it's better to take
frequency into account as well, instead of only access
recency.

Plain LRU just breaks down under sequential IO. LRU with
a large enough inactive list should hold up decently under
streaming IO, but only a replacement strategy which takes
access frequency into account too will be able to make
proper decisions as to which pages to keep in memory and
which pages to throw out.

Note that it's not me making this up, it's simply the info
I've seen everywhere ... I don't like reinventing the wheel ;)
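A toy demonstration of the failure mode (entirely my construction, not code
from any kernel): one sequential scan flushes a small hot working set out of a
plain LRU cache, while a policy that requires a second access before promotion
keeps it resident.

```c
/* Two tiny caches of the same capacity, fed the same access stream. */
#include <assert.h>
#include <string.h>

#define CAP 4

/* plain LRU: pg[0] is most recently used */
struct lru { int pg[CAP]; int n; };

static void lru_access(struct lru *c, int page)
{
	int i;

	for (i = 0; i < c->n; i++)
		if (c->pg[i] == page)
			break;
	if (i == c->n && c->n < CAP)
		c->n++;			/* room left: just grow */
	else if (i == c->n)
		i = CAP - 1;		/* full: evict least recent */
	memmove(&c->pg[1], &c->pg[0], i * sizeof(int));
	c->pg[0] = page;
}

static int lru_has(struct lru *c, int page)
{
	for (int i = 0; i < c->n; i++)
		if (c->pg[i] == page)
			return 1;
	return 0;
}

/* frequency-aware variant: a page must be seen twice before entering
 * the protected area; single-use pages only compete for one
 * probationary slot, so a streaming scan cannot displace hot pages. */
struct freq { int prot[CAP - 1]; int n; int probation; };

static void freq_access(struct freq *c, int page)
{
	for (int i = 0; i < c->n; i++)
		if (c->prot[i] == page)
			return;			/* already protected */
	if (c->probation == page) {		/* second access: promote */
		if (c->n < CAP - 1)
			c->prot[c->n++] = page;
		c->probation = 0;
	} else {
		c->probation = page;		/* first access: probation */
	}
}

static int freq_has(struct freq *c, int page)
{
	for (int i = 0; i < c->n; i++)
		if (c->prot[i] == page)
			return 1;
	return c->probation == page;
}
```

The promotion rule here is the crudest possible frequency signal; real
designs (2Q, LRU-2, the kernel's active/inactive lists) refine the same idea.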

> We definitely need a VM FAQ to have these questions answered once per VM
> design, not once per week :-)

Go ahead, make one on the Linux-MM wiki:

	http://linux-mm.org/wiki/

(note that for some reason the thing gives an internal
server error once in a while ... I haven't yet been able
to find a pattern to it, so it's not fixed yet)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 19:32               ` Rik van Riel
@ 2001-09-24 17:27                 ` Rob Landley
  2001-09-24 21:48                   ` Rik van Riel
  2001-09-25  9:58                 ` Daniel Phillips
  1 sibling, 1 reply; 133+ messages in thread
From: Rob Landley @ 2001-09-24 17:27 UTC (permalink / raw)
  To: Rik van Riel, Daniel Phillips
  Cc: VDA, Andrea Arcangeli, Alexander Viro, linux-kernel

On Monday 24 September 2001 15:32, Rik van Riel wrote:
> On Mon, 24 Sep 2001, Daniel Phillips wrote:
> > To tell the truth, I don't really see why the frequency
> > information is all that useful either.
> >
> > So the list of reasons why aging is good is looking really short.
>
> Ummmm, that _you_ can't see it doesn't mean suddenly all
> VM research from the last 15 years has been invalidated.

Out of morbid curiosity, how much of that research either said or assumed 
that microkernels were a good idea?

Rob

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24  9:36           ` Linux VM design VDA
  2001-09-24 11:06             ` Dave Jones
  2001-09-24 13:29             ` Rik van Riel
@ 2001-09-24 18:37             ` Daniel Phillips
  2001-09-24 19:32               ` Rik van Riel
  2001-09-25 16:03               ` bill davidsen
  2001-09-24 18:46             ` Jonathan Morton
                               ` (2 subsequent siblings)
  5 siblings, 2 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-24 18:37 UTC (permalink / raw)
  To: VDA, Andrea Arcangeli, Rik van Riel, Alexander Viro; +Cc: linux-kernel

On September 24, 2001 11:36 am, VDA wrote:
> Daniel Phillips <phillips@bonn-fries.net> wrote:
> DP> The arguments in support of aging over LRU that I'm aware of are:
> 
> DP>   - incrementing an age is more efficient than resetting several LRU 
> DP>     list links
> DP>   - also captures some frequency-of-use information
> 
> Of what use this info can be? If one page is accessed 100 times/second
> and other one once in 10 seconds, they both have to stay in RAM.
> VM should take 'time since last access' into account whan deciding
> which page to swap out, not how often it was referenced.

You might want to have a look at this:

   http://archi.snu.ac.kr/jhkim/seminar/96-004.ps
   (lrfu algorithm)

To tell the truth, I don't really see why the frequency information is all
that useful either.  Rik suggested it's good for streaming IO but we already 
have effective means of dealing with that that don't rely on any frequency 
information.
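For context, the LRFU policy in that paper scores each page with a single Combined Recency and Frequency (CRF) value that is updated incrementally on every reference (a sketch of the paper's update rule; variable names are mine):

```python
def crf_on_reference(crf_old, delta, lam):
    """LRFU update: CRF_new = F(0) + F(delta) * CRF_old,
    with weighting function F(x) = (1/2) ** (lam * x).
    delta is the time since the page's previous reference."""
    F = lambda x: 0.5 ** (lam * x)
    return F(0) + F(delta) * crf_old

# lam = 0 never decays old references: CRF is a pure count (LFU-like).
assert crf_on_reference(5.0, 3.0, 0.0) == 6.0
# A large lam forgets the past almost immediately (LRU-like).
assert abs(crf_on_reference(5.0, 3.0, 10.0) - 1.0) < 1e-6
```

At replacement time the paper decays every candidate's CRF to the current time and evicts the minimum, so a single parameter lam slides the policy between LFU-like and LRU-like behaviour.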

So the list of reasons why aging is good is looking really short.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24  9:36           ` Linux VM design VDA
                               ` (2 preceding siblings ...)
  2001-09-24 18:37             ` Daniel Phillips
@ 2001-09-24 18:46             ` Jonathan Morton
  2001-09-24 19:16               ` Daniel Phillips
  2001-09-24 19:11             ` Dan Mann
  2001-09-25 10:55             ` VDA
  5 siblings, 1 reply; 133+ messages in thread
From: Jonathan Morton @ 2001-09-24 18:46 UTC (permalink / raw)
  To: Daniel Phillips, VDA, Andrea Arcangeli, Rik van Riel, Alexander Viro
  Cc: linux-kernel

>  > DP> The arguments in support of aging over LRU that I'm aware of are:
>>
>>  DP>   - incrementing an age is more efficient than resetting several LRU
>>  DP>     list links
>>  DP>   - also captures some frequency-of-use information
>>
>>  Of what use this info can be? If one page is accessed 100 times/second
>>  and other one once in 10 seconds, they both have to stay in RAM.
>>  VM should take 'time since last access' into account when deciding
>>  which page to swap out, not how often it was referenced.
>
>You might want to have a look at this:
>
>    http://archi.snu.ac.kr/jhkim/seminar/96-004.ps
>    (lrfu algorithm)
>
>To tell the truth, I don't really see why the frequency information is all
>that useful either.  Rik suggested it's good for streaming IO but we already
>have effective means of dealing with that that don't rely on any frequency
>information.
>
>So the list of reasons why aging is good is looking really short.

It's not really frequency information.  If a page is accessed 1000
times during a single schedule cycle, that still counts as only a
single increment in its age when the time comes.  However, *macro*
frequency information of this type *is* useful in the case where
thrashing is taking place.  You want to swap out the page that is
accessed only once every other schedule cycle before the one accessed
every cycle.  This is of course moot if one process is being suspended
(as it probably should be), but the criteria for suspension might
include this access information.

-- 
--------------------------------------------------------------
from:     Jonathan "Chromatix" Morton
mail:     chromi@cyberspace.org  (not for attachments)
website:  http://www.chromatix.uklinux.net/vnc/
geekcode: GCS$/E dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$
           V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*)
tagline:  The key to knowledge is not to rely on people to teach you it.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24  9:36           ` Linux VM design VDA
                               ` (3 preceding siblings ...)
  2001-09-24 18:46             ` Jonathan Morton
@ 2001-09-24 19:11             ` Dan Mann
  2001-09-25 10:55             ` VDA
  5 siblings, 0 replies; 133+ messages in thread
From: Dan Mann @ 2001-09-24 19:11 UTC (permalink / raw)
  To: VDA; +Cc: linux-kernel

I hope this isn't the wrong place to ask this, but wouldn't it be better to
increase RAM size and decrease swap size as memory requirements grow?  For
instance, say I have a lightly loaded machine that has 192MB of RAM.  From
everything I've heard in the past, I'd use roughly 192MB of swap with this
machine.  The problem, I would imagine, is that if all 192MB got used, wouldn't
it be terribly slow to read/write that much data back in?  Would less swap,
say 32 MB, make the kernel more restrictive with its available memory and
make the box more responsive when it's heavily using swap?

Or am I way off and just smoking crack?  (which I may very well be)

This damn mailing list is addictive.  Now I read it at work.

Dan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 18:46             ` Jonathan Morton
@ 2001-09-24 19:16               ` Daniel Phillips
  0 siblings, 0 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-24 19:16 UTC (permalink / raw)
  To: Jonathan Morton, VDA, Andrea Arcangeli, Rik van Riel, Alexander Viro
  Cc: linux-kernel

On September 24, 2001 08:46 pm, Jonathan Morton wrote:
> >  > DP> The arguments in support of aging over LRU that I'm aware of are:
> >>
> >>  DP>   - incrementing an age is more efficient than resetting several LRU
> >>  DP>     list links
> >>  DP>   - also captures some frequency-of-use information
> >>
> >>  Of what use this info can be? If one page is accessed 100 times/second
> >>  and other one once in 10 seconds, they both have to stay in RAM.
> >>  VM should take 'time since last access' into account when deciding
> >>  which page to swap out, not how often it was referenced.
> >
> >You might want to have a look at this:
> >
> >    http://archi.snu.ac.kr/jhkim/seminar/96-004.ps
> >    (lrfu algorithm)
> >
> >To tell the truth, I don't really see why the frequency information is all
> >that useful either.  Rik suggested it's good for streaming IO but we already
> >have effective means of dealing with that that don't rely on any frequency
> >information.
> >
> >So the list of reasons why aging is good is looking really short.
> 
> It's not really frequency information.  If a page is accessed 1000 
> times during a single schedule cycle, that will count as a single 
> increment in the age come the time.  However, *macro* frequency 
> information of this type *is* useful in the case where thrashing is 
> taking place.  You want to swap out the page that is accessed only 
> once every other schedule cycle, before the one accessed every cycle. 

But this happens naturally with LRU.  Think how it works: to get evicted a 
page has to progress all the way from the head to the tail of the LRU list.  
Any page that's accessed frequently is going to keep being put back at the 
head of the list, and only infrequently accessed pages will drop off the tail.

> This is of course moot if one process is being suspended (as it 
> probably should), but the criteria for suspension might include this 
> access information.

OK, that does get at something you can do with aging that you can't do with 
an LRU list: look at the weightings of random pages.  You can't do that with 
the LRU list because there's no efficient way to determine which position a 
page holds in the list.

One application where you would want to know the weightings of random pages 
is defragmentation.  That might become important in the future but we're not 
doing it now.

A little contemplation will probably turn up other uses for this special 
property.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 18:37             ` Daniel Phillips
@ 2001-09-24 19:32               ` Rik van Riel
  2001-09-24 17:27                 ` Rob Landley
  2001-09-25  9:58                 ` Daniel Phillips
  2001-09-25 16:03               ` bill davidsen
  1 sibling, 2 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-24 19:32 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: VDA, Andrea Arcangeli, Alexander Viro, linux-kernel

On Mon, 24 Sep 2001, Daniel Phillips wrote:

> To tell the truth, I don't really see why the frequency
> information is all that useful either.

> So the list of reasons why aging is good is looking really short.

Ummmm, that _you_ can't see it doesn't mean suddenly all
VM research from the last 15 years has been invalidated.

cheers,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 17:27                 ` Rob Landley
@ 2001-09-24 21:48                   ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-24 21:48 UTC (permalink / raw)
  To: Rob Landley
  Cc: Daniel Phillips, VDA, Andrea Arcangeli, Alexander Viro, linux-kernel

On Mon, 24 Sep 2001, Rob Landley wrote:

> Out of morbid curiosity, how much of that research either said
> or assumed that microkernels were a good idea?

*grin*

None that I can remember even dealt with this. The
page replacement research I've read covered both generic
OS and database page replacement, maybe 50 to 100
papers total...

cheers,

Rik
--
IA64: a worthy successor to the i860.

		http://www.surriel.com/
http://www.conectiva.com/	http://distro.conectiva.com/


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 19:32               ` Rik van Riel
  2001-09-24 17:27                 ` Rob Landley
@ 2001-09-25  9:58                 ` Daniel Phillips
  1 sibling, 0 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-25  9:58 UTC (permalink / raw)
  To: Rik van Riel; +Cc: VDA, Andrea Arcangeli, Alexander Viro, linux-kernel

On September 24, 2001 09:32 pm, Rik van Riel wrote:
> On Mon, 24 Sep 2001, Daniel Phillips wrote:
> > To tell the truth, I don't really see why the frequency
> > information is all that useful either.
> 
> > So the list of reasons why aging is good is looking really short.
> 
> Ummmm, that _you_ can't see it doesn't mean suddenly all
> VM research from the last 15 years has been invalidated.

Did you have some more reasons to add to the list?

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24  9:36           ` Linux VM design VDA
                               ` (4 preceding siblings ...)
  2001-09-24 19:11             ` Dan Mann
@ 2001-09-25 10:55             ` VDA
  5 siblings, 0 replies; 133+ messages in thread
From: VDA @ 2001-09-25 10:55 UTC (permalink / raw)
  To: Dan Mann; +Cc: linux-kernel

Hello Dan,
Monday, September 24, 2001, 10:11:08 PM, you wrote:

DM> I hope this isn't the wrong place to ask this but,  wouldn't it be better to
DM> increase ram size and decrease swap size as memory requirements grow?  For
DM> instance, say I have a lightly loaded machine, that has 192MB of ram.  From
DM> everything I've heard in the past, I'd use roughly 192MB of swap with this
DM> machine.  The problem I would imagine is that if all 192MB got used wouldn't
DM> it be terribly slow to read/write that much data back in?  Would less swap,
DM> say 32 MB make the kernel more restrictive with it's available memory and
DM> make the box more responsive when it's heavily using swap?

If you want everything to be fast, buy more RAM and use no swap whatsoever.

Swap is useful if your total memory requirements are big but your working
set is significantly smaller. You need RAM to cover the working set
and RAM+swap to cover the total memory requirements.

As you can see, the right amount of RAM and swap is thus *application dependent*.
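That rule of thumb can be written down as a toy sizing check (illustrative numbers only): RAM must cover the working set or the box thrashes regardless of swap, and swap only needs to cover whatever part of the total footprint RAM does not.

```python
def min_swap_mb(ram_mb, working_set_mb, total_mb):
    """Smallest swap that makes the workload fit, or None if RAM
    can't even hold the working set (no amount of swap helps then)."""
    if ram_mb < working_set_mb:
        return None               # thrashing territory
    return max(0, total_mb - ram_mb)

assert min_swap_mb(192, 150, 300) == 108  # 192 MB RAM, 150 MB working set
assert min_swap_mb(512, 150, 300) == 0    # everything fits in RAM
assert min_swap_mb(128, 150, 300) is None # working set doesn't fit
```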
-- 
Best regards, VDA
mailto:VDA@port.imtp.ilyichevsk.odessa.ua



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: Linux VM design
  2001-09-24 18:37             ` Daniel Phillips
  2001-09-24 19:32               ` Rik van Riel
@ 2001-09-25 16:03               ` bill davidsen
  1 sibling, 0 replies; 133+ messages in thread
From: bill davidsen @ 2001-09-25 16:03 UTC (permalink / raw)
  To: linux-kernel

In article <20010924182948Z16175-2757+1593@humbolt.nl.linux.org> phillips@bonn-fries.net wrote:

| You might want to have a look at this:
| 
|    http://archi.snu.ac.kr/jhkim/seminar/96-004.ps
|    (lrfu algorithm)
| 
| To tell the truth, I don't really see why the frequency information is all
| that useful either.  Rik suggested it's good for streaming IO but we already 
| have effective means of dealing with that that don't rely on any frequency 
| information.

  A count which may actually be useful is a count of how many times the
page has been swapped in (after being swapped out) as a predictor that
it will be a good page to keep. The problem is that there are many
things which help, and I don't think we have the balance quite right
yet. I suspect that there needs to be some hysteresis and runtime tuning
over seconds to get optimal performance. Of course systems with really
odd loads will still need to have hand tuning, and the /proc/sys
interface should include sensible ways to do this.

| So the list of reasons why aging is good is looking really short.

  The primary reason on my list is that under some load conditions it
produces much better response. Note that I didn't say all conditions,
before you rush to disagree with me. Sometimes people will trade a
little steady-state performance to avoid a really bad worst case.

  How the problem is solved really isn't the issue, but responsiveness
is important. Right now it seems some people are reporting that their
loads work better with aging.

-- 
bill davidsen <davidsen@tmr.com>
 "If I were a diplomat, in the best case I'd go hungry.  In the worst
  case, people would die."
		-- Robert Lipe

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-26 23:44               ` Pavel Machek
  2001-09-27 13:52                 ` Eric W. Biederman
@ 2001-10-01 11:37                 ` Marcelo Tosatti
  1 sibling, 0 replies; 133+ messages in thread
From: Marcelo Tosatti @ 2001-10-01 11:37 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Eric W. Biederman, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm



On Thu, 27 Sep 2001, Pavel Machek wrote:

> Hi!
> 
> > > > > So my suggestion was to look at getting anonymous pages backed by what
> > > > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > > > based data structure we can get the cost down under the current 8 bits
> > > > > per page that we have for the swap counts, and make allocating swap
> > > > > pages faster.  And we want to cluster related swap pages anyway so
> > > > > an extent based system is a natural fit.
> > > >
> > > > Much of this goes away if you get rid of both the swap and anonymous page
> > > > special cases. Back anonymous pages with the "whoops everything I write here
> > > > vanishes mysteriously" file system and swap with a swapfs
> > >
> > > What exactly is anonymous memory? I thought it is what you do when you
> > > want to malloc(), but you want to back that up by swap, not /dev/null.
> >
> > Anonymous memory is memory which is not backed by a filesystem or a
> > device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
> > will create anonymous memory as soon as the program which did the mmap
> > writes to the mapped memory (COW)), etc.
> 
> So... how can alan propose to back anonymous memory with /dev/null?

I guess he means anonymous memory backed up by /dev/null means anonymous
memory backed up by nothing.

> [see above] It should be backed by swap, no?

Not necessarily. As soon as we need to swap out anon memory, we have to
back it up by swap. (That's mm/vmscan.c:try_to_swap_out()'s job.)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-26 23:44               ` Pavel Machek
@ 2001-09-27 13:52                 ` Eric W. Biederman
  2001-10-01 11:37                 ` Marcelo Tosatti
  1 sibling, 0 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-27 13:52 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Marcelo Tosatti, Alan Cox, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm

Pavel Machek <pavel@suse.cz> writes:

> Hi!
> 
> > > > > So my suggestion was to look at getting anonymous pages backed by what
> > > > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > > > based data structure we can get the cost down under the current 8 bits
> > > > > per page that we have for the swap counts, and make allocating swap
> > > > > pages faster.  And we want to cluster related swap pages anyway so
> > > > > an extent based system is a natural fit.
> > > >
> > > > Much of this goes away if you get rid of both the swap and anonymous page
> > > > special cases. Back anonymous pages with the "whoops everything I write here
> > > > vanishes mysteriously" file system and swap with a swapfs
> > > 
> > > What exactly is anonymous memory? I thought it is what you do when you
> > > want to malloc(), but you want to back that up by swap, not /dev/null.
> > 
> > Anonymous memory is memory which is not backed by a filesystem or a
> > device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
> > will create anonymous memory as soon as the program which did the mmap
> > writes to the mapped memory (COW)), etc.
> 
> So... how can alan propose to back anonymous memory with /dev/null?
> [see above] It should be backed by swap, no?

He's not.  Alan, if I understand him correctly, is advocating removing
special cases and making it look like all pages are backed by something.
The /dev/nullfs is just used until swap is allocated for that page.

I don't agree with the exact details of what Alan envisions, but I do
agree with the basic idea...

Eric

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-26 18:22             ` Marcelo Tosatti
@ 2001-09-26 23:44               ` Pavel Machek
  2001-09-27 13:52                 ` Eric W. Biederman
  2001-10-01 11:37                 ` Marcelo Tosatti
  0 siblings, 2 replies; 133+ messages in thread
From: Pavel Machek @ 2001-09-26 23:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Alan Cox, Eric W. Biederman, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm

Hi!

> > > > So my suggestion was to look at getting anonymous pages backed by what
> > > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > > based data structure we can get the cost down under the current 8 bits
> > > > per page that we have for the swap counts, and make allocating swap
> > > > pages faster.  And we want to cluster related swap pages anyway so
> > > > an extent based system is a natural fit.
> > >
> > > Much of this goes away if you get rid of both the swap and anonymous page
> > > special cases. Back anonymous pages with the "whoops everything I write here
> > > vanishes mysteriously" file system and swap with a swapfs
> > 
> > What exactly is anonymous memory? I thought it is what you do when you
> > want to malloc(), but you want to back that up by swap, not /dev/null.
> 
> Anonymous memory is memory which is not backed by a filesystem or a
> device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
> will create anonymous memory as soon as the program which did the mmap
> writes to the mapped memory (COW)), etc.

So... how can alan propose to back anonymous memory with /dev/null?
[see above] It should be backed by swap, no?
								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-24 22:50           ` Pavel Machek
@ 2001-09-26 18:22             ` Marcelo Tosatti
  2001-09-26 23:44               ` Pavel Machek
  0 siblings, 1 reply; 133+ messages in thread
From: Marcelo Tosatti @ 2001-09-26 18:22 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Alan Cox, Eric W. Biederman, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm



On Tue, 25 Sep 2001, Pavel Machek wrote:

> Hi!
> 
> > > So my suggestion was to look at getting anonymous pages backed by what
> > > amounts to a shared memory segment.  In that vein.  By using an extent
> > > based data structure we can get the cost down under the current 8 bits
> > > per page that we have for the swap counts, and make allocating swap
> > > pages faster.  And we want to cluster related swap pages anyway so
> > > an extent based system is a natural fit.
> >
> > Much of this goes away if you get rid of both the swap and anonymous page
> > special cases. Back anonymous pages with the "whoops everything I write here
> > vanishes mysteriously" file system and swap with a swapfs
> 
> What exactly is anonymous memory? I thought it is what you do when you
> want to malloc(), but you want to back that up by swap, not /dev/null.

Anonymous memory is memory which is not backed by a filesystem or a
device. eg: malloc()ed memory, shmem, mmap(MAP_PRIVATE) on a file (which
will create anonymous memory as soon as the program which did the mmap
writes to the mapped memory (COW)), etc.
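The MAP_PRIVATE copy-on-write case can be observed from userspace (a small demo, Unix/Linux only, not kernel code): the first write to a private mapping gives the process its own anonymous copy of the page, and the file underneath never changes.

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"original")
    # MAP_PRIVATE: writes hit a copy-on-write, per-process page.
    m = mmap.mmap(fd, 8, flags=mmap.MAP_PRIVATE)
    m[:8] = b"modified"          # this page is now anonymous memory
    assert m[:8] == b"modified"  # our private copy changed
    with open(path, "rb") as f:
        assert f.read() == b"original"  # the file on disk did not
    m.close()
finally:
    os.close(fd)
    os.unlink(path)
```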


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-22  7:09                   ` Daniel Phillips
@ 2001-09-25 11:04                     ` Mike Fedyk
  0 siblings, 0 replies; 133+ messages in thread
From: Mike Fedyk @ 2001-09-25 11:04 UTC (permalink / raw)
  To: linux-kernel, linux-mm

On Sat, Sep 22, 2001 at 09:09:10AM +0200, Daniel Phillips wrote:
> On September 21, 2001 05:27 pm, Jan Harkes wrote:
> > On Fri, Sep 21, 2001 at 10:13:11AM +0200, Daniel Phillips wrote:
> > >   - small inactive list really means large active list (and vice versa)
> > >   - aging increments need to depend on the size of the active list
> > >   - "exponential" aging may be completely bogus
> > 
> > I don't think so, whenever there is sufficient memory pressure, the scan
> > of the active list is not only done by kswapd, but also by the page
> > allocations.
> > 
> > This does have the nice effect that with a large active list on a system
> > that has a working set that fits in memory, pages basically always age
> > up, and we get an automatic used-once/drop-behind behaviour for
> > streaming data because the age of these pages is relatively low.
> > 
> > As soon as the rate of new allocations increases to the point that
> > kswapd can't keep up (which happens if the number of cached used-once
> > pages is too small, or the working set expands so that it doesn't fit in
> > memory), the memory shortage causes all pages to aggressively get
> > aged down, pushing out the less frequently used pages of the working set.
> > 
> > Exponential down aging simply causes us to loop fewer times in
> > do_try_to_free_pages in such situations.
> 
> In such a situation that's a horribly inefficient way to accomplish this and 
> throws away a lot of valuable information.  Consider that we're doing nothing 
> but looping in the vm in this situation, so nobody gets a chance to touch 
> pages, so nothing gets aged up.  So we are really just deactivating all the 
> pages that lie below a given threshold.
> 
> Say that the threshold happens to be 16.  We loop through the active list 5 
> times and now we have not only deactivated the pages we needed but collapsed 
> all ages between 16 and 31 to the same value, and all ages between 32 and 63 
> to just two values, losing most of the relative weighting information.
> 
> Would it not make more sense to go through the active list once, deactivate 
> all pages with age less than some computed threshold, and subtract that 
> threshold from the rest?
> 
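The information-loss argument quoted above can be checked numerically (toy ages; assuming each down-aging pass halves the age): repeated halving collapses whole ranges of ages to the same value, while one subtract-and-clip pass preserves the relative ordering above the threshold.

```python
def age_down_exponential(ages, passes):
    """Repeated halving: one halving per trip through the active list."""
    for _ in range(passes):
        ages = [a // 2 for a in ages]
    return ages

def age_down_threshold(ages, threshold):
    """One pass: pages under the threshold drop to 0 (deactivate),
    the threshold is subtracted from everything else."""
    return [max(a - threshold, 0) for a in ages]

ages = [16, 20, 25, 31, 32, 40, 63]
# Five halvings collapse 16..31 to one value and 32..63 to another.
assert age_down_exponential(ages, 5) == [0, 0, 0, 0, 1, 1, 1]
# One subtractive pass keeps the ordering above the threshold intact.
assert age_down_threshold(ages, 16) == [0, 4, 9, 15, 16, 24, 47]
```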

If I understand the thread between Rik and the guy from FreeBSD (sorry,
I don't remember his name), what they do is keep a computed swap
threshold that rises as needed and doesn't modify the aging of any of
the pages.

So, if you have page ages of 5, 7, 15, 30 and 45, each loop through
do_try_to_free_pages will raise swap_thresh by some increment.

Looping through, you first get the pages at 5 and 7, then 15, until you
swap out enough.  While this is happening, you let normal referencing
modify the aging, not the act of swapping.

I know this is quite simplistic, but it may help.  What do you guys think?
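A minimal sketch of that scheme (my reading of it, with made-up ages): the threshold rises until enough pages fall under it, and surviving pages keep their ages untouched, so referencing, not eviction, drives the aging.

```python
def swap_out(ages, need):
    """Evict `need` pages by raising a threshold over a page->age map;
    ages of surviving pages are left completely alone."""
    evicted, thresh = [], 0
    while len(evicted) < need and ages:
        thresh += 1  # each trip through the free-pages loop raises it
        for page in [p for p, a in ages.items() if a < thresh]:
            if len(evicted) == need:
                break
            evicted.append(page)
            del ages[page]
    return evicted

ages = {"a": 5, "b": 7, "c": 15, "d": 30, "e": 45}
assert swap_out(ages, 2) == ["a", "b"]      # lowest ages go first
assert ages == {"c": 15, "d": 30, "e": 45}  # survivors' ages unchanged
```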

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
                             ` (2 preceding siblings ...)
  2001-09-20 11:28           ` Daniel Phillips
@ 2001-09-24 22:50           ` Pavel Machek
  2001-09-26 18:22             ` Marcelo Tosatti
  3 siblings, 1 reply; 133+ messages in thread
From: Pavel Machek @ 2001-09-24 22:50 UTC (permalink / raw)
  To: Alan Cox, Eric W. Biederman
  Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Hi!

> > So my suggestion was to look at getting anonymous pages backed by what
> > amounts to a shared memory segment.  In that vein.  By using an extent
> > based data structure we can get the cost down under the current 8 bits
> > per page that we have for the swap counts, and make allocating swap
> > pages faster.  And we want to cluster related swap pages anyway so
> > an extent based system is a natural fit.
> 
> Much of this goes away if you get rid of both the swap and anonymous page
> special cases. Back anonymous pages with the "whoops everything I write here
> vanishes mysteriously" file system and swap with a swapfs

What exactly is anonymous memory? I thought it is what you do when you
want to malloc(), but you want to back that up by swap, not /dev/null.

								Pavel
-- 
I'm pavel@ucw.cz. "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at discuss@linmodems.org

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-22 20:46 ` Jan Harkes
@ 2001-09-22 21:46   ` Peter Magnusson
  0 siblings, 0 replies; 133+ messages in thread
From: Peter Magnusson @ 2001-09-22 21:46 UTC (permalink / raw)
  To: Jan Harkes; +Cc: linux-kernel

On Sat, 22 Sep 2001, Jan Harkes wrote:

> Only when the wrong pages have been swapped out and need to be swapped
> in again. Swapped out pages should be relatively inactive, the amount of
> swap space allocated number that you see in 'free' is not the same as
> the amount of pages that are actually swapped out and removed from
> memory, any active ones should still be around in the pagecache.
>
> > Use the swap as little as possible == good.
>
> Nope, Use the swap-in as little as possible == good, I don't mind having
> 4GB of data in swap, as long as I typically don't need to load it back
> into memory. And every page that is in swap that I don't really need
> means another page that can be used to avoid paging out an executable,
> or purging the dentry lookup caches, or dropping one of those files I
> access once every 5 minutes.

I think we are talking about somewhat different things. You talk about
swapping in general. I'm talking about the changes from 2.4.7 to now
(2.4.9 for example) in the VM system that make Linux swap a lot.

In kernel 2.4.7 maybe 5 Mbyte was put on the swap over a week. It didn't
have any need to put more on the swap because I have 512 Mbyte RAM.

Then that changed... In for example 2.4.9 it puts a LOT on the swap
very fast, like 100-200 Mbyte, and then slowly swaps it back to RAM
when the pages are needed. If it didn't put it on the swap in the first
place and kept it in RAM like under 2.4.7, it would not have to swap it
back to RAM later and things would go faster. I know a lot of others who
are very annoyed about this and are complaining about it; it just
doesn't reach the linux-kernel mailing list.


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-22 19:59 Peter Magnusson
@ 2001-09-22 20:46 ` Jan Harkes
  2001-09-22 21:46   ` Peter Magnusson
  0 siblings, 1 reply; 133+ messages in thread
From: Jan Harkes @ 2001-09-22 20:46 UTC (permalink / raw)
  To: Peter Magnusson; +Cc: linux-kernel

On Sat, Sep 22, 2001 at 09:59:59PM +0200, Peter Magnusson wrote:
> Jan Harkes wrote:
> > Because pages aren't 'aged' until there is swap allocated for them, your
> > kernel should actually work better if it has a lot of pages backed by
> > swap. The only thing is, we don't really make the right decision about
> 
> It doesn't. It just gets slower.
> If it really became faster I would not have written my original posting.

Only when the wrong pages have been swapped out and need to be swapped
in again. Swapped-out pages should be relatively inactive; the amount of
allocated swap space that you see in 'free' is not the same as the
number of pages that are actually swapped out and removed from memory,
and any active ones should still be around in the pagecache.

> Use the swap as little as possible == good.

Nope: use the swap-in as little as possible == good. I don't mind having
4GB of data in swap, as long as I typically don't need to load it back
into memory. And every page that is in swap that I don't really need
means another page that can be used to avoid paging out an executable,
or purging the dentry lookup caches, or dropping one of those files I
access once every 5 minutes.

Jan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-22 19:59 Peter Magnusson
@ 2001-09-22 20:18 ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-22 20:18 UTC (permalink / raw)
  To: Peter Magnusson; +Cc: linux-kernel

On Sat, 22 Sep 2001, Peter Magnusson wrote:

> It treats the file system cache as just as important as normal programs,
> and that is very wrong. It's like this on all kernels over 2.4.7.

Nope, it was a bug in the page aging which caused the system
to treat file system cache as MORE important than programs.

I think I may have fixed that bug recently, I'm waiting for
Alan to run out of critical bugfixes so he has a suitable
moment to integrate it into -ac ;)

Until then, you can get the page aging patch from my home
page: http://www.surriel.com/patches/

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
@ 2001-09-22 19:59 Peter Magnusson
  2001-09-22 20:46 ` Jan Harkes
  0 siblings, 1 reply; 133+ messages in thread
From: Peter Magnusson @ 2001-09-22 19:59 UTC (permalink / raw)
  To: linux-kernel

Jan Harkes wrote:

> What do you consider as good VM?

When programs that aren't used are swapped out, or parts of them, like
it worked in < 2.4.7.

> Because pages aren't 'aged' until there is swap allocated for them, your
> kernel should actually work better if it has a lot of pages backed by
> swap. The only thing is, we don't really make the right decision about

It doesn't. It just gets slower.
If it really became faster I would not have written my original posting.

> which pages to swap out, but that's just a detail.
>
> IMHO. A large number of cached/active pages == good.

IMHO:

Use the swap as little as possible == good.
Do you think I have 512 Mbyte of RAM just because I want
the kernel to swap out lotsa stuff? No, because it shouldn't
have the need to swap stuff out.


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
@ 2001-09-22 19:59 Peter Magnusson
  2001-09-22 20:18 ` Rik van Riel
  0 siblings, 1 reply; 133+ messages in thread
From: Peter Magnusson @ 2001-09-22 19:59 UTC (permalink / raw)
  To: linux-kernel

On Sat, 15 Sep 2001, Linus Torvalds wrote:

> In article <Pine.LNX.4.33L2.0109160031500.7740-100000@flashdance>,
> Peter Magnusson  <iocc@flashdance.nothanksok.cx> wrote:
> >
> >2.4.10-pre4: quite ok VM, but put little more on the swap than 2.4.7
> >2.4.10-pre8: not good
>
> Ehh..
>
> There are _no_ VM changes that I can see between pre4 and pre8.
>
> >2.4.10-pre9: not good ... Linux didnt had used any swap at all, then i
> >             unrared two very large files at the same time. And now 104
> >             Mbyte swap is used! :-( 2.4.7 didnt do like this.
> >             Best is to use the swap as little as possible.
>
> .. and there are none between pre8 and pre9.
>
> Basically, it sounds like you have tested different loads on different
> kernels, and some loads are nice and others are not.

I guess I was just lucky when I used pre4 before.
I have rebooted to pre4 now and unrared exactly the same very big files
that I did in pre9. 70 Mbyte of swap is used and the box has just been
rebooted!

My guess:

It treats the file system cache as being as important as normal programs,
and that is very wrong. It's like this on all kernels after 2.4.7.

> Also note that the amount of "swap used" is totally meaningless in
> 2.4.x. The 2.4.x kernel will _allocate_ the swap backing store much
> earlier than 2.2.x, but that doesn't actually mean that it does any of
> the IO. Indeed, allocating the swap backing store just means that the

I go by what top (in the Swap used field) and xosview say. They haven't
lied to me yet. And it's hard to miss the slowdown when my computer
tries to swap out about 100 Mbyte.

The "SWAP" field in top lies, however, but that's not what I'm looking at.



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 15:27                 ` Jan Harkes
@ 2001-09-22  7:09                   ` Daniel Phillips
  2001-09-25 11:04                     ` Mike Fedyk
  0 siblings, 1 reply; 133+ messages in thread
From: Daniel Phillips @ 2001-09-22  7:09 UTC (permalink / raw)
  To: Jan Harkes
  Cc: Rik van Riel, Alan Cox, Eric W. Biederman, Rob Fuller,
	linux-kernel, linux-mm

On September 21, 2001 05:27 pm, Jan Harkes wrote:
> On Fri, Sep 21, 2001 at 10:13:11AM +0200, Daniel Phillips wrote:
> >   - small inactive list really means large active list (and vice versa)
> >   - aging increments need to depend on the size of the active list
> >   - "exponential" aging may be completely bogus
> 
> I don't think so; whenever there is sufficient memory pressure, the scan
> of the active list is done not only by kswapd, but also by the page
> allocations.
> 
> This does have the nice effect that with a large active list on a system
> that has a working set that fits in memory, pages basically always age
> up, and we get an automatic used-once/drop-behind behaviour for
> streaming data because the age of these pages is relatively low.
> 
> As soon as the rate of new allocations increases to the point that
> kswapd can't keep up, which happens if the number of cached used-once
> pages is too small or the working set expands so that it doesn't fit
> in memory, the memory shortage causes all pages to aggressively get
> aged down, pushing out the less frequently used pages of the working set.
> 
> Exponential down aging simply causes us to loop fewer times in
> do_try_to_free_pages in such situations.

In such a situation that's a horribly inefficient way to accomplish this,
and it throws away a lot of valuable information.  Consider that we're
doing nothing but looping in the vm in this situation, so nobody gets a
chance to touch pages, so nothing gets aged up.  So we are really just
deactivating all the pages that lie below a given threshold.

Say that the threshold happens to be 16.  We loop through the active list 5 
times and now we have not only deactivated the pages we needed but collapsed 
all ages between 16 and 31 to the same value, and all ages between 32 and 63 
to just two values, losing most of the relative weighting information.

Would it not make more sense to go through the active list once, deactivate 
all pages with age less than some computed threshold, and subtract that 
threshold from the rest?

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-22  2:14             ` Alexander Viro
@ 2001-09-22  3:09               ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-22  3:09 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Eric W. Biederman, Alan Cox, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm

On Fri, 21 Sep 2001, Alexander Viro wrote:

> It means that you prefer system dying under much lighter load.  At
> some point any box will get into feedback loop,

> The question being, at which point will it happen and how graceful
> will the degradation be when we get near that point.

And ... what do we do when we reach that point ?

It's obvious that we need load control to make the machine
survive at that point; load control is a horrible measure
which will make interactivity very bad, but will cause the
box to survive where otherwise it would be thrashing.

Having a better paging system would mean pushing the 'thrashing
point' (where we need to kick in load control) much further
out and being able to keep the system behaving better under
heavier VM loads.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:23           ` Eric W. Biederman
  2001-09-21 12:01             ` Rik van Riel
@ 2001-09-22  2:14             ` Alexander Viro
  2001-09-22  3:09               ` Rik van Riel
  1 sibling, 1 reply; 133+ messages in thread
From: Alexander Viro @ 2001-09-22  2:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Rik van Riel, Alan Cox, Daniel Phillips, Rob Fuller,
	linux-kernel, linux-mm



On 21 Sep 2001, Eric W. Biederman wrote:

> Swapping is an important case.  But 9 times out of 10 you are managing
> memory in caches, and throwing unused pages into swap.  You aren't busily
> paging the data back and forth.  But if I have to make a choice in
> what kind of situation I want to take a performance hit, paging
> approaching thrashing or a system whose working set size is well
> within RAM.  I'd rather take the hit in the system that is paging.

It means that you prefer the system dying under much lighter load.  At
some point any box will get into a feedback loop, when slowdown from VM
load will make request handling slower, which will make the temporary
allocations needed to handle these requests be kept around for longer
periods, which will contribute to VM load.  The question being, at which
point will it happen and how graceful will the degradation be when we
get near that point.


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:13               ` Daniel Phillips
  2001-09-21 12:10                 ` Rik van Riel
@ 2001-09-21 15:27                 ` Jan Harkes
  2001-09-22  7:09                   ` Daniel Phillips
  1 sibling, 1 reply; 133+ messages in thread
From: Jan Harkes @ 2001-09-21 15:27 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Rik van Riel, Alan Cox, Eric W. Biederman, Rob Fuller,
	linux-kernel, linux-mm

On Fri, Sep 21, 2001 at 10:13:11AM +0200, Daniel Phillips wrote:
>   - small inactive list really means large active list (and vice versa)
>   - aging increments need to depend on the size of the active list
>   - "exponential" aging may be completely bogus

I don't think so; whenever there is sufficient memory pressure, the scan
of the active list is done not only by kswapd, but also by the page
allocations.

This does have the nice effect that with a large active list on a system
that has a working set that fits in memory, pages basically always age
up, and we get an automatic used-once/drop-behind behaviour for
streaming data because the age of these pages is relatively low.

As soon as the rate of new allocations increases to the point that
kswapd can't keep up, which happens if the number of cached used-once
pages is too small or the working set expands so that it doesn't fit in
memory, the memory shortage causes all pages to aggressively get
aged down, pushing out the less frequently used pages of the working set.

Exponential down aging simply causes us to loop fewer times in
do_try_to_free_pages in such situations.

Jan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21 14:29           ` Gábor Lénárt
@ 2001-09-21 14:35             ` Horst von Brand
  0 siblings, 0 replies; 133+ messages in thread
From: Horst von Brand @ 2001-09-21 14:35 UTC (permalink / raw)
  To: lgb; +Cc: linux-kernel

=?iso-8859-2?B?R+Fib3IgTOlu4XJ0?= <lgb@lgb.hu> said:

[...]

> Maybe not, since I'm not using swap :) The rule is (well, at least it
> was ...) for 2.4.x desktop systems: buy 256Mb of RAM (and disable
> swapping altogether), it's cheap ... and after that you will be able to
> use 2.4.x instead of 2.2.x quite well.

Strange... I've been running 2.4's for a long time (mostly -ac ones) on
64Mb and 128Mb. No problems.

[If you have a laptop, the extra RAM is not "cheap"...]
-- 
Dr. Horst H. von Brand                Usuario #22616 counter.li.org
Departamento de Informatica                     Fono: +56 32 654431
Universidad Tecnica Federico Santa Maria              +56 32 654239
Casilla 110-V, Valparaiso, Chile                Fax:  +56 32 797513

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 23:00         ` Rik van Riel
  2001-09-21  8:23           ` Eric W. Biederman
@ 2001-09-21 14:29           ` Gábor Lénárt
  2001-09-21 14:35             ` Horst von Brand
  1 sibling, 1 reply; 133+ messages in thread
From: Gábor Lénárt @ 2001-09-21 14:29 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel

On Wed, Sep 19, 2001 at 08:00:44PM -0300, Rik van Riel wrote:
> On 19 Sep 2001, Eric W. Biederman wrote:
> 
> > That added to the fact that last time someone ran the numbers linux
> > was considerably faster than the BSD for mm type operations when not
> > swapping.  And this is the common case.
> 
> Optimising the VM for not swapping sounds kind of like
> optimising your system for doing empty fork()/exec()/exit()
> loops ;)

Maybe not, since I'm not using swap :) The rule is (well, at least it
was ...) for 2.4.x desktop systems: buy 256Mb of RAM (and disable
swapping altogether), it's cheap ... and after that you will be able to
use 2.4.x instead of 2.2.x quite well.

-- 
 --[ Gábor Lénárt ]---[ Vivendi Telecom Hungary ]---------[ lgb@lgb.hu ]--
 U have 8 bit comp or chip of them and it's unused or to be sold? Call me!
 -------[ +36 30 2270823 ]------> LGB <-----[ Linux/UNIX/8bit 4ever ]-----

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:13               ` Daniel Phillips
@ 2001-09-21 12:10                 ` Rik van Riel
  2001-09-21 15:27                 ` Jan Harkes
  1 sibling, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-21 12:10 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

On Fri, 21 Sep 2001, Daniel Phillips wrote:

> Have you tried making the down increment larger and the up increment
> smaller when the active list is larger?

This would make the page age of pages referenced in the page
tables smaller, not larger. And we already know that decreasing
the page age of heavily referenced pages isn't good.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-21  8:23           ` Eric W. Biederman
@ 2001-09-21 12:01             ` Rik van Riel
  2001-09-22  2:14             ` Alexander Viro
  1 sibling, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-21 12:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

On 21 Sep 2001, Eric W. Biederman wrote:

> Swapping is an important case.  But 9 times out of 10 you are managing
> memory in caches, and throwing unused pages into swap.  You aren't
> busily paging the data back and forth.  But if I have to make a choice
> in what kind of situation I want to take a performance hit, paging
> approaching thrashing or a system whose working set size is well
> within RAM.  I'd rather take the hit in the system that is paging.

> Besides I also like to run a lot of shell scripts, which again stress
> the fork()/exec()/exit() path.
>
> So no I don't think keeping those paths fast is silly.

Absolutely agreed.

Ben and I have already been thinking a bit about memory
objects, so we have both reverse mappings AND we can skip
copying the page tables at fork() time (needing to clear
less at the subsequent exec(), too) ...

Of course this means I'll throw away my pte-based reverse
mapping code and will look at an object-based reverse mapping
scheme like Ben made for 2.1 and DaveM made for 2.3 ;)

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 23:00         ` Rik van Riel
@ 2001-09-21  8:23           ` Eric W. Biederman
  2001-09-21 12:01             ` Rik van Riel
  2001-09-22  2:14             ` Alexander Viro
  2001-09-21 14:29           ` Gábor Lénárt
  1 sibling, 2 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-21  8:23 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 19 Sep 2001, Eric W. Biederman wrote:
> 
> > That added to the fact that last time someone ran the numbers linux
> > was considerably faster than the BSD for mm type operations when not
> > swapping.  And this is the common case.
> 
> Optimising the VM for not swapping sounds kind of like
> optimising your system for doing empty fork()/exec()/exit()
> loops ;)

Swapping is an important case.  But 9 times out of 10 you are managing
memory in caches, and throwing unused pages into swap.  You aren't busily
paging the data back and forth.  But if I have to make a choice in
what kind of situation I want to take a performance hit, paging
approaching thrashing or a system whose working set size is well
within RAM.  I'd rather take the hit in the system that is paging.

Further, fast IPC + fork()/exec()/exit() that programmers can count on
leads to more robust programs, because different pieces of the program
can live in different processes.  One of the reasons for the stability
of unix is that it has always had a firewall between its processes, so
one bad pointer will not bring down the entire system.

Besides I also like to run a lot of shell scripts, which again stress
the fork()/exec()/exit() path.

So no I don't think keeping those paths fast is silly.

I also think that being able to get good memory usage information is
important.  I know that reverse maps make that job easier.  But just
because they make an important case easier to get right, I don't think
reverse maps are a shoo-in.

Eric





^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-20 12:06             ` Rik van Riel
@ 2001-09-21  8:13               ` Daniel Phillips
  2001-09-21 12:10                 ` Rik van Riel
  2001-09-21 15:27                 ` Jan Harkes
  0 siblings, 2 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-21  8:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

> That still doesn't mean we can't _approximate_ aging in
> another way. With linear page aging (3 up, 1 down) the
> page ages of pages referenced only in the page tables
> will still go up, albeit a tad slower than expected.
> 
> It's exponential aging which makes the page age go into
> the other direction, with linear aging things seem to
> work again.
> 
> I've done some experiments recently and found that (with
> reverse mappings) exponential aging is faster when we have
> a small inactive list and linear aging is faster when we
> have a large inactive list.

Have you tried making the down increment larger and the up increment smaller 
when the active list is larger?  This has a natural interpretation: when the 
active list is large the scanning period is longer.  During this longer scan 
period an active page *should* be more likely to have its ref bit set, so it 
gets a smaller boost if it is.  If not we should penalize it more heavily.

There are three points here:

  - small inactive list really means large active list (and vice versa)
  - aging increments need to depend on the size of the active list
  - "exponential" aging may be completely bogus

> This means we need linear page aging with a large inactive
> list in order to let the page ages move in the right
> direction when we run a system without reverse mapping;
> the patch for that was sent to Alan yesterday.

So, the question is, does my suggestion produce essentially the same 
beneficial effect?  And by the way, what are your test cases?  I'd like to 
see if I can reproduce your results here.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-20 12:57             ` Alan Cox
@ 2001-09-20 13:40               ` Daniel Phillips
  0 siblings, 0 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-20 13:40 UTC (permalink / raw)
  To: Alan Cox; +Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

On September 20, 2001 02:57 pm, Alan Cox wrote:
> > On September 20, 2001 12:04 am, Alan Cox wrote:
> > > Reverse mappings make linear aging easier to do but are not critical (we
> > > can walk all physical pages via the page map array). 
> > 
> > But you can't pick up the referenced bit that way, so no up aging, only
> > down.
> 
> #1 If you really wanted to you could update a referenced bit in the page
> struct in the fault handling path.

Right, we probably should do that.  But consider that any time this happens a 
reverse map would have eliminated the fault because we wouldn't need to unmap 
the page until we're actually going to free it.

> #2 If a page is referenced multiple times by different processes, is the
> behaviour of multiple upward aging actually wrong?

With rmap it's easy to do it either way: either treat the ref bits as if 
they're all or'd together or, perhaps more sensibly, age up by an amount that 
depends on the number of ref bits set, but not as much as UP_AGE * refs.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:55       ` David S. Miller
@ 2001-09-20 13:02         ` Rik van Riel
  0 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-20 13:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: ebiederm, alan, phillips, rfuller, linux-kernel, linux-mm

On Wed, 19 Sep 2001, David S. Miller wrote:

> My own personal feeling, after having tried to implement a much
> lighter weight scheme involving "anon areas", is that reverse maps or
> something similar should be looked at as a last-ditch effort.
>
> We are tons faster than anyone else in fork/exec/exit precisely
> because we keep track of so little state for anonymous pages.

Thinking about this some more, it would seem that the
"perfect fork()" would be one where you DON'T copy the
page tables, but only set the parent's page tables to
read-only and point the VMAs of the child at some kind
of memory objects.

For example, for file-backed VMAs we might already skip
the page table copying right now.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-20 11:28           ` Daniel Phillips
  2001-09-20 12:06             ` Rik van Riel
@ 2001-09-20 12:57             ` Alan Cox
  2001-09-20 13:40               ` Daniel Phillips
  1 sibling, 1 reply; 133+ messages in thread
From: Alan Cox @ 2001-09-20 12:57 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

> On September 20, 2001 12:04 am, Alan Cox wrote:
> > Reverse mappings make linear aging easier to do but are not critical (we
> > can walk all physical pages via the page map array). 
> 
> But you can't pick up the referenced bit that way, so no up aging, only
> down.

#1 If you really wanted to you could update a referenced bit in the page
struct in the fault handling path.

#2 If a page is referenced multiple times by different processes, is the
behaviour of multiple upward aging actually wrong?

Alan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-20 11:28           ` Daniel Phillips
@ 2001-09-20 12:06             ` Rik van Riel
  2001-09-21  8:13               ` Daniel Phillips
  2001-09-20 12:57             ` Alan Cox
  1 sibling, 1 reply; 133+ messages in thread
From: Rik van Riel @ 2001-09-20 12:06 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Alan Cox, Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

On Thu, 20 Sep 2001, Daniel Phillips wrote:
> On September 20, 2001 12:04 am, Alan Cox wrote:
> > Reverse mappings make linear aging easier to do but are not critical (we
> > can walk all physical pages via the page map array).
>
> But you can't pick up the referenced bit that way, so no up aging,
> only down.

That still doesn't mean we can't _approximate_ aging in
another way. With linear page aging (3 up, 1 down) the
page ages of pages referenced only in the page tables
will still go up, albeit a tad slower than expected.

It's exponential aging which makes the page age go into
the other direction, with linear aging things seem to
work again.

I've done some experiments recently and found that (with
reverse mappings) exponential aging is faster when we have
a small inactive list and linear aging is faster when we
have a large inactive list.

This means we need linear page aging with a large inactive
list in order to let the page ages move in the right
direction when we run a system without reverse mapping;
the patch for that was sent to Alan yesterday.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
  2001-09-19 22:26           ` Eric W. Biederman
  2001-09-19 23:05           ` Rik van Riel
@ 2001-09-20 11:28           ` Daniel Phillips
  2001-09-20 12:06             ` Rik van Riel
  2001-09-20 12:57             ` Alan Cox
  2001-09-24 22:50           ` Pavel Machek
  3 siblings, 2 replies; 133+ messages in thread
From: Daniel Phillips @ 2001-09-20 11:28 UTC (permalink / raw)
  To: Alan Cox, Eric W. Biederman; +Cc: Alan Cox, Rob Fuller, linux-kernel, linux-mm

On September 20, 2001 12:04 am, Alan Cox wrote:
> Reverse mappings make linear aging easier to do but are not critical (we
> can walk all physical pages via the page map array). 

But you can't pick up the referenced bit that way, so no up aging, only
down.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-20  3:16 ` GOTO Masanori
@ 2001-09-20  7:38   ` Christoph Hellwig
  0 siblings, 0 replies; 133+ messages in thread
From: Christoph Hellwig @ 2001-09-20  7:38 UTC (permalink / raw)
  To: GOTO Masanori; +Cc: rfuller, linux-kernel

On Thu, Sep 20, 2001 at 12:16:56PM +0900, GOTO Masanori wrote:
> At Thu, 20 Sep 2001 00:26:05 +0200,
> Christoph Hellwig <hch@ns.caldera.de> wrote:
> > > "One argument for reverse mappings is distributed shared memory or
> > > distributed file systems and their interaction with memory mapped files.
> > > For example, a distributed file system may need to invalidate a specific
> > > page of a file that may be mapped multiple times on a node."
> > 
> > Please take a look at zap_inode_mappings in -ac.
> 
> zap_inode_mapping ?

Yes.

> 
> Currently it only invalidates a whole mapping, but we can easily add
> offset and length (and probably will).
> 
Adding offset/length support is good.
Did you mean replacing zap_inode_mapping with more specific variants like
zap_inode_mapping_whole and zap_inode_mapping_range?

That depends.  For 2.4 at least we (the OpenGFS Project) probably won't
need it - and I haven't heard from other groups that need such
functionality yet.

	Christoph

-- 
Of course it doesn't work. We've performed a software upgrade.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 Rob Fuller
                   ` (4 preceding siblings ...)
  2001-09-19 22:51 ` Bryan O'Sullivan
@ 2001-09-20  3:16 ` GOTO Masanori
  2001-09-20  7:38   ` Christoph Hellwig
  5 siblings, 1 reply; 133+ messages in thread
From: GOTO Masanori @ 2001-09-20  3:16 UTC (permalink / raw)
  To: hch; +Cc: rfuller, linux-kernel

At Thu, 20 Sep 2001 00:26:05 +0200,
Christoph Hellwig <hch@ns.caldera.de> wrote:
> > "One argument for reverse mappings is distributed shared memory or
> > distributed file systems and their interaction with memory mapped files.
> > For example, a distributed file system may need to invalidate a specific
> > page of a file that may be mapped multiple times on a node."
> 
> Please take a look at zap_inode_mappings in -ac.

zap_inode_mapping ?

> Currently it only invalidates a whole mapping, but we can easily add
> offset and length (and probably will).

Adding offset/length support is good.
Did you mean replacing zap_inode_mapping with more specific variants like
zap_inode_mapping_whole and zap_inode_mapping_range?

-- gotom

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
  2001-09-19 22:26           ` Eric W. Biederman
@ 2001-09-19 23:05           ` Rik van Riel
  2001-09-20 11:28           ` Daniel Phillips
  2001-09-24 22:50           ` Pavel Machek
  3 siblings, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-19 23:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Eric W. Biederman, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

On Wed, 19 Sep 2001, Alan Cox wrote:

> "Linux VM works wonderfully when nobody is using it"

"This OS is optimised for lmbench"


cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:03       ` Eric W. Biederman
  2001-09-19 22:04         ` Alan Cox
@ 2001-09-19 23:00         ` Rik van Riel
  2001-09-21  8:23           ` Eric W. Biederman
  2001-09-21 14:29           ` Gábor Lénárt
  1 sibling, 2 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-19 23:00 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

On 19 Sep 2001, Eric W. Biederman wrote:

> That added to the fact that last time someone ran the numbers linux
> was considerably faster than the BSD for mm type operations when not
> swapping.  And this is the common case.

Optimising the VM for not swapping sounds kind of like
optimising your system for doing empty fork()/exec()/exit()
loops ;)

cheers,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 Rob Fuller
                   ` (3 preceding siblings ...)
  2001-09-19 22:48 ` Eric W. Biederman
@ 2001-09-19 22:51 ` Bryan O'Sullivan
  2001-09-20  3:16 ` GOTO Masanori
  5 siblings, 0 replies; 133+ messages in thread
From: Bryan O'Sullivan @ 2001-09-19 22:51 UTC (permalink / raw)
  To: Rob Fuller; +Cc: linux-kernel, linux-mm

r> I believe reverse mappings are an essential feature for memory
r> mapped files in order for Linux to support sophisticated
r> distributed file systems or distributed shared memory.

You already have the needed mechanisms for memory-mapped files in the
distributed FS case.  Distributed shared memory is much less
convincing, as DSM types have their heads irretrievably stuck up their
ar^Hcademia.

        <b

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 Rob Fuller
                   ` (2 preceding siblings ...)
  2001-09-19 22:30 ` Alan Cox
@ 2001-09-19 22:48 ` Eric W. Biederman
  2001-09-19 22:51 ` Bryan O'Sullivan
  2001-09-20  3:16 ` GOTO Masanori
  5 siblings, 0 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-19 22:48 UTC (permalink / raw)
  To: Rob Fuller; +Cc: David S. Miller, alan, phillips, linux-kernel, linux-mm

"Rob Fuller" <rfuller@nsisoftware.com> writes:

> In my one contribution to this thread I wrote:
> 
> "One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped files.
> For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node."
> 
> I believe reverse mappings are an essential feature for memory mapped
> files in order for Linux to support sophisticated distributed file
> systems or distributed shared memory.  In general, this memory is NOT
> anonymous.  As such, it should not affect the performance of a
> fork/exec/exit.
> 
> I suppose I confused the issue when I offered a supporting argument for
> reverse mappings.  It's not reverse mappings for anonymous pages I'm
> advocating, but reverse mappings for mapped file data.

The reverse mapping issue is not whether we have a way to find where in the
page tables a page is mapped, but whether we keep track of it in a data
structure that allows us to do so extremely quickly.  The worst case for our
current data structures to unmap one page is O(page mappings).

For distributed filesystems, contention sucks.  No matter how you play it,
contention for file data will never be a fast case, not if you have
very many people contending for the data.  So this isn't a fast path.

Additionally, our current data structures are optimized for unmapping
page ranges.  If your contention case is sane you will be grabbing more
than 4K at a time, so looping through the vm_areas of a mapping once
should be more efficient than repeating that loop for each page that
needs to be unmapped.

Eric





^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 Rob Fuller
  2001-09-19 22:21 ` David S. Miller
  2001-09-19 22:26 ` Christoph Hellwig
@ 2001-09-19 22:30 ` Alan Cox
  2001-09-19 22:48 ` Eric W. Biederman
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 133+ messages in thread
From: Alan Cox @ 2001-09-19 22:30 UTC (permalink / raw)
  To: Rob Fuller
  Cc: David S. Miller, ebiederm, alan, phillips, linux-kernel, linux-mm

> "One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped files.
> For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node."

Wouldn't it be better for the file system itself to be doing that work?
Also, do real-world file systems that actually perform usably do this, or
do they just zap the cached mappings like OpenGFS does?

Alan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:04         ` Alan Cox
@ 2001-09-19 22:26           ` Eric W. Biederman
  2001-09-19 23:05           ` Rik van Riel
                             ` (2 subsequent siblings)
  3 siblings, 0 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-19 22:26 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> Much of this goes away if you get rid of both the swap and anonymous page
> special cases. Back anonymous pages with the "whoops everything I write here
> vanishes mysteriously" file system and swap with a swapfs

Essentially.  Though that is just the strategy; it doesn't cut to the heart
of the problems that need to be addressed.  The trickiest part is allocating
persistent IDs to the pages in a way that doesn't require us to fragment the
VMAs.

> Reverse mappings make linear aging easier to do but are not critical (we
> can walk all physical pages via the page map array). 

Agreed.  

What I find interesting about the 2.4.x VM is that most of the large
problems people have seen were not stupid design mistakes in the VM
but small interaction glitches between various pieces of code.

Eric

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 Rob Fuller
  2001-09-19 22:21 ` David S. Miller
@ 2001-09-19 22:26 ` Christoph Hellwig
  2001-09-19 22:30 ` Alan Cox
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 133+ messages in thread
From: Christoph Hellwig @ 2001-09-19 22:26 UTC (permalink / raw)
  To: "Rob Fuller"; +Cc: linux-kernel

In article <878A2048A35CD141AD5FC92C6B776E4907B7A5@xchgind02.nsisw.com> you wrote:
> In my one contribution to this thread I wrote:
>
> "One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped files.
> For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node."

Please take a look at zap_inode_mappings in -ac.
Currently it only invalidates a whole mapping, but we can easily add
offset and length (and probably will).

	Christoph

-- 
Whip me.  Beat me.  Make me maintain AIX.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 22:15 Rob Fuller
@ 2001-09-19 22:21 ` David S. Miller
  2001-09-19 22:26 ` Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 133+ messages in thread
From: David S. Miller @ 2001-09-19 22:21 UTC (permalink / raw)
  To: rfuller; +Cc: ebiederm, alan, phillips, linux-kernel, linux-mm

   From: "Rob Fuller" <rfuller@nsisoftware.com>
   Date: Wed, 19 Sep 2001 17:15:21 -0500
   
   I suppose I confused the issue when I offered a supporting argument for
   reverse mappings.  It's not reverse mappings for anonymous pages I'm
   advocating, but reverse mappings for mapped file data.

We already have reverse mappings for files, via the VMA chain off the
inode.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* RE: broken VM in 2.4.10-pre9
@ 2001-09-19 22:15 Rob Fuller
  2001-09-19 22:21 ` David S. Miller
                   ` (5 more replies)
  0 siblings, 6 replies; 133+ messages in thread
From: Rob Fuller @ 2001-09-19 22:15 UTC (permalink / raw)
  To: David S. Miller, ebiederm; +Cc: alan, phillips, linux-kernel, linux-mm

In my one contribution to this thread I wrote:

"One argument for reverse mappings is distributed shared memory or
distributed file systems and their interaction with memory mapped files.
For example, a distributed file system may need to invalidate a specific
page of a file that may be mapped multiple times on a node."

I believe reverse mappings are an essential feature for memory mapped
files in order for Linux to support sophisticated distributed file
systems or distributed shared memory.  In general, this memory is NOT
anonymous.  As such, it should not affect the performance of a
fork/exec/exit.

I suppose I confused the issue when I offered a supporting argument for
reverse mappings.  It's not reverse mappings for anonymous pages I'm
advocating, but reverse mappings for mapped file data.

> -----Original Message-----
> From: David S. Miller [mailto:davem@redhat.com]
> Sent: Wednesday, September 19, 2001 4:56 PM
> To: ebiederm@xmission.com
> Cc: alan@lxorguk.ukuu.org.uk; phillips@bonn-fries.net; Rob Fuller;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: broken VM in 2.4.10-pre9
> 
> 
>    From: ebiederm@xmission.com (Eric W. Biederman)
>    Date: 19 Sep 2001 15:37:26 -0600
>    
>    That I think is a significant cost.
> 
> My own personal feeling, after having tried to implement a much
> lighter-weight scheme involving "anon areas", is that reverse maps or
> something similar should be looked at as a last-ditch effort.
> 
> We are tons faster than anyone else in fork/exec/exit precisely
> because we keep track of so little state for anonymous pages.
> 
> Later,
> David S. Miller
> davem@redhat.com
> 

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 21:03       ` Eric W. Biederman
@ 2001-09-19 22:04         ` Alan Cox
  2001-09-19 22:26           ` Eric W. Biederman
                             ` (3 more replies)
  2001-09-19 23:00         ` Rik van Riel
  1 sibling, 4 replies; 133+ messages in thread
From: Alan Cox @ 2001-09-19 22:04 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alan Cox, Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

> Add to that the fact that the last time someone ran the numbers, Linux
> was considerably faster than the BSDs for mm-type operations when not
> swapping.  And this is the common case.

"Linux VM works wonderfully when nobody is using it" 

Which is rather like saying the scheduler works well for one task, but by
three tasks it is making bad decisions.

> But I have not seen an argument that not having reverse maps makes it
> undoable.  In fact, previous versions of Linux seem to prove
> that you can get at least reasonable swapping under load without
> reverse page tables.

The last decent Linux VM behaviour was about 2.1.100 or so - which was
without reverse maps.  It's been downhill since then.  So yes, you may be
right.

> So my suggestion was to look at getting anonymous pages backed by what
> amounts to a shared memory segment.  In that vein, by using an extent-based
> data structure we can get the cost down under the current 8 bits
> per page that we have for the swap counts, and make allocating swap
> pages faster.  And we want to cluster related swap pages anyway, so
> an extent-based system is a natural fit.

Much of this goes away if you get rid of both the swap and anonymous page
special cases. Back anonymous pages with the "whoops everything I write here
vanishes mysteriously" file system and swap with a swapfs

Reverse mappings make linear aging easier to do but are not critical (we
can walk all physical pages via the page map array). 

Alan

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 19:45     ` Alan Cox
  2001-09-19 21:03       ` Eric W. Biederman
  2001-09-19 21:37       ` Eric W. Biederman
@ 2001-09-19 21:55       ` David S. Miller
  2001-09-20 13:02         ` Rik van Riel
  2 siblings, 1 reply; 133+ messages in thread
From: David S. Miller @ 2001-09-19 21:55 UTC (permalink / raw)
  To: ebiederm; +Cc: alan, phillips, rfuller, linux-kernel, linux-mm

   From: ebiederm@xmission.com (Eric W. Biederman)
   Date: 19 Sep 2001 15:37:26 -0600
   
   That I think is a significant cost.

My own personal feeling, after having tried to implement a much
lighter-weight scheme involving "anon areas", is that reverse maps or
something similar should be looked at as a last-ditch effort.

We are tons faster than anyone else in fork/exec/exit precisely
because we keep track of so little state for anonymous pages.

Later,
David S. Miller
davem@redhat.com

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 19:45     ` Alan Cox
  2001-09-19 21:03       ` Eric W. Biederman
@ 2001-09-19 21:37       ` Eric W. Biederman
  2001-09-19 21:55       ` David S. Miller
  2 siblings, 0 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-19 21:37 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> > > In linux we have avoided reverse maps (unlike the BSD's) which tends
> > > to make the common case fast at the expense of making it more
> > > difficult to handle times when the VM system is under extreme load and
> > > we are swapping etc.
> > 
> > What do you suppose is the cost of the reverse map?  I get the impression you
> 
> > think it's more expensive than it is.
> 
> We can keep the typical page table cost lower than now (including reverse
> maps) just by doing some common sense small cleanups to get the page struct
> down to 48 bytes on x86

While there is a size cost, I suspect you will notice reverse maps
a lot more in operations like fork, where having them triples the amount
of memory that you need to copy.  So you should see a doubling or more
in the time it takes to do a fork.

That I think is a significant cost.

Eric


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19 19:45     ` Alan Cox
@ 2001-09-19 21:03       ` Eric W. Biederman
  2001-09-19 22:04         ` Alan Cox
  2001-09-19 23:00         ` Rik van Riel
  2001-09-19 21:37       ` Eric W. Biederman
  2001-09-19 21:55       ` David S. Miller
  2 siblings, 2 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-19 21:03 UTC (permalink / raw)
  To: Alan Cox; +Cc: Daniel Phillips, Rob Fuller, linux-kernel, linux-mm

Alan Cox <alan@lxorguk.ukuu.org.uk> writes:

> > On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> > > In linux we have avoided reverse maps (unlike the BSD's) which tends
> > > to make the common case fast at the expense of making it more
> > > difficult to handle times when the VM system is under extreme load and
> > > we are swapping etc.
> > 
> > What do you suppose is the cost of the reverse map?  I get the impression you
> 
> > think it's more expensive than it is.
> 
> We can keep the typical page table cost lower than now (including reverse
> maps) just by doing some common sense small cleanups to get the page struct
> down to 48 bytes on x86

I have to admit that the first time I looked at reverse maps our struct page
was much lighter weight than now (64 bytes on x86 UP), and our cost per
page was noticeably fewer bytes than the BSDs': average_mem_per_page =
sizeof(struct page) + sizeof(pte_t) + sizeof(reverse_pte_t)*average_user_per_page.
But struct page has grown pretty significantly since then, and could
use a cleanup.

So I figure it is worth going through and computing the costs of
reverse page tables rather than dismissing them out of hand.  But the
fact that the Linux VM could get good performance in most
circumstances without reverse page tables has always enchanted me.

Add to that the fact that the last time someone ran the numbers, Linux
was considerably faster than the BSDs for mm-type operations when not
swapping.  And this is the common case.

I admit reverse page tables make it easier to get good paging performance
under high load, as the algorithms are more straightforward.
But I have not seen an argument that not having reverse maps makes it
undoable.  In fact, previous versions of Linux seem to prove
that you can get at least reasonable swapping under load without
reverse page tables.

There is also the cache thrashing case.  While scanning page table
entries it is probably impossible to prevent cache thrashing, and
reverse page tables look like they make it worse.

With respect to the current VM the primary complaint I have heard is
that anonymous pages are not in the page cache so cannot be aged.  At
least that was the complaint that started this thread.  For adding
pages to the page cache we currently have conflicting tensions.  Do we
want it in the page cache to age better or do we not want to allocate
the swap space yet?

So my suggestion was to look at getting anonymous pages backed by what
amounts to a shared memory segment.  In that vein, by using an extent-based
data structure we can get the cost down under the current 8 bits
per page that we have for the swap counts, and make allocating swap
pages faster.  And we want to cluster related swap pages anyway, so
an extent-based system is a natural fit.

If we lose the requirement that swapped-out pages need to be in the
page tables, it becomes trivial to drop page tables with all of
their pages swapped out.  Plus there are a million other special cases
we can remove from the current VM.

So right now I can see a bigger benefit from anonymous pages with a
``backing store'' than I can from reverse maps.

Eric




^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-19  9:45   ` Daniel Phillips
@ 2001-09-19 19:45     ` Alan Cox
  2001-09-19 21:03       ` Eric W. Biederman
                         ` (2 more replies)
  0 siblings, 3 replies; 133+ messages in thread
From: Alan Cox @ 2001-09-19 19:45 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: Eric W. Biederman, Rob Fuller, linux-kernel, linux-mm

> On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> > In linux we have avoided reverse maps (unlike the BSD's) which tends
> > to make the common case fast at the expense of making it more
> > difficult to handle times when the VM system is under extreme load and
> > we are swapping etc.
> 
> What do you suppose is the cost of the reverse map?  I get the impression you 
> think it's more expensive than it is.

We can keep the typical page table cost lower than now (including reverse
maps) just by doing some common sense small cleanups to get the page struct
down to 48 bytes on x86

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17 16:03 ` Eric W. Biederman
@ 2001-09-19  9:45   ` Daniel Phillips
  2001-09-19 19:45     ` Alan Cox
  0 siblings, 1 reply; 133+ messages in thread
From: Daniel Phillips @ 2001-09-19  9:45 UTC (permalink / raw)
  To: Eric W. Biederman, Rob Fuller; +Cc: linux-kernel, linux-mm

On September 17, 2001 06:03 pm, Eric W. Biederman wrote:
> In linux we have avoided reverse maps (unlike the BSD's) which tends
> to make the common case fast at the expense of making it more
> difficult to handle times when the VM system is under extreme load and
> we are swapping etc.

What do you suppose is the cost of the reverse map?  I get the impression you 
think it's more expensive than it is.

--
Daniel

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17 15:40 Rob Fuller
@ 2001-09-17 16:03 ` Eric W. Biederman
  2001-09-19  9:45   ` Daniel Phillips
  0 siblings, 1 reply; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-17 16:03 UTC (permalink / raw)
  To: Rob Fuller; +Cc: linux-kernel, linux-mm

"Rob Fuller" <rfuller@nsisoftware.com> writes:

> One argument for reverse mappings is distributed shared memory or
> distributed file systems and their interaction with memory mapped
> files.  For example, a distributed file system may need to invalidate a specific
> page of a file that may be mapped multiple times on a node.

To reduce the time for an invalidate is indeed a good argument for
reverse maps.  However this is generally the uncommon case, and it is
fine to leave these kinds of things on the slow path.  From struct page
we currently go to struct address_space to lists of struct vm_area,
which works but is just a little slower (though generally cheaper) than
having a reverse map.

Since Rik was not seeing the invalidate or the unmap case as the
bottleneck, reverse mappings themselves are not needed, simply something
with a similar effect on the VM.

In linux we have avoided reverse maps (unlike the BSD's) which tends
to make the common case fast at the expense of making it more
difficult to handle times when the VM system is under extreme load and
we are swapping etc.

Eric


^ permalink raw reply	[flat|nested] 133+ messages in thread

* RE: broken VM in 2.4.10-pre9
@ 2001-09-17 15:40 Rob Fuller
  2001-09-17 16:03 ` Eric W. Biederman
  0 siblings, 1 reply; 133+ messages in thread
From: Rob Fuller @ 2001-09-17 15:40 UTC (permalink / raw)
  To: Rik van Riel, Eric W. Biederman; +Cc: linux-kernel, linux-mm

One argument for reverse mappings is distributed shared memory or
distributed file systems and their interaction with memory mapped files.
For example, a distributed file system may need to invalidate a specific
page of a file that may be mapped multiple times on a node.

This may be a naive argument given my limited knowledge of Linux memory
management internals.  If so, I will refrain from posting this sort of
thing in the future.  Let me know.

> -----Original Message-----
> From: Rik van Riel [mailto:riel@conectiva.com.br]
> Sent: Monday, September 17, 2001 7:13 AM
> To: Eric W. Biederman
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: broken VM in 2.4.10-pre9
> 
> 
> On 17 Sep 2001, Eric W. Biederman wrote:

<snip>

> > Do you have any arguments for the reverse mappings or just 
> for some of
> > the other side effects that go along with them?
> 
> Mainly for the side effects, but until somebody comes
> up with another idea to achieve all the side effects I'm
> not giving up on reverse mappings. If you can achieve
> all the good stuff in another way, show it.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:37   ` Linus Torvalds
@ 2001-09-17 14:04     ` Olaf Zaplinski
  0 siblings, 0 replies; 133+ messages in thread
From: Olaf Zaplinski @ 2001-09-17 14:04 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds wrote:
> [...]
> What do you want to happen? You want to have an interface like
> 
>         echo 0 > /proc/bugs/mm
> 
> that makes mm bugs go away?

Good idea! ;-)

Well, I had similar problems and went back to 2.2.19... but isn't there a
tuneable yet?

On http://www.badtux.org/eric/editorial/mindcraft.html I found this one:

'Tuning the file buffer size so that more than 60% of memory can be used
(90% in this example) can be accomplished by issuing the following command:
echo "2 10 90" >/proc/sys/vm/buffermem
This is documented in the file /usr/src/linux/Documentation/sysctl/vm.txt
along with many other tuning parameters, such as the 'bdflush' parameter.'


But vm.txt from 2.4.9ac10 and 2.2.19 says:

buffermem:

The three values in this file correspond to the values in
the struct buffer_mem. It controls how much memory should
be used for buffer memory. The percentage is calculated
as a percentage of total system memory.

The values are:
min_percent     -- this is the minimum percentage of memory
                   that should be spent on buffer memory
borrow_percent  -- UNUSED
max_percent     -- UNUSED

Is vm.txt out of date, or is there really no tuneable in either 2.2.x or
2.4.x?

Olaf

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-15 22:43 Peter Magnusson
  2001-09-15 23:50 ` Jan Harkes
  2001-09-16  5:31 ` Linus Torvalds
@ 2001-09-17 10:25 ` Tonu Samuel
  2001-09-16 16:47   ` Jeremy Zawodny
  2001-09-16 19:37   ` Linus Torvalds
  2 siblings, 2 replies; 133+ messages in thread
From: Tonu Samuel @ 2001-09-17 10:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On 16 Sep 2001 05:31:11 +0000, Linus Torvalds wrote:

> Also note that the amount of "swap used" is totally meaningless in
> 2.4.x. The 2.4.x kernel will _allocate_ the swap backing store much
> earlier than 2.2.x, but that doesn't actually mean that it does any of
> the IO. Indeed, allocating the swap backing store just means that the
> swap pages are then kept track of, so that they can be aged along with
> other stores.

The problem still exists and persists.  Not long ago, someone from Yahoo
described well a case where the change from 2.2.19 to 2.4.x caused
performance problems.  On 2.2.19 everything ran fine.  They had MySQL
running and did backups from disk.  After upgrading to 2.4.x, MySQL
performance fell at backup time.  They investigated and found that the
MySQL daemon gets swapped out in the middle of use to make room for
buffers.  In summary: this made both SQL and the backup doubly slow.
Even increasing memory from 1G to 2G didn't help.  Finally they disabled
swap entirely and the problem went away.

If you do not want to change it back to how it was in 2.2.x, it would be
good if this were tunable somehow.
 
-- 
For technical support contracts, goto https://order.mysql.com/
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Mr. Tonu Samuel <tonu@mysql.com>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Security Administrator
/_/  /_/\_, /___/\___\_\___/   Hong Kong, China
       <___/   www.mysql.com


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 15:19 ` broken VM in 2.4.10-pre9 Phillip Susi
  2001-09-16 19:33   ` Jeremy Zawodny
@ 2001-09-16 19:52   ` Rik van Riel
  1 sibling, 0 replies; 133+ messages in thread
From: Rik van Riel @ 2001-09-16 19:52 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-kernel

On Sun, 16 Sep 2001, Phillip Susi wrote:

> Maybe I'm missing something here, but it seems to me that these
> problems are due to the cache putting pressure on VM, so process pages
> get swapped out.  The obvious solution to this is to limit the size of
> the cache, or implement some sort of algorithm to slow its growth and
> reduce the pressure on VM.

> Am I way off base here?

You're absolutely right, and it only takes a tiny patch to
implement.  I've attached a completely untested (I haven't
even compiled it) patch which does exactly that.  I suspect
it'll apply to any recent -ac kernel; porting it to -linus
should be easy.

regards,

Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)



--- mm/vmscan.c.orig	Sun Sep 16 16:44:14 2001
+++ mm/vmscan.c	Sun Sep 16 16:49:09 2001
@@ -731,6 +731,8 @@
  */
 #define too_many_buffers (atomic_read(&buffermem_pages) > \
 		(num_physpages * buffer_mem.borrow_percent / 100))
+#define too_much_cache ((page_cache_size - swapper_space.nrpages) > \
+		(num_physpages * page_cache.borrow_percent / 100))
 int refill_inactive_scan(unsigned int priority)
 {
 	struct list_head * page_lru;
@@ -793,6 +795,18 @@
 		 * be reclaimed there...
 		 */
 		if (page->buffers && !page->mapping && too_many_buffers) {
+			deactivate_page_nolock(page);
+			page_active = 0;
+		}
+
+		/*
+		 * If the page cache is too large, move the page
+		 * to the inactive list. If it is really accessed
+		 * it'll be referenced before it reaches the point
+		 * where we'll reclaim it.
+		 */
+		if (page->mapping && too_much_cache && page_count(page) <=
+					(page->buffers ? 2 : 1)) {
 			deactivate_page_nolock(page);
 			page_active = 0;
 		}
--- mm/swap.c.orig	Sun Sep 16 16:50:43 2001
+++ mm/swap.c	Sun Sep 16 16:50:58 2001
@@ -64,7 +64,7 @@

 buffer_mem_t page_cache = {
 	2,	/* minimum percent page cache */
-	15,	/* borrow percent page cache */
+	60,	/* borrow percent page cache */
 	75	/* maximum */
 };



^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 18:36     ` Alan Cox
@ 2001-09-16 19:38       ` Linus Torvalds
  0 siblings, 0 replies; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16 19:38 UTC (permalink / raw)
  To: linux-kernel

In article <E15igmC-0005bs-00@the-village.bc.nu>,
Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
>> Yep, that was me.  It was frustrating to have to double the RAM in the
>> machine and then turn off swap.  The extra RAM did help, but it really
>> only delayed the problem.
>
>That shouldn't be needed with at least the later -ac kernels, nor is the
>swap > twice RAM rule present in those

Nor has it been present in the standard kernels since 2.4.8.

		Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17 10:25 ` Tonu Samuel
  2001-09-16 16:47   ` Jeremy Zawodny
@ 2001-09-16 19:37   ` Linus Torvalds
  2001-09-17 14:04     ` Olaf Zaplinski
  1 sibling, 1 reply; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16 19:37 UTC (permalink / raw)
  To: linux-kernel

In article <1000722338.14005.0.camel@x153.internalnet>,
Tonu Samuel  <tonu@please.do.not.remove.this.spam.ee> wrote:
>
>The problem still exists and persists. Not long ago, someone from Yahoo
>described well a case where the change from 2.2.19 to 2.4.x caused
>performance problems. On 2.2.19 everything ran fine. They had MySQL
>running and did backups from disk. After upgrading to 2.4.x, MySQL
>performance fell at backup time. They investigated and found that the
>MySQL daemon gets swapped out in the middle of use to make room for buffers.

Note that if you're using a raw device backup strategy (ie "e2dump" or
similar), that is expected: 2.4.x up until about 2.4.7 gave _much_ too
much preference to the buffer cache. 

That should actually have been fixed in 2.4.8. We used to mark buffer
pages much too active.

> In summary:
>this made both SQL and the backup doubly slow. Even increasing memory from
>1G to 2G didn't help. Finally they disabled swap entirely and the problem
>went away.

You just hid the problem - by disabling swap the buffer cache couldn't
grow without bounds any more, and the proper buffer cache shrinking
couldn't happen.

Try 2.4.8 or later.

>If you do not want to change it back to how it was in 2.2.x, it would be
>good if this were tunable somehow.

Tuning for bugs?

What do you want to happen? You want to have an interface like

	echo 0 > /proc/bugs/mm

that makes mm bugs go away?

		Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 15:19 ` broken VM in 2.4.10-pre9 Phillip Susi
@ 2001-09-16 19:33   ` Jeremy Zawodny
  2001-09-16 19:52   ` Rik van Riel
  1 sibling, 0 replies; 133+ messages in thread
From: Jeremy Zawodny @ 2001-09-16 19:33 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-kernel

On Sun, Sep 16, 2001 at 03:19:29PM +0000, Phillip Susi wrote:

> Maybe I'm missing something here, but it seems to me that these
> problems are due to the cache putting pressure on VM, so process
> pages get swapped out.

That's what it felt like in the cases that I ran into it.  It was
trying to treat all memory equally, when it probably shouldn't have.

Jeremy
-- 
Jeremy D. Zawodny     |  Perl, Web, MySQL, Linux Magazine, Yahoo!
<Jeremy@Zawodny.com>  |  http://jeremy.zawodny.com/

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:47   ` Jeremy Zawodny
@ 2001-09-16 18:36     ` Alan Cox
  2001-09-16 19:38       ` Linus Torvalds
  0 siblings, 1 reply; 133+ messages in thread
From: Alan Cox @ 2001-09-16 18:36 UTC (permalink / raw)
  To: Jeremy Zawodny; +Cc: Tonu Samuel, Linus Torvalds, linux-kernel

> Yep, that was me.  It was frustrating to have to double the RAM in the
> machine and then turn off swap.  The extra RAM did help, but it really
> only delayed the problem.

That shouldn't be needed with at least the later -ac kernels, nor is the
swap > twice RAM rule present in those.

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
       [not found] ` <fa.gu977tv.1b7u0g9@ifi.uio.no>
@ 2001-09-16 18:06   ` Dan Maas
  0 siblings, 0 replies; 133+ messages in thread
From: Dan Maas @ 2001-09-16 18:06 UTC (permalink / raw)
  To: Jeremy Zawodny; +Cc: linux-kernel

> Agreed.  It'd be great if there were an option to say "Don't swap out
> memory that was allocated by these programs.  If you run out of disk
> buffers, toss the oldest ones and start re-using them."

Have you tried mlockall(MCL_FUTURE)? It's a sledge-hammer but it will do
what you ask...

Dan


^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-17 10:25 ` Tonu Samuel
@ 2001-09-16 16:47   ` Jeremy Zawodny
  2001-09-16 18:36     ` Alan Cox
  2001-09-16 19:37   ` Linus Torvalds
  1 sibling, 1 reply; 133+ messages in thread
From: Jeremy Zawodny @ 2001-09-16 16:47 UTC (permalink / raw)
  To: Tonu Samuel; +Cc: Linus Torvalds, linux-kernel

On Mon, Sep 17, 2001 at 06:25:38PM +0800, Tonu Samuel wrote:
> On 16 Sep 2001 05:31:11 +0000, Linus Torvalds wrote:
> 
> > Also note that the amount of "swap used" is totally meaningless in
> > 2.4.x. The 2.4.x kernel will _allocate_ the swap backing store much
> > earlier than 2.2.x, but that doesn't actually mean that it does any of
> > the IO. Indeed, allocating the swap backing store just means that the
> > swap pages are then kept track of, so that they can be aged along with
> > other stores.
> 
> The problem still exists and persists. Not long ago, someone from
> Yahoo described well a case where the change from 2.2.19 to 2.4.x
> caused performance problems. On 2.2.19 everything ran fine. They had
> MySQL running and did backups from disk. After upgrading to 2.4.x,
> MySQL performance fell at backup time. They investigated and found
> that the MySQL daemon gets swapped out in the middle of use to make
> room for buffers. In summary: this made both SQL and the backup
> doubly slow. Even increasing memory from 1G to 2G didn't help.
> Finally they disabled swap entirely and the problem went away.

Yep, that was me.  It was frustrating to have to double the RAM in the
machine and then turn off swap.  The extra RAM did help, but it really
only delayed the problem.

> If you do not want to change it back to the way it was in 2.2.x,
> it would be good if this were tunable somehow.

Agreed.  It'd be great if there were an option to say "Don't swap out
memory that was allocated by these programs.  If you run out of disk
buffers, toss the oldest ones and start re-using them."

Jeremy
-- 
Jeremy D. Zawodny     |  Perl, Web, MySQL, Linux Magazine, Yahoo!
<Jeremy@Zawodny.com>  |  http://jeremy.zawodny.com/

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:07 vm rewrite ready [Re: broken VM in 2.4.10-pre9] Rik van Riel
@ 2001-09-16 15:19 ` Phillip Susi
  2001-09-16 19:33   ` Jeremy Zawodny
  2001-09-16 19:52   ` Rik van Riel
  0 siblings, 2 replies; 133+ messages in thread
From: Phillip Susi @ 2001-09-16 15:19 UTC (permalink / raw)
  To: linux-kernel

Maybe I'm missing something here, but it seems to me that these problems are 
due to the cache putting pressure on VM, so process pages get swapped out.  
The obvious solution to this is to limit the size of the cache, or implement 
some sort of algorithm to slow its growth and reduce the pressure on VM.  It 
also seems that one of the causes for the cache expanding is large bulk file 
copies, or reads for, say, mp3 playing.  Wasn't there a flag to disable 
caching on file IO that these programs could use, to keep from polluting the 
cache?  

Am I way off base here?
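For the record, no such flag was in the stock kernel at the time the thread started — O_DIRECT support arrived in 2.4.10 itself, and posix_fadvise(2) came in later kernels. A sketch of the posix_fadvise(POSIX_FADV_DONTNEED) approach, in Python for brevity (assumes Linux and a posix_fadvise-capable libc/Python, i.e. well after this thread):

```python
import os

def read_uncached(path, chunk=1 << 16):
    """Read a file sequentially, then tell the kernel it may drop the
    pages from the page cache (POSIX_FADV_DONTNEED), so a one-shot bulk
    read does not push working-set pages out.  Returns bytes read."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            total += len(data)
        # Advise the kernel to drop everything we just pulled in.
        os.posix_fadvise(fd, 0, total, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

An mp3 player or backup tool using a pattern like this would stream through gigabytes without evicting everyone else's pages, which is exactly the behavior being asked for above.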

-- 
--> Phill Susi

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-16  5:31 ` Linus Torvalds
@ 2001-09-16  8:45   ` Eric W. Biederman
  0 siblings, 0 replies; 133+ messages in thread
From: Eric W. Biederman @ 2001-09-16  8:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

torvalds@transmeta.com (Linus Torvalds) writes:

> Don't look at how many pages of swap were used. That's a statistic,
> nothing more.

It is a statistic until you run out of them.  Obviously that isn't
the problem here, or we'd hear complaints about the OOM killer.  But
the number of pages used can make a difference.

Eric

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-15 22:43 Peter Magnusson
  2001-09-15 23:50 ` Jan Harkes
@ 2001-09-16  5:31 ` Linus Torvalds
  2001-09-16  8:45   ` Eric W. Biederman
  2001-09-17 10:25 ` Tonu Samuel
  2 siblings, 1 reply; 133+ messages in thread
From: Linus Torvalds @ 2001-09-16  5:31 UTC (permalink / raw)
  To: linux-kernel

In article <Pine.LNX.4.33L2.0109160031500.7740-100000@flashdance>,
Peter Magnusson  <iocc@flashdance.nothanksok.cx> wrote:
>
>2.4.10-pre4: quite ok VM, but put a little more on swap than 2.4.7
>2.4.10-pre8: not good

Ehh..

There are _no_ VM changes that I can see between pre4 and pre8.

>2.4.10-pre9: not good ... Linux hadn't used any swap at all; then I
>             unrared two very large files at the same time, and now 104
>             Mbyte of swap is used! :-( 2.4.7 didn't do this.
>             Best is to use swap as little as possible.

.. and there are none between pre8 and pre9.

Basically, it sounds like you have tested different loads on different
kernels, and some loads are nice and others are not.

Also note that the amount of "swap used" is totally meaningless in
2.4.x. The 2.4.x kernel will _allocate_ the swap backing store much
earlier than 2.2.x, but that doesn't actually mean that it does any of
the IO. Indeed, allocating the swap backing store just means that the
swap pages are then kept track of, so that they can be aged along with
other stores.

So whether Linux uses swap or not is a 100% meaningless indicator of
"goodness".  The only thing that matters is how well the job gets done,
ie was it reasonably responsive, and did the big untars finish quickly.. 

Don't look at how many pages of swap were used. That's a statistic,
nothing more.

		Linus

^ permalink raw reply	[flat|nested] 133+ messages in thread

* Re: broken VM in 2.4.10-pre9
  2001-09-15 22:43 Peter Magnusson
@ 2001-09-15 23:50 ` Jan Harkes
  2001-09-16  5:31 ` Linus Torvalds
  2001-09-17 10:25 ` Tonu Samuel
  2 siblings, 0 replies; 133+ messages in thread
From: Jan Harkes @ 2001-09-15 23:50 UTC (permalink / raw)
  To: linux-kernel

What do you consider as good VM?

Because pages aren't 'aged' until there is swap allocated for them, your
kernel should actually work better if it has a lot of pages backed by
swap. The only thing is, we don't really make the right decision about
which pages to swap out, but that's just a detail.

IMHO, a large number of cached/active pages == good.

Jan

On Sun, Sep 16, 2001 at 12:43:35AM +0200, Peter Magnusson wrote:
> 2.4.7: good VM
> 2.4.8: not good
> 2.4.9: not good!!!++
> 2.4.10-pre4: quite ok VM, but put a little more on swap than 2.4.7
> 2.4.10-pre8: not good
> 2.4.10-pre9: not good ... Linux hadn't used any swap at all; then I
>              unrared two very large files at the same time, and now 104
>              Mbyte of swap is used! :-( 2.4.7 didn't do this.
>              Best is to use swap as little as possible.
> 
> My cfg:
> 
> Real mem: 512684K (512 Mbyte)
> Swap    : 257032K
> compiled with: gcc version 2.96 20000731 (Linux-Mandrake 8.0 2.96-0.48mdk)

^ permalink raw reply	[flat|nested] 133+ messages in thread

* broken VM in 2.4.10-pre9
@ 2001-09-15 22:43 Peter Magnusson
  2001-09-15 23:50 ` Jan Harkes
                   ` (2 more replies)
  0 siblings, 3 replies; 133+ messages in thread
From: Peter Magnusson @ 2001-09-15 22:43 UTC (permalink / raw)
  To: linux-kernel

2.4.7: good VM
2.4.8: not good
2.4.9: not good!!!++
2.4.10-pre4: quite ok VM, but put a little more on swap than 2.4.7
2.4.10-pre8: not good
2.4.10-pre9: not good ... Linux hadn't used any swap at all; then I
             unrared two very large files at the same time, and now 104
             Mbyte of swap is used! :-( 2.4.7 didn't do this.
             Best is to use swap as little as possible.

My cfg:

Real mem: 512684K (512 Mbyte)
Swap    : 257032K
compiled with: gcc version 2.96 20000731 (Linux-Mandrake 8.0 2.96-0.48mdk)


!! remove "nothanksok." from my email if you want to reply to me !!


^ permalink raw reply	[flat|nested] 133+ messages in thread

end of thread, other threads:[~2001-10-01 13:00 UTC | newest]

Thread overview: 133+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-16 15:19 broken VM in 2.4.10-pre9 Ricardo Galli
2001-09-16 15:23 ` Michael Rothwell
2001-09-16 16:33   ` Rik van Riel
2001-09-16 16:50     ` Andreas Steinmetz
2001-09-16 17:12       ` Ricardo Galli
2001-09-16 17:06     ` Ricardo Galli
2001-09-16 17:18       ` Jeremy Zawodny
2001-09-16 18:45       ` Stephan von Krawczynski
2001-09-21  3:16         ` Bill Davidsen
2001-09-21 10:21         ` Stephan von Krawczynski
2001-09-21 14:08           ` Bill Davidsen
2001-09-21 14:23             ` Rik van Riel
2001-09-23 13:13               ` Eric W. Biederman
2001-09-23 13:27                 ` Rik van Riel
2001-09-21 10:43         ` Stephan von Krawczynski
2001-09-21 12:13           ` Rik van Riel
2001-09-21 12:55           ` Stephan von Krawczynski
2001-09-21 13:01             ` Rik van Riel
2001-09-22 11:01           ` Daniel Phillips
2001-09-22 20:05             ` Rik van Riel
2001-09-24  9:36           ` Linux VM design VDA
2001-09-24 11:06             ` Dave Jones
2001-09-24 12:15               ` Kirill Ratkin
2001-09-24 13:29             ` Rik van Riel
2001-09-24 14:05               ` VDA
2001-09-24 14:37                 ` Rik van Riel
2001-09-24 14:42                 ` Rik van Riel
2001-09-24 18:37             ` Daniel Phillips
2001-09-24 19:32               ` Rik van Riel
2001-09-24 17:27                 ` Rob Landley
2001-09-24 21:48                   ` Rik van Riel
2001-09-25  9:58                 ` Daniel Phillips
2001-09-25 16:03               ` bill davidsen
2001-09-24 18:46             ` Jonathan Morton
2001-09-24 19:16               ` Daniel Phillips
2001-09-24 19:11             ` Dan Mann
2001-09-25 10:55             ` VDA
2001-09-16 18:16     ` broken VM in 2.4.10-pre9 Stephan von Krawczynski
2001-09-16 19:43     ` Linus Torvalds
2001-09-16 19:57       ` Rik van Riel
2001-09-16 20:17       ` Rik van Riel
2001-09-16 20:29       ` Andreas Steinmetz
2001-09-16 21:28         ` Linus Torvalds
2001-09-16 22:47           ` Alex Bligh - linux-kernel
2001-09-16 22:55             ` Linus Torvalds
2001-09-16 22:59           ` Stephan von Krawczynski
2001-09-16 22:14             ` Linus Torvalds
2001-09-16 23:29               ` Stephan von Krawczynski
2001-09-17 15:35             ` Stephan von Krawczynski
2001-09-17 15:51               ` Linus Torvalds
2001-09-17 16:34               ` Stephan von Krawczynski
2001-09-17 16:46                 ` Linus Torvalds
2001-09-17 17:20                 ` Stephan von Krawczynski
2001-09-17 17:37                   ` Linus Torvalds
2001-09-17  0:37       ` Daniel Phillips
2001-09-17  1:07         ` Linus Torvalds
2001-09-17  2:23           ` Daniel Phillips
2001-09-17  5:11           ` Jan Harkes
2001-09-17 12:33             ` Daniel Phillips
2001-09-17 12:41               ` Rik van Riel
2001-09-17 14:49                 ` Daniel Phillips
2001-09-17 16:14               ` Jan Harkes
2001-09-17 16:34                 ` Linus Torvalds
2001-09-17 15:38             ` Linus Torvalds
2001-09-17 12:26           ` Rik van Riel
2001-09-17 15:42             ` Linus Torvalds
2001-09-18 12:04               ` Rik van Riel
2001-09-17 17:33             ` Linus Torvalds
2001-09-17 18:07               ` Linus Torvalds
2001-09-18 12:09               ` Rik van Riel
2001-09-21  3:10       ` Bill Davidsen
2001-09-17  8:06     ` Eric W. Biederman
2001-09-17 12:12       ` Rik van Riel
2001-09-17 15:45         ` Eric W. Biederman
  -- strict thread matches above, loose matches on Subject: below --
2001-09-22 19:59 Peter Magnusson
2001-09-22 20:46 ` Jan Harkes
2001-09-22 21:46   ` Peter Magnusson
2001-09-22 19:59 Peter Magnusson
2001-09-22 20:18 ` Rik van Riel
2001-09-19 22:15 Rob Fuller
2001-09-19 22:21 ` David S. Miller
2001-09-19 22:26 ` Christoph Hellwig
2001-09-19 22:30 ` Alan Cox
2001-09-19 22:48 ` Eric W. Biederman
2001-09-19 22:51 ` Bryan O'Sullivan
2001-09-20  3:16 ` GOTO Masanori
2001-09-20  7:38   ` Christoph Hellwig
2001-09-17 15:40 Rob Fuller
2001-09-17 16:03 ` Eric W. Biederman
2001-09-19  9:45   ` Daniel Phillips
2001-09-19 19:45     ` Alan Cox
2001-09-19 21:03       ` Eric W. Biederman
2001-09-19 22:04         ` Alan Cox
2001-09-19 22:26           ` Eric W. Biederman
2001-09-19 23:05           ` Rik van Riel
2001-09-20 11:28           ` Daniel Phillips
2001-09-20 12:06             ` Rik van Riel
2001-09-21  8:13               ` Daniel Phillips
2001-09-21 12:10                 ` Rik van Riel
2001-09-21 15:27                 ` Jan Harkes
2001-09-22  7:09                   ` Daniel Phillips
2001-09-25 11:04                     ` Mike Fedyk
2001-09-20 12:57             ` Alan Cox
2001-09-20 13:40               ` Daniel Phillips
2001-09-24 22:50           ` Pavel Machek
2001-09-26 18:22             ` Marcelo Tosatti
2001-09-26 23:44               ` Pavel Machek
2001-09-27 13:52                 ` Eric W. Biederman
2001-10-01 11:37                 ` Marcelo Tosatti
2001-09-19 23:00         ` Rik van Riel
2001-09-21  8:23           ` Eric W. Biederman
2001-09-21 12:01             ` Rik van Riel
2001-09-22  2:14             ` Alexander Viro
2001-09-22  3:09               ` Rik van Riel
2001-09-21 14:29           ` Gábor Lénárt
2001-09-21 14:35             ` Horst von Brand
2001-09-19 21:37       ` Eric W. Biederman
2001-09-19 21:55       ` David S. Miller
2001-09-20 13:02         ` Rik van Riel
2001-09-16 19:07 vm rewrite ready [Re: broken VM in 2.4.10-pre9] Rik van Riel
2001-09-16 15:19 ` broken VM in 2.4.10-pre9 Phillip Susi
2001-09-16 19:33   ` Jeremy Zawodny
2001-09-16 19:52   ` Rik van Riel
     [not found] <fa.i95if5v.74un2p@ifi.uio.no>
     [not found] ` <fa.gu977tv.1b7u0g9@ifi.uio.no>
2001-09-16 18:06   ` Dan Maas
2001-09-15 22:43 Peter Magnusson
2001-09-15 23:50 ` Jan Harkes
2001-09-16  5:31 ` Linus Torvalds
2001-09-16  8:45   ` Eric W. Biederman
2001-09-17 10:25 ` Tonu Samuel
2001-09-16 16:47   ` Jeremy Zawodny
2001-09-16 18:36     ` Alan Cox
2001-09-16 19:38       ` Linus Torvalds
2001-09-16 19:37   ` Linus Torvalds
2001-09-17 14:04     ` Olaf Zaplinski
