* Re: broken VM in 2.4.10-pre9
@ 2001-09-16 15:19 Ricardo Galli
2001-09-16 15:23 ` Michael Rothwell
0 siblings, 1 reply; 76+ messages in thread
From: Ricardo Galli @ 2001-09-16 15:19 UTC (permalink / raw)
To: linux-kernel
> So whether Linux uses swap or not is a 100% meaningless indicator of
> "goodness". The only thing that matters is how well the job gets done,
> ie was it reasonably responsive, and did the big untars finish quickly..
I am running 2.4.9 on a PII with 448MB RAM. After listening to a couple
of hours of MP3s from /dev/cdrom with KDE started, more than 70MB went
to swap and about 300MB sat in cache, and KDE takes about 15-20 seconds
just to log out and show the greeting console.

Obviously, all apps went to disk to leave space for caching MP3 files
that are read only once. Although logging out is not a very frequent
operation...
Regards,
--ricardo
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 15:19 broken VM in 2.4.10-pre9 Ricardo Galli
@ 2001-09-16 15:23 ` Michael Rothwell
  2001-09-16 16:33   ` Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: Michael Rothwell @ 2001-09-16 15:23 UTC (permalink / raw)
To: Ricardo Galli; +Cc: linux-kernel

Is there a way to tell the VM to prune its cache? Or a way to limit the
amount of cache it uses?

On 16 Sep 2001 17:19:43 +0200, Ricardo Galli wrote:
> > So whether Linux uses swap or not is a 100% meaningless indicator of
> > "goodness". The only thing that matters is how well the job gets done,
> > ie was it reasonably responsive, and did the big untars finish quickly..
>
> I am running 2.4.9 on a PII with 448MB RAM. After listening to a couple
> of hours of MP3s from /dev/cdrom with KDE started, more than 70MB went
> to swap and about 300MB sat in cache, and KDE takes about 15-20 seconds
> just to log out and show the greeting console.
>
> Obviously, all apps went to disk to leave space for caching MP3 files
> that are read only once. Although logging out is not a very frequent
> operation...
>
> Regards,
>
> --ricardo
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 15:23 ` Michael Rothwell
@ 2001-09-16 16:33   ` Rik van Riel
  2001-09-16 16:50     ` Andreas Steinmetz
                        ` (4 more replies)
  0 siblings, 5 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-16 16:33 UTC (permalink / raw)
To: Michael Rothwell; +Cc: Ricardo Galli, linux-kernel

On 16 Sep 2001, Michael Rothwell wrote:

> Is there a way to tell the VM to prune its cache? Or a way to limit
> the amount of cache it uses?

Not yet, I'll make a quick hack for this when I get back next week.
It's pretty obvious now that the 2.4 kernel cannot get enough
information to select the right pages to evict from memory.

For 2.5 I'm making a VM subsystem with reverse mappings; the first
iterations are giving very sweet performance, so I will continue with
this project regardless of what other kernel hackers might say ;)

cheers,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33 ` Rik van Riel
@ 2001-09-16 16:50   ` Andreas Steinmetz
  2001-09-16 17:12     ` Ricardo Galli
  2001-09-16 17:06   ` Ricardo Galli
                      ` (3 subsequent siblings)
  4 siblings, 1 reply; 76+ messages in thread
From: Andreas Steinmetz @ 2001-09-16 16:50 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, Ricardo Galli, Michael Rothwell

On 16-Sep-2001 Rik van Riel wrote:
> On 16 Sep 2001, Michael Rothwell wrote:
>
>> Is there a way to tell the VM to prune its cache? Or a way to limit
>> the amount of cache it uses?
>
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.
>

In my experience you should try running aide
(ftp://ftp.cs.tut.fi/pub/src/gnu/aide-0.7.tar.gz) as a test. This is a
case of a single process doing a file system consistency check and
stopping all other processes cold, due to swapout caused by heavy
caching. While aide runs, the system just becomes unusable.

Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:50 ` Andreas Steinmetz
@ 2001-09-16 17:12   ` Ricardo Galli
  0 siblings, 0 replies; 76+ messages in thread
From: Ricardo Galli @ 2001-09-16 17:12 UTC (permalink / raw)
To: linux-kernel

On Sun, 16 Sep 2001, Andreas Steinmetz wrote:

> Easier, though (for cases of listening to MP3s and backups): cache
> pages that were accessed only "once"(*) several seconds ago must be
> discarded first. It only implies a check against an access counter and
> a "last accessed" epoch field of the page.
>
> (*) Or by the same process/process group in a very short period, i.e.
> the last-access timestamp should be updated only if the previous access
      ^^^^^^^^^^^^^^^^^^^^^
> was few seconds ago.

Sorry, I wanted to say "access counter".

--ricardo

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33 ` Rik van Riel
  2001-09-16 16:50   ` Andreas Steinmetz
@ 2001-09-16 17:06   ` Ricardo Galli
  2001-09-16 17:18     ` Jeremy Zawodny
  2001-09-16 18:45     ` Stephan von Krawczynski
  2001-09-16 18:16   ` broken VM in 2.4.10-pre9 Stephan von Krawczynski
                      ` (2 subsequent siblings)
  4 siblings, 2 replies; 76+ messages in thread
From: Ricardo Galli @ 2001-09-16 17:06 UTC (permalink / raw)
To: linux-kernel

On Sun, 16 Sep 2001, Rik van Riel wrote:

> > Is there a way to tell the VM to prune its cache? Or a way to limit
> > the amount of cache it uses?
>
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.

....

On Sun, 16 Sep 2001, Jeremy Zawodny wrote:

> Agreed. It'd be great if there was an option to say "Don't swap out
> memory that was allocated by these programs. If you run out of disk
> buffers, toss the oldest ones and start re-using them."

Easier, though (for cases of listening to MP3s and backups): cache pages
that were accessed only "once"(*) several seconds ago must be discarded
first. It only implies a check against an access counter and a "last
accessed" epoch field of the page.

(*) Or by the same process/process group in a very short period, i.e.
the last-access timestamp should be updated only if the previous access
was few seconds ago.

--ricardo

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 17:06 ` Ricardo Galli
@ 2001-09-16 17:18   ` Jeremy Zawodny
  2001-09-16 18:45   ` Stephan von Krawczynski
  1 sibling, 0 replies; 76+ messages in thread
From: Jeremy Zawodny @ 2001-09-16 17:18 UTC (permalink / raw)
To: Ricardo Galli; +Cc: linux-kernel

On Sun, Sep 16, 2001 at 07:06:45PM +0200, Ricardo Galli wrote:
> On Sun, 16 Sep 2001, Rik van Riel wrote:
> >
> > > Is there a way to tell the VM to prune its cache? Or a way to limit
> > > the amount of cache it uses?
> >
> > Not yet, I'll make a quick hack for this when I get back next
> > week. It's pretty obvious now that the 2.4 kernel cannot get
> > enough information to select the right pages to evict from
> > memory.
>
> ....
>
> On Sun, 16 Sep 2001, Jeremy Zawodny wrote:
> >
> > Agreed. It'd be great if there was an option to say "Don't swap out
> > memory that was allocated by these programs. If you run out of disk
> > buffers, toss the oldest ones and start re-using them."
>
> Easier, though (for cases of listening to MP3s and backups): cache
> pages that were accessed only "once"(*) several seconds ago must be
> discarded first. It only implies a check against an access counter
> and a "last accessed" epoch field of the page.

Yeah, something along those lines would be great. It would keep a big
(13GB) drive-to-drive file copy from causing a large (400MB) and
relatively active process to have its memory swapped out (and then back
in 20 seconds later). Imagine watching that for 45 minutes while a
backup that used to take 5-8 minutes runs.

Jeremy
--
Jeremy D. Zawodny     |  Perl, Web, MySQL, Linux Magazine, Yahoo!
<Jeremy@Zawodny.com>  |  http://jeremy.zawodny.com/

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 17:06 ` Ricardo Galli
  2001-09-16 17:18   ` Jeremy Zawodny
@ 2001-09-16 18:45   ` Stephan von Krawczynski
  2001-09-21  3:16     ` Bill Davidsen
                        ` (2 more replies)
  1 sibling, 3 replies; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 18:45 UTC (permalink / raw)
To: gallir; +Cc: linux-kernel

On Sun, 16 Sep 2001 20:16:57 +0200
Stephan von Krawczynski <skraw@ithnet.com> wrote:

> On Sun, 16 Sep 2001 19:06:45 +0200 (MET) Ricardo Galli
> <gallir@m3d.uib.es> wrote:
>
> > On Sun, 16 Sep 2001, Jeremy Zawodny wrote:
> > >
> > > Agreed. It'd be great if there was an option to say "Don't swap out
> > > memory that was allocated by these programs. If you run out of disk
> > > buffers, toss the oldest ones and start re-using them."
> >
> > Easier, though (for cases of listening to MP3s and backups): cache
> > pages that were accessed only "once"(*) several seconds ago must be
> > discarded first. It only implies a check against an access counter
> > and a "last accessed" epoch field of the page.
>
> Well, I guess this is everybody's first idea about the problem: make
> an initial timestamp for knowing how _old_ an allocation really is,

Thinking again about it, I guess I would prefer a FIFO list of allocated
pages. This would allow you to "know" the age simply by a page's
position in the list. You wouldn't need a timestamp then, and even
better, it works equally well for systems with high VM load and low,
because you do not deal with absolute time comparisons, but relative
ones. That sounds pretty good to me.

The problem with page accesses is still not solved, but if you had an
idea for that, you could manipulate the alloc list simply by moving
accessed pages towards the end (one or several positions) of the list,
which effectively "youngers" them. That way you get around adding new
members to structs, with simple (and fast) list operations.

Comments?

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 18:45 ` Stephan von Krawczynski
@ 2001-09-21  3:16   ` Bill Davidsen
  2001-09-21 10:21   ` Stephan von Krawczynski
  2001-09-21 10:43   ` Stephan von Krawczynski
  2 siblings, 0 replies; 76+ messages in thread
From: Bill Davidsen @ 2001-09-21 3:16 UTC (permalink / raw)
To: linux-kernel

On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:

> Thinking again about it, I guess I would prefer a FIFO list of
> allocated pages. This would allow you to "know" the age simply by a
> page's position in the list. You wouldn't need a timestamp then, and
> even better, it works equally well for systems with high VM load and
> low, because you do not deal with absolute time comparisons, but
> relative ones. That sounds pretty good to me.

The problem is that when many things affect the optimal ratio of text,
data, buffer and free space, a solution which doesn't measure all the
important factors will produce sub-optimal results. Your proposal is
simple and elegant, but I think it's too simple to produce good results.
See my reply to Linus' comments.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21  3:16 ` Bill Davidsen
@ 2001-09-21 10:21   ` Stephan von Krawczynski
  2001-09-21 14:08     ` Bill Davidsen
  2001-09-21 10:43   ` Stephan von Krawczynski
  2 siblings, 1 reply; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-21 10:21 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-kernel

On Thu, 20 Sep 2001 23:16:55 -0400 (EDT) Bill Davidsen <davidsen@tmr.com>
wrote:

> On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:
>
> > Thinking again about it, I guess I would prefer a FIFO list of
> > allocated pages. This would allow you to "know" the age simply by a
> > page's position in the list. You wouldn't need a timestamp then, and
> > even better, it works equally well for systems with high VM load and
> > low, because you do not deal with absolute time comparisons, but
> > relative ones. That sounds pretty good to me.
>
> The problem is that when many things affect the optimal ratio of text,
> data, buffer and free space, a solution which doesn't measure all the
> important factors will produce sub-optimal results. Your proposal is
> simple and elegant, but I think it's too simple to produce good
> results. See my reply to Linus' comments.

Actually I did not really propose a method of valuing the several pros
and cons in the aging itself, but a very basic idea of how this could be
done without fiddling around with page->members (like page->age), which
always implies you have to walk down a whole list to get the full
picture in case of urgent need for freeable pages.

If you age something by rearranging its position in a list, you have the
drawback of list locking, but the gain of quickly finding the best
freeable pages by simply taking the first ones in that list. Even
better, you can add whatever criteria you like to this aging, e.g. you
could rearrange the list to let consecutive pages be freed together and
so on; all would be pretty easy to achieve, and the page struct becomes
even smaller.

The more I think about it, the better it sounds.
Your opinion?

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:21 ` Stephan von Krawczynski
@ 2001-09-21 14:08   ` Bill Davidsen
  2001-09-21 14:23     ` Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: Bill Davidsen @ 2001-09-21 14:08 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: linux-kernel

On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:

> On Thu, 20 Sep 2001 23:16:55 -0400 (EDT) Bill Davidsen <davidsen@tmr.com>
> wrote:
>
> > On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:

[... snip ...]

> > The problem is that when many things affect the optimal ratio of
> > text, data, buffer and free space, a solution which doesn't measure
> > all the important factors will produce sub-optimal results. Your
> > proposal is simple and elegant, but I think it's too simple to
> > produce good results. See my reply to Linus' comments.
>
> Actually I did not really propose a method of valuing the several pros
> and cons in the aging itself, but a very basic idea of how this could
> be done without fiddling around with page->members (like page->age),
> which always implies you have to walk down a whole list to get the
> full picture in case of urgent need for freeable pages.
>
> If you age something by rearranging its position in a list, you have
> the drawback of list locking, but the gain of quickly finding the best
> freeable pages by simply taking the first ones in that list. Even
> better, you can add whatever criteria you like to this aging, e.g. you
> could rearrange the list to let consecutive pages be freed together
> and so on; all would be pretty easy to achieve, and the page struct
> becomes even smaller.
>
> The more I think about it, the better it sounds.
> Your opinion?

The list is an okay way to determine rank within a class, but I still
think that there is a need for some balance between text, program data,
pages loaded via I/O, perhaps more.

My disquiet with the new implementation is based on a desire to avoid
swapping program data to make room for I/O data (using those terms in a
loose way for identification). I would also like to have time to
investigate what happens if the pages associated with a program load are
handled in larger blocks, meta-pages perhaps, which would at least cause
many to be loaded at once on a page fault, rather than faulting them in
one at a time.

I have to look at the code again in my spare time; my last serious visit
was 2.2.15 or so, looking to improve SMP performance.

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 14:08 ` Bill Davidsen
@ 2001-09-21 14:23   ` Rik van Riel
  2001-09-23 13:13     ` Eric W. Biederman
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-09-21 14:23 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Stephan von Krawczynski, linux-kernel

On Fri, 21 Sep 2001, Bill Davidsen wrote:

> The list is an okay way to determine rank within a class, but I still
> think that there is a need for some balance between text, program data,
> pages loaded via I/O, perhaps more. My disquiet with the new
> implementation is based on a desire to avoid swapping program data to
> make room for I/O data (using those terms in a loose way for
> identification).

Preference for evicting one kind of cache is indeed a bad thing. It
might work for 90% of the workloads, but you can be sure it breaks
horribly for the other 10%.

I'm currently busy tweaking the old 2.4 VM (in the -ac kernels) to try
and get optimal performance from that one, without giving preference to
one kind of cache ... except in the situation where the amount of cache
is excessive.

> I would also like to have time to investigate what happens if the pages
> associated with a program load are handled in larger blocks, meta-pages
> perhaps, which would at least cause many to be loaded at once on a page
> fault, rather than faulting them in one at a time.

This is an interesting thing, too. Something to look into for 2.5, and
if it turns out simple enough we may even want to backport it to 2.4.

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 14:23 ` Rik van Riel
@ 2001-09-23 13:13   ` Eric W. Biederman
  2001-09-23 13:27     ` Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: Eric W. Biederman @ 2001-09-23 13:13 UTC (permalink / raw)
To: Rik van Riel; +Cc: Bill Davidsen, Stephan von Krawczynski, linux-kernel

Rik van Riel <riel@conectiva.com.br> writes:

> > I would also like to have time to investigate what happens if the
> > pages associated with a program load are handled in larger blocks,
> > meta-pages perhaps, which would at least cause many to be loaded at
> > once on a page fault, rather than faulting them in one at a time.
>
> This is an interesting thing, too. Something to look into for 2.5, and
> if it turns out simple enough we may even want to backport it to 2.4.

filemap_nopage already does all of this except put the page in the page
table.

Eric

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-23 13:13 ` Eric W. Biederman
@ 2001-09-23 13:27   ` Rik van Riel
  0 siblings, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-23 13:27 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Bill Davidsen, Stephan von Krawczynski, linux-kernel

On 23 Sep 2001, Eric W. Biederman wrote:

> Rik van Riel <riel@conectiva.com.br> writes:
>
> > > I would also like to have time to investigate what happens if the
> > > pages associated with a program load are handled in larger blocks,
> > > meta-pages perhaps, which would at least cause many to be loaded at
> > > once on a page fault, rather than faulting them in one at a time.
> >
> > This is an interesting thing, too. Something to look into for 2.5,
> > and if it turns out simple enough we may even want to backport it to
> > 2.4.
>
> filemap_nopage already does all of this except put the page in the
> page table.

Exactly. There are two things we need to fix:

1) set up the page tables in a clustered way

2) make filemap_nopage() aware of sequential IO and teach it to do
   asynchronous readahead .. maybe even with drop-behind on the VMA
   level?

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 18:45 ` Stephan von Krawczynski
  2001-09-21  3:16   ` Bill Davidsen
  2001-09-21 10:21   ` Stephan von Krawczynski
@ 2001-09-21 10:43   ` Stephan von Krawczynski
  2001-09-21 12:13     ` Rik van Riel
                        ` (3 more replies)
  2 siblings, 4 replies; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-21 10:43 UTC (permalink / raw)
To: Bill Davidsen; +Cc: linux-kernel

On Thu, 20 Sep 2001 23:16:55 -0400 (EDT) Bill Davidsen <davidsen@tmr.com>
wrote:

> On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:
>
> > Thinking again about it, I guess I would prefer a FIFO list of
> > allocated pages. This would allow you to "know" the age simply by a
> > page's position in the list. You wouldn't need a timestamp then, and
> > even better, it works equally well for systems with high VM load and
> > low, because you do not deal with absolute time comparisons, but
> > relative ones. That sounds pretty good to me.
>
> The problem is that when many things affect the optimal ratio of text,
> data, buffer and free space, a solution which doesn't measure all the
> important factors will produce sub-optimal results. Your proposal is
> simple and elegant, but I think it's too simple to produce good
> results. See my reply to Linus' comments.

Sorry to follow up to the same post again, but I just read across
another thread where people discuss heavily whether aging up by 3 and
down by 1, or vice versa, is better or worse. The real problem behind
this is that they are trying to bring some order into the pages by their
age. Unfortunately this cannot really work out well, because you will
_always_ end up with a few or a lot of pages of the same age, which does
not help you much in a situation where you need to know which is the
_best_ one to drop next. In this tie situation you have nothing to rely
on but the age and some rough guesses (or even worse: performance
issues, _not_ to walk the whole tree to find the best fitting page).

This comes _solely_ from the fact that you have no steadily rising order
in page->age (is this correct English for the mathematical term?). You
_cannot_ get one from the current function. So it must be considered
_bad_. On the other hand, a list is always ordered and therefore does
not have this problem.

Shit, if I only were able to implement that. Can anybody help me to
prove my point?

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:43 ` Stephan von Krawczynski
@ 2001-09-21 12:13   ` Rik van Riel
  2001-09-21 12:55   ` Stephan von Krawczynski
                      ` (2 subsequent siblings)
  3 siblings, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-21 12:13 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: Bill Davidsen, linux-kernel

On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:

> Shit, if I only were able to implement that. Can anybody help me to
> prove my point?

Trying to implement your idea would probably pose a nice
counter-argument. Without measuring which pages are in heavy use, how
are you going to evict the right pages?

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:43 ` Stephan von Krawczynski
  2001-09-21 12:13   ` Rik van Riel
@ 2001-09-21 12:55   ` Stephan von Krawczynski
  2001-09-21 13:01     ` Rik van Riel
  2001-09-22 11:01   ` Daniel Phillips
  2001-09-24  9:36   ` Linux VM design VDA
  3 siblings, 1 reply; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-21 12:55 UTC (permalink / raw)
To: Rik van Riel; +Cc: davidsen, linux-kernel

On Fri, 21 Sep 2001 09:13:07 -0300 (BRST) Rik van Riel
<riel@conectiva.com.br> wrote:

> On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:
>
> > Shit, if I only were able to implement that. Can anybody help me to
> > prove my point?
>
> Trying to implement your idea would probably pose a nice
> counter-argument. Without measuring which pages are in
> heavy use, how are you going to evict the right pages ?

Hi Rik,

The really beautiful thing about it is that you can divide it completely
into two parts:

1) The basic list handling: you obviously need the list itself and some
   atomic functions to queue/dequeue/requeue entries, possibly as well
   as get_next_freeable() for simplicity. The rest of the VM only uses
   this to work.

2) The management "plugins", where you can do virtually any check for
   heavy use or aging or buddy-finding or whatever comes to your mind,
   and requeue accordingly. You may do that on every alloc (surely not
   nice), or on page hits, or on a low-mem condition (like
   page_launder), or in an independent process (somewhat like kswapd),
   whatever you tend to believe is the best performing way - feel free
   to find the killer plugin :-).

BUT (and that's the really good point): (2) is completely independent in
structure and processing from the basic mem handling, because the only
interaction is requeuing. This means that as a first step (only
experimental, of course) you could just fill the list with addtail and
shorten it on demand for free pages with remhead (hope my short terms
are understandable).

This implements a _very_ simple aging based only on the age of the
allocation and nothing else (FIFO). You can spend any amount of time and
brains refining the strategy without ever touching the VM basics, _and_
(because of the simple and clean interface between (1) and (2)) you have
no chance to screw things up (unless a buggy implementation fails to
drop entries). No obvious need for patches or weird workarounds.

Your opinion?

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 12:55 ` Stephan von Krawczynski
@ 2001-09-21 13:01   ` Rik van Riel
  0 siblings, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-21 13:01 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: davidsen, linux-kernel

On Fri, 21 Sep 2001, Stephan von Krawczynski wrote:

> The really beautiful thing about it is that you can divide it
> completely into two parts:

> Your opinion?

I'll believe it when I see it; your idea is still very abstract and I
haven't seen you even start to talk about the data structures used in
the implementation. (except for "list" showing up at random places in
the text)

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-21 10:43 ` Stephan von Krawczynski
  2001-09-21 12:13   ` Rik van Riel
  2001-09-21 12:55   ` Stephan von Krawczynski
@ 2001-09-22 11:01   ` Daniel Phillips
  2001-09-22 20:05     ` Rik van Riel
  2001-09-24  9:36   ` Linux VM design VDA
  3 siblings, 1 reply; 76+ messages in thread
From: Daniel Phillips @ 2001-09-22 11:01 UTC (permalink / raw)
To: Stephan von Krawczynski, Bill Davidsen; +Cc: linux-kernel

On September 21, 2001 12:43 pm, Stephan von Krawczynski wrote:
> Sorry to follow up to the same post again, but I just read across
> another thread where people discuss heavily whether aging up by 3 and
> down by 1, or vice versa, is better or worse. The real problem behind
> this is that they are trying to bring some order into the pages by
> their age. Unfortunately this cannot really work out well, because you
> will _always_ end up with a few or a lot of pages of the same age,
> which does not help you much in a situation where you need to know
> which is the _best_ one to drop next. In this tie situation you have
> nothing to rely on but the age and some rough guesses (or even worse:
> performance issues, _not_ to walk the whole tree to find the best
> fitting page). This comes _solely_ from the fact that you have no
> steadily rising order in page->age. You _cannot_ get one from the
> current function. So it must be considered _bad_. On the other hand, a
> list is always ordered and therefore does not have this problem.
>
> Shit, if I only were able to implement that. Can anybody help me to
> prove my point?

You got your wish. Andrea's mm patch, introduced into the main tree at
2.4.10-pre11, uses a standard LRU:

+		if (PageTestandClearReferenced(page)) {
+			if (PageInactive(page)) {
+				del_page_from_inactive_list(page);
+				add_page_to_active_list(page);
+			} else if (PageActive(page)) {
+				list_del(entry);
+				list_add(entry, &active_list);

<musings>
There are arguments about whether page aging can be superior to standard
LRU, and I personally believe it can be, but there's no question that
ordinary LRU is a lot easier to implement correctly and will perform a
lot better than incorrectly implemented/untuned page aging.

The arguments in support of aging over LRU that I'm aware of are:

  - incrementing an age is more efficient than resetting several LRU
    list links
  - it also captures some frequency-of-use information
  - it can be implemented in hardware (not that that matters much)
  - it allows more scope for tuning/balancing (and also rope to hang
    oneself)

The big problem with aging is that unless it's entirely correctly
balanced it's just not going to work very well. To balance it well
requires knowing a lot about rates of list scanning and so on. Matt
Dillon perfected this art in BSD, but we never did, being preoccupied
with things like just getting the mm scanners to activate when required,
and sorting out our special complexities like zones and highmem buffers.
Probably another few months of working on it would let us get past the
remaining structural problems and actually start tuning it, but we've
already made people wait way too long for a stable 2.4. A more robust
strategy makes a lot of sense right now. We can still play with stronger
magic in 2.5, and of course Rik's aging strategy will continue to be
developed in Alan's tree while Andrea's is still going through the test
of fire.
</musings>

I'll keep reading Andrea's code and maybe I'll be able to shed some more
light on the algorithms he's using, since he doesn't seem to be in a big
hurry to do that himself. (Hi Andrea ;-)

--
Daniel

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-22 11:01 ` Daniel Phillips
@ 2001-09-22 20:05   ` Rik van Riel
  0 siblings, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-22 20:05 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Stephan von Krawczynski, Bill Davidsen, linux-kernel

On Sat, 22 Sep 2001, Daniel Phillips wrote:

> I'll keep reading Andrea's code and maybe I'll be able to shed some
> more light on the algorithms he's using, since he doesn't seem to be
> in a big hurry to do that himself. (Hi Andrea ;-)

Heh, this'll probably lead to the same maintenance nightmare we had in
2.2 (undocumented code and nobody even agreeing on exactly what the code
does, let alone what it's supposed to do).

cheers,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply [flat|nested] 76+ messages in thread
* Linux VM design 2001-09-21 10:43 ` Stephan von Krawczynski ` (2 preceding siblings ...) 2001-09-22 11:01 ` Daniel Phillips @ 2001-09-24 9:36 ` VDA 2001-09-24 11:06 ` Dave Jones ` (5 more replies) 3 siblings, 6 replies; 76+ messages in thread From: VDA @ 2001-09-24 9:36 UTC (permalink / raw) To: Andrea Arcangeli, Rik van Riel, Alexander Viro, Daniel Phillips Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 3776 bytes --] Hi VM folks, I'd like to understand Linux VM but there's not much in Documentation/vm/* on the subject. I understand that with current frantic development pace it is hard to maintain such docs. However, with only a handful of people really understading how VM works we risk ending in a situation when nobody will know how to fix it (does recent Andrea's VM rewrite just replaced large part of hardly maintenable, not-quite-right VM in 2.4.10?) When I have a stalled problem to solve, I sometimes catch unsuspecting victim and start explainig what I am trying to do and how I'm doing that. Often, in the middle of my explanation, I realize myself what I did wrong. There is an old teacher's joke: "My pupils are dumb! I explained them this theme once, then twice, I finally myself understood it, and they still don't". ^^^^^^^^^^^^^^^^^^^^ Since we reached some kind of stability with 2.4, maybe Andrea, Rik and whoever else is considering himself VM geek would tell us not-so-clever lkml readers how VM works and put it in vm-2.4andrea, vm-2.4rik or whatever in Doc/vm/*, I will be unbelievably happy. Matt Dillon's post belongs there too. I have an example how I would describe VM if I knew anything about it. I am putting it in the zip attachment just to reduce number of people laughing on how stupid I am :-). Most lkml readers won't open it, I hope :-). If VM geeks are disagreeing with each other on some VM inner workings, they can describe their views in those separate files, giving readers ability to compare their VM designs. 
Maybe these files will evolve into VM FAQs. Saturday, September 22, 2001, 2:01:02 PM, Daniel Phillips <phillips@bonn-fries.net> wrote: DP> The arguments in support of aging over LRU that I'm aware of are: DP> - incrementing an age is more efficient than resetting several LRU list links DP> - also captures some frequency-of-use information Of what use can this info be? If one page is accessed 100 times/second and the other one once in 10 seconds, they both have to stay in RAM. The VM should take 'time since last access' into account when deciding which page to swap out, not how often it was referenced. DP> - it can be implemented in hardware (not that that matters much) DP> - allows more scope for tuning/balancing (and also rope to hang oneself) DP> The big problem with aging is that unless it's entirely correctly balanced it's DP> just not going to work very well. To balance it well requires knowing a lot DP> about rates of list scanning and so on. Matt Dillon perfected this art in BSD, DP> but we never did, being preoccupied with things like just getting the mm DP> scanners to activate when required, and sorting out our special complexities DP> like zones and highmem buffers. Probably another few months of working on it DP> would let us get past the remaining structural problems and actually start DP> tuning it, but we've already made people wait way too long for a stable 2.4. DP> A more robust strategy makes a lot of sense right now. We can still play with DP> stronger magic in 2.5, and of course Rik's aging strategy will continue to be DP> developed in Alan's tree while Andrea's is still going through the test of DP> fire. DP> </musings> DP> I'll keep reading Andrea's code and maybe I'll be able to shed some more light DP> on the algorithms he's using, since he doesn't seem to be in a big hurry to DP> do that himself. 
(Hi Andrea ;-) DP> -- DP> Daniel DP> - DP> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in DP> the body of a message to majordomo@vger.kernel.org DP> More majordomo info at http://vger.kernel.org/majordomo-info.html DP> Please read the FAQ at http://www.tux.org/lkml/ -- Best regards, VDA mailto:VDA@port.imtp.ilyichevsk.odessa.ua [-- Attachment #2: Vm-dumb.zip --] [-- Type: application/x-zip-compressed, Size: 2006 bytes --] ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 9:36 ` Linux VM design VDA @ 2001-09-24 11:06 ` Dave Jones 2001-09-24 12:15 ` Kirill Ratkin 2001-09-24 13:29 ` Rik van Riel ` (4 subsequent siblings) 5 siblings, 1 reply; 76+ messages in thread From: Dave Jones @ 2001-09-24 11:06 UTC (permalink / raw) To: VDA; +Cc: Linux Kernel Mailing List On Mon, 24 Sep 2001, VDA wrote: > I'd like to understand Linux VM but there's not much in > Documentation/vm/* on the subject. I understand that with current > frantic development pace it is hard to maintain such docs. In case you're not aware of it, http://linux-mm.org/wiki/moin.cgi is starting to fill out with documentation/ideas/etc on VM strategies past, present and future. regards, Dave. -- | Dave Jones. http://www.suse.de/~davej | SuSE Labs ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 11:06 ` Dave Jones @ 2001-09-24 12:15 ` Kirill Ratkin 0 siblings, 0 replies; 76+ messages in thread From: Kirill Ratkin @ 2001-09-24 12:15 UTC (permalink / raw) To: Dave Jones, VDA; +Cc: Linux Kernel Mailing List --- Dave Jones <davej@suse.de> wrote: > On Mon, 24 Sep 2001, VDA wrote: > > > I'd like to understand Linux VM but there's not > much in > > Documentation/vm/* on the subject. I understand > that with current > > frantic development pace it is hard to maintain > such docs. > > In case you're not aware of it, > http://linux-mm.org/wiki/moin.cgi > is starting to fill out with documentation/ideas/etc > on VM strategies > past, present and future. > > regards, > > Dave. > > -- > | Dave Jones. http://www.suse.de/~davej > | SuSE Labs > And here: http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html > - > To unsubscribe from this list: send the line > "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at > http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ __________________________________________________ Do You Yahoo!? Get email alerts & NEW webcam video instant messaging with Yahoo! Messenger. http://im.yahoo.com ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 9:36 ` Linux VM design VDA 2001-09-24 11:06 ` Dave Jones @ 2001-09-24 13:29 ` Rik van Riel 2001-09-24 14:05 ` VDA 2001-09-24 18:37 ` Daniel Phillips ` (3 subsequent siblings) 5 siblings, 1 reply; 76+ messages in thread From: Rik van Riel @ 2001-09-24 13:29 UTC (permalink / raw) To: VDA; +Cc: Andrea Arcangeli, Alexander Viro, Daniel Phillips, linux-kernel On Mon, 24 Sep 2001, VDA wrote: > I'd like to understand Linux VM but there's not much in > Documentation/vm/* on the subject. http://linux-mm.org/ has some stuff and I wrote a freenix paper on the subject as well http://www.surriel.com/lectures/. > Since we reached some kind of stability with 2.4, maybe > Andrea, Rik and whoever else is considering himself VM geek > would tell us not-so-clever lkml readers how VM works and put it in > vm-2.4andrea, vm-2.4rik or whatever in Doc/vm/*, > I will be unbelievably happy. Matt Dillon's post belongs there too. http://linux-mm.org/ The only thing missing is an explanation of Andrea's VM, but knowing Andrea's enthusiasm at documentation I wouldn't really count on that any time soon ;) cheers, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 13:29 ` Rik van Riel @ 2001-09-24 14:05 ` VDA 2001-09-24 14:37 ` Rik van Riel 2001-09-24 14:42 ` Rik van Riel 0 siblings, 2 replies; 76+ messages in thread From: VDA @ 2001-09-24 14:05 UTC (permalink / raw) To: Rik van Riel; +Cc: linux-kernel Hello Rik, Monday, September 24, 2001, 4:29:46 PM, you wrote: >> Since we reached some kind of stability with 2.4, maybe >> Andrea, Rik and whoever else is considering himself VM geek >> would tell us not-so-clever lkml readers how VM works and put it in >> vm-2.4andrea, vm-2.4rik or whatever in Doc/vm/*, >> I will be unbelievably happy. Matt Dillon's post belongs there too. RvR> http://linux-mm.org/ I was there today. Good. Can this stuff be placed as Doc/vm/vm2.4rik to prevent it from being outdated in 2-3 months? Linus? Also I'd like to be enlightened why this: >Virtual Memory Management Policy >-------------------------------- >The basic principle of the Linux VM system is page aging. We've seen >that refill_inactive_scan() is invoked periodically to try to >deactivate pages, and that it ages pages down as it does so, >deactivating them when their age reaches 0. We've also seen that >swap_out() will age referenced page frames up while scanning process >memory maps. This is the fundamental mechanism for VM resource >balancing in Linux: pages are aged down at a more-or-less steady rate, >and deactivated when they become sufficiently old; but processes can >keep pages "young" by referencing them frequently. is better than plain simple LRU? We definitely need a VM FAQ to have these questions answered once per VM design, not once per week :-) RvR> The only thing missing is an explanation of Andrea's RvR> VM, but knowing Andrea's enthusiasm at documentation RvR> I wouldn't really count on that any time soon ;) :-) -- Best regards, VDA mailto:VDA@port.imtp.ilyichevsk.odessa.ua ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 14:05 ` VDA @ 2001-09-24 14:37 ` Rik van Riel 2001-09-24 14:42 ` Rik van Riel 1 sibling, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-24 14:37 UTC (permalink / raw) To: VDA; +Cc: linux-kernel On Mon, 24 Sep 2001, VDA wrote: > RvR> http://linux-mm.org/ > > I was there today. Good. Can this stuff be placed as > Doc/vm/vm2.4rik > to prevent it from being outdated in 2-3 months? Putting documents in the kernel tree has never worked as a means of keeping them up to date. Unless, of course, you're volunteering to keep them up to date ;) > Also I'd like to be enlightened why this: > > >Virtual Memory Management Policy > >-------------------------------- > >The basic principle of the Linux VM system is page aging. > is better than plain simple LRU? > > We definitely need a VM FAQ to have these questions answered once per VM > design, not once per week :-) > RvR> The only thing missing is an explanation of Andrea's > RvR> VM, but knowing Andrea's enthusiasm at documentation > RvR> I wouldn't really count on that any time soon ;) > > :-) > > -- > Best regards, VDA > mailto:VDA@port.imtp.ilyichevsk.odessa.ua > > Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 14:05 ` VDA 2001-09-24 14:37 ` Rik van Riel @ 2001-09-24 14:42 ` Rik van Riel 1 sibling, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-24 14:42 UTC (permalink / raw) To: VDA; +Cc: linux-kernel [grrrrr, the dog was sitting against my arm and I pressed the wrong key ;)] On Mon, 24 Sep 2001, VDA wrote: > >Virtual Memory Management Policy > >-------------------------------- > >The basic principle of the Linux VM system is page aging. > is better than plain simple LRU? All research I've seen indicates that it's better to take frequency into account as well, instead of only access recency. Plain LRU just breaks down under sequential IO; LRU with a large enough inactive list should hold up decently under streaming IO; but only a replacement strategy which takes access frequency into account too will be able to make proper decisions as to which pages to keep in memory and which pages to throw out. Note that it's not me making this up, it's simply the info I've seen everywhere ... I don't like reinventing the wheel ;) > We definitely need a VM FAQ to have these questions answered once per VM > design, not once per week :-) Go ahead, make one on the Linux-MM wiki: http://linux-mm.org/wiki/ (note that for some reason the thing gives an internal server error once in a while ... I haven't yet been able to find a pattern to it, so it's not fixed yet) regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
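The recency-plus-frequency policy discussed above can be made concrete with a toy model. The sketch below is purely illustrative: the names (toy_page, scan_page) and constants are invented for this example and are not the 2.4 kernel's actual code. Referenced pages gain age (accumulating credit for frequent use, capped so one hot burst can't pin a page forever), unreferenced pages decay exponentially on every scan, and a page only becomes a reclaim candidate once its age reaches zero.

```c
/* Toy model of the page-aging policy discussed above: referenced pages
 * age up, unreferenced pages age down on every scan, and a page becomes
 * a reclaim candidate when its age hits zero.  Constants are illustrative. */

#define PAGE_AGE_START 2   /* age credit for a freshly touched page */
#define PAGE_AGE_MAX   64  /* cap so one hot burst can't pin a page forever */

struct toy_page {
    int age;        /* replacement priority; 0 means reclaimable */
    int referenced; /* stand-in for the hardware referenced bit */
};

/* Called when the referenced bit is found set during a scan. */
static void age_page_up(struct toy_page *p)
{
    p->age += PAGE_AGE_START;
    if (p->age > PAGE_AGE_MAX)
        p->age = PAGE_AGE_MAX;
}

/* Called for every page on each pass of the scanner.  Returns 1 when
 * the page has aged down to zero and may be deactivated. */
static int scan_page(struct toy_page *p)
{
    if (p->referenced) {
        p->referenced = 0;
        age_page_up(p);
    } else if (p->age > 0) {
        p->age /= 2;  /* exponential decay: forget old popularity quickly */
    }
    return p->age == 0;
}
```

With this model a page touched on every scan keeps a positive age indefinitely, while a page touched once drifts to zero within a few scans: exactly the frequency information plain LRU discards.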
* Re: Linux VM design 2001-09-24 9:36 ` Linux VM design VDA 2001-09-24 11:06 ` Dave Jones 2001-09-24 13:29 ` Rik van Riel @ 2001-09-24 18:37 ` Daniel Phillips 2001-09-24 19:32 ` Rik van Riel 2001-09-25 16:03 ` bill davidsen 2001-09-24 18:46 ` Jonathan Morton ` (2 subsequent siblings) 5 siblings, 2 replies; 76+ messages in thread From: Daniel Phillips @ 2001-09-24 18:37 UTC (permalink / raw) To: VDA, Andrea Arcangeli, Rik van Riel, Alexander Viro; +Cc: linux-kernel On September 24, 2001 11:36 am, VDA wrote: > Daniel Phillips <phillips@bonn-fries.net> wrote: > DP> The arguments in support of aging over LRU that I'm aware of are: > > DP> - incrementing an age is more efficient than resetting several LRU > DP> list links > DP> - also captures some frequency-of-use information > > Of what use can this info be? If one page is accessed 100 times/second > and the other one once in 10 seconds, they both have to stay in RAM. > The VM should take 'time since last access' into account when deciding > which page to swap out, not how often it was referenced. You might want to have a look at this: http://archi.snu.ac.kr/jhkim/seminar/96-004.ps (the LRFU algorithm) To tell the truth, I don't really see why the frequency information is all that useful either. Rik suggested it's good for streaming IO but we already have effective means of dealing with that that don't rely on any frequency information. So the list of reasons why aging is good is looking really short. -- Daniel ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 18:37 ` Daniel Phillips @ 2001-09-24 19:32 ` Rik van Riel 2001-09-24 17:27 ` Rob Landley 2001-09-25 9:58 ` Daniel Phillips 2001-09-25 16:03 ` bill davidsen 1 sibling, 2 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-24 19:32 UTC (permalink / raw) To: Daniel Phillips; +Cc: VDA, Andrea Arcangeli, Alexander Viro, linux-kernel On Mon, 24 Sep 2001, Daniel Phillips wrote: > To tell the truth, I don't really see why the frequency > information is all that useful either. > So the list of reasons why aging is good is looking really short. Ummmm, that _you_ can't see it doesn't mean suddenly all VM research from the last 15 years has been invalidated. cheers, Rik -- IA64: a worthy successor to the i860. http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 19:32 ` Rik van Riel @ 2001-09-24 17:27 ` Rob Landley 2001-09-24 21:48 ` Rik van Riel 2001-09-25 9:58 ` Daniel Phillips 1 sibling, 1 reply; 76+ messages in thread From: Rob Landley @ 2001-09-24 17:27 UTC (permalink / raw) To: Rik van Riel, Daniel Phillips Cc: VDA, Andrea Arcangeli, Alexander Viro, linux-kernel On Monday 24 September 2001 15:32, Rik van Riel wrote: > On Mon, 24 Sep 2001, Daniel Phillips wrote: > > To tell the truth, I don't really see why the frequency > > information is all that useful either. > > > > So the list of reasons why aging is good is looking really short. > > Ummmm, that _you_ can't see it doesn't mean suddenly all > VM research from the last 15 years has been invalidated. Out of morbid curiosity, how much of that research either said or assumed that microkernels were a good idea? Rob ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 17:27 ` Rob Landley @ 2001-09-24 21:48 ` Rik van Riel 0 siblings, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-24 21:48 UTC (permalink / raw) To: Rob Landley Cc: Daniel Phillips, VDA, Andrea Arcangeli, Alexander Viro, linux-kernel On Mon, 24 Sep 2001, Rob Landley wrote: > Out of morbid curiosity, how much of that research either said > or assumed that microkernels were a good idea? *grin* None that I can remember even dealt with this. The page replacement research I've read was both generic OS and database page replacement, maybe 50 to 100 papers total... cheers, Rik -- IA64: a worthy successor to the i860. http://www.surriel.com/ http://www.conectiva.com/ http://distro.conectiva.com/ ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 19:32 ` Rik van Riel 2001-09-24 17:27 ` Rob Landley @ 2001-09-25 9:58 ` Daniel Phillips 1 sibling, 0 replies; 76+ messages in thread From: Daniel Phillips @ 2001-09-25 9:58 UTC (permalink / raw) To: Rik van Riel; +Cc: VDA, Andrea Arcangeli, Alexander Viro, linux-kernel On September 24, 2001 09:32 pm, Rik van Riel wrote: > On Mon, 24 Sep 2001, Daniel Phillips wrote: > > To tell the truth, I don't really see why the frequency > > information is all that useful either. > > > So the list of reasons why aging is good is looking really short. > > Ummmm, that _you_ can't see it doesn't mean suddenly all > VM research from the last 15 years has been invalidated. Did you have some more reasons to add to the list? -- Daniel ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 18:37 ` Daniel Phillips 2001-09-24 19:32 ` Rik van Riel @ 2001-09-25 16:03 ` bill davidsen 1 sibling, 0 replies; 76+ messages in thread From: bill davidsen @ 2001-09-25 16:03 UTC (permalink / raw) To: linux-kernel In article <20010924182948Z16175-2757+1593@humbolt.nl.linux.org> phillips@bonn-fries.net wrote: | You might want to have a look at this: | | http://archi.snu.ac.kr/jhkim/seminar/96-004.ps | (the LRFU algorithm) | | To tell the truth, I don't really see why the frequency information is all | that useful either. Rik suggested it's good for streaming IO but we already | have effective means of dealing with that that don't rely on any frequency | information. One count which may actually be useful is how many times the page has been swapped in (after being swapped out), as a predictor that it will be a good page to keep. The problem is that there are many things which help, and I don't think we have the balance quite right yet. I suspect that there needs to be some hysteresis and runtime tuning over seconds to get optimal performance. Of course systems with really odd loads will still need to have hand tuning, and the /proc/sys interface should include sensible ways to do this. | So the list of reasons why aging is good is looking really short. The primary reason on my list is that under some load conditions it produces much better response. Note that I didn't say all conditions, before you rush to disagree with me. Sometimes people will trade a little steady state performance to avoid a really bad worst case. How the problem is solved really isn't the issue, but responsiveness is important. Right now it seems some people are reporting that their loads work better with aging. -- bill davidsen <davidsen@tmr.com> "If I were a diplomat, in the best case I'd go hungry. In the worst case, people would die." -- Robert Lipe ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 9:36 ` Linux VM design VDA ` (2 preceding siblings ...) 2001-09-24 18:37 ` Daniel Phillips @ 2001-09-24 18:46 ` Jonathan Morton 2001-09-24 19:16 ` Daniel Phillips 2001-09-24 19:11 ` Dan Mann 2001-09-25 10:55 ` VDA 5 siblings, 1 reply; 76+ messages in thread From: Jonathan Morton @ 2001-09-24 18:46 UTC (permalink / raw) To: Daniel Phillips, VDA, Andrea Arcangeli, Rik van Riel, Alexander Viro Cc: linux-kernel > > DP> The arguments in support of aging over LRU that I'm aware of are: >> >> DP> - incrementing an age is more efficient than resetting several LRU >> DP> list links >> DP> - also captures some frequency-of-use information >> >> Of what use can this info be? If one page is accessed 100 times/second >> and the other one once in 10 seconds, they both have to stay in RAM. >> The VM should take 'time since last access' into account when deciding >> which page to swap out, not how often it was referenced. > >You might want to have a look at this: > > http://archi.snu.ac.kr/jhkim/seminar/96-004.ps > (the LRFU algorithm) > >To tell the truth, I don't really see why the frequency information is all >that useful either. Rik suggested it's good for streaming IO but we already >have effective means of dealing with that that don't rely on any frequency >information. > >So the list of reasons why aging is good is looking really short. It's not really frequency information. If a page is accessed 1000 times during a single schedule cycle, that will count as a single increment in the age come the time. However, *macro* frequency information of this type *is* useful in the case where thrashing is taking place. You want to swap out the page that is accessed only once every other schedule cycle, before the one accessed every cycle. This is of course moot if one process is being suspended (as it probably should be), but the criteria for suspension might include this access information. 
-- -------------------------------------------------------------- from: Jonathan "Chromatix" Morton mail: chromi@cyberspace.org (not for attachments) website: http://www.chromatix.uklinux.net/vnc/ geekcode: GCS$/E dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r++ y+(*) tagline: The key to knowledge is not to rely on people to teach you it. ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 18:46 ` Jonathan Morton @ 2001-09-24 19:16 ` Daniel Phillips 0 siblings, 0 replies; 76+ messages in thread From: Daniel Phillips @ 2001-09-24 19:16 UTC (permalink / raw) To: Jonathan Morton, VDA, Andrea Arcangeli, Rik van Riel, Alexander Viro Cc: linux-kernel On September 24, 2001 08:46 pm, Jonathan Morton wrote: > > > DP> The arguments in support of aging over LRU that I'm aware of are: > >> > >> DP> - incrementing an age is more efficient than resetting several LRU > >> DP> list links > >> DP> - also captures some frequency-of-use information > >> > >> Of what use can this info be? If one page is accessed 100 times/second > >> and the other one once in 10 seconds, they both have to stay in RAM. > >> The VM should take 'time since last access' into account when deciding > >> which page to swap out, not how often it was referenced. > > > >You might want to have a look at this: > > > > http://archi.snu.ac.kr/jhkim/seminar/96-004.ps > > (the LRFU algorithm) > > > >To tell the truth, I don't really see why the frequency information is all > >that useful either. Rik suggested it's good for streaming IO but we already > >have effective means of dealing with that that don't rely on any frequency > >information. > > > >So the list of reasons why aging is good is looking really short. > > It's not really frequency information. If a page is accessed 1000 > times during a single schedule cycle, that will count as a single > increment in the age come the time. However, *macro* frequency > information of this type *is* useful in the case where thrashing is > taking place. You want to swap out the page that is accessed only > once every other schedule cycle, before the one accessed every cycle. But this happens naturally with LRU. Think how it works: to get evicted a page has to progress all the way from the head to the tail of the LRU list. 
Any page that's accessed frequently is going to keep being put back at the head of the list, and only infrequently accessed pages will drop off the tail. > This is of course moot if one process is being suspended (as it > probably should), but the criteria for suspension might include this > access information. OK, that does get at something you can do with aging that you can't do with an LRU list: look at the weightings of random pages. You can't do that with the LRU list because there's no efficient way to determine which position a page holds in the list. One application where you would want to know the weightings of random pages is defragmentation. That might become important in the future but we're not doing it now. A little contemplation will probably turn up other uses for this special property. -- Daniel ^ permalink raw reply [flat|nested] 76+ messages in thread
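The LRU behaviour described above can be sketched in a few lines. This is a minimal illustration only, using a plain doubly linked list with invented names (lru_touch, lru_evict), not the kernel's actual active/inactive lists: every access moves a page back to the head, so only pages that stay untouched long enough to drift all the way to the tail get evicted.

```c
/* Minimal sketch of an LRU list: an access puts the page back at the
 * head, eviction always takes from the tail, so a frequently touched
 * page never reaches the eviction end.  Illustrative only. */

#include <stddef.h>

struct lru_page {
    struct lru_page *prev, *next;
    int id;
};

struct lru_list {
    struct lru_page *head, *tail;  /* head = most recent, tail = eviction end */
};

static void lru_unlink(struct lru_list *l, struct lru_page *p)
{
    if (p->prev) p->prev->next = p->next; else l->head = p->next;
    if (p->next) p->next->prev = p->prev; else l->tail = p->prev;
    p->prev = p->next = NULL;
}

static void lru_push_head(struct lru_list *l, struct lru_page *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head) l->head->prev = p; else l->tail = p;
    l->head = p;
}

/* An access "rejuvenates" the page: back to the head it goes. */
static void lru_touch(struct lru_list *l, struct lru_page *p)
{
    lru_unlink(l, p);
    lru_push_head(l, p);
}

/* Eviction always takes the least recently used page, from the tail. */
static struct lru_page *lru_evict(struct lru_list *l)
{
    struct lru_page *victim = l->tail;
    if (victim)
        lru_unlink(l, victim);
    return victim;
}
```

Note what this sketch does not capture: it remembers only the order of last accesses, so a page touched a thousand times and a page touched once, both just now, are indistinguishable; which is precisely the frequency question being argued in this subthread.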
* Re: Linux VM design 2001-09-24 9:36 ` Linux VM design VDA ` (3 preceding siblings ...) 2001-09-24 18:46 ` Jonathan Morton @ 2001-09-24 19:11 ` Dan Mann 2001-09-25 10:55 ` VDA 5 siblings, 0 replies; 76+ messages in thread From: Dan Mann @ 2001-09-24 19:11 UTC (permalink / raw) To: VDA; +Cc: linux-kernel I hope this isn't the wrong place to ask this, but wouldn't it be better to increase RAM size and decrease swap size as memory requirements grow? For instance, say I have a lightly loaded machine that has 192MB of RAM. From everything I've heard in the past, I'd use roughly 192MB of swap with this machine. The problem, I would imagine, is that if all 192MB got used, wouldn't it be terribly slow to read/write that much data back in? Would less swap, say 32MB, make the kernel more restrictive with its available memory and make the box more responsive when it's heavily using swap? Or am I way off and just smoking crack? (which I may very well be) This damn mailing list is addictive. Now I read it at work. Dan ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: Linux VM design 2001-09-24 9:36 ` Linux VM design VDA ` (4 preceding siblings ...) 2001-09-24 19:11 ` Dan Mann @ 2001-09-25 10:55 ` VDA 5 siblings, 0 replies; 76+ messages in thread From: VDA @ 2001-09-25 10:55 UTC (permalink / raw) To: Dan Mann; +Cc: linux-kernel Hello Dan, Monday, September 24, 2001, 10:11:08 PM, you wrote: DM> I hope this isn't the wrong place to ask this, but wouldn't it be better to DM> increase RAM size and decrease swap size as memory requirements grow? For DM> instance, say I have a lightly loaded machine that has 192MB of RAM. From DM> everything I've heard in the past, I'd use roughly 192MB of swap with this DM> machine. The problem, I would imagine, is that if all 192MB got used, wouldn't DM> it be terribly slow to read/write that much data back in? Would less swap, DM> say 32MB, make the kernel more restrictive with its available memory and DM> make the box more responsive when it's heavily using swap? If you want everything to be fast, buy more RAM and use no swap whatsoever. Swap is useful if your total memory requirements are big but your working set is significantly smaller. You need RAM to cover the working set and RAM+swap to cover the total memory requirements. As you can see, the right amounts of RAM and swap are thus *application dependent*. -- Best regards, VDA mailto:VDA@port.imtp.ilyichevsk.odessa.ua ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-16 16:33 ` Rik van Riel 2001-09-16 16:50 ` Andreas Steinmetz 2001-09-16 17:06 ` Ricardo Galli @ 2001-09-16 18:16 ` Stephan von Krawczynski 2001-09-16 19:43 ` Linus Torvalds 2001-09-17 8:06 ` Eric W. Biederman 4 siblings, 0 replies; 76+ messages in thread From: Stephan von Krawczynski @ 2001-09-16 18:16 UTC (permalink / raw) To: Ricardo Galli; +Cc: linux-kernel On Sun, 16 Sep 2001 19:06:45 +0200 (MET) Ricardo Galli <gallir@m3d.uib.es> wrote: > On Sun, 16 Sep 2001, Jeremy Zawodny wrote: > > > > Agreed. It'd be great if there was an option to say "Don't swap out > > memory that was allocated by these programs. If you run out of disk > > buffers, toss the oldest ones and start re-using them." > > Easier though (for cases like listening to mp3's and backups): cache pages > that were accessed only "once"(*) several seconds ago must be discarded > first. It only implies a check against an access counter and a "last > accessed" epoch field of the page. Well, I guess this is everybody's first idea about the problem: make an initial timestamp for knowing how _old_ an allocation really is, and make an access counter. Ok. The first is easy, but how do you achieve an access counter? If this were solved, the problem would be solved. You can do really nice aging with access to such kind of information. Any ideas? Regards, Stephan ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-16 16:33 ` Rik van Riel ` (2 preceding siblings ...) 2001-09-16 18:16 ` broken VM in 2.4.10-pre9 Stephan von Krawczynski @ 2001-09-16 19:43 ` Linus Torvalds 2001-09-16 19:57 ` Rik van Riel ` (4 more replies) 2001-09-17 8:06 ` Eric W. Biederman 4 siblings, 5 replies; 76+ messages in thread From: Linus Torvalds @ 2001-09-16 19:43 UTC (permalink / raw) To: linux-kernel In article <Pine.LNX.4.33L.0109161330000.9536-100000@imladris.rielhome.conectiva>, Rik van Riel <riel@conectiva.com.br> wrote: >On 16 Sep 2001, Michael Rothwell wrote: > >> Is there a way to tell the VM to prune its cache? Or a way to limit >> the amount of cache it uses? > >Not yet, I'll make a quick hack for this when I get back next >week. It's pretty obvious now that the 2.4 kernel cannot get >enough information to select the right pages to evict from >memory. Don't be stupid. The described behaviour has nothing to do with limiting the cache or anything else "cannot get enough information", except for the fact that the kernel obviously cannot know what will happen in the future. The kernel _correctly_ swapped out tons of pages that weren't touched in a long long time. That's what you want to happen - the fact that they then all became active on logout is sad. The fact that the "use-once" logic didn't kick in is the problem. It's hard to tell _why_ it didn't kick in, possibly because the MP3 player read small chunks of the pages (touching them multiple times). THAT is worth looking into. But blathering about "reverse mappings will help this" is just incredibly stupid. You seem to think that they are a panacea for all problems, ranging from MP3 playback to world peace and re-building the WTC. Linus ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-16 19:43 ` Linus Torvalds @ 2001-09-16 19:57 ` Rik van Riel 2001-09-16 20:17 ` Rik van Riel ` (3 subsequent siblings) 4 siblings, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-16 19:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Sun, 16 Sep 2001, Linus Torvalds wrote: > The described behaviour has nothing to do with limiting the cache or > anything else "cannot get enough information", except for the fact that > the kernel obviously cannot know what will happen in the future. > > The kernel _correctly_ swapped out tons of pages that weren't touched in > a long long time. That's what you want to happen - the fact that they > then all became active on logout is sad. The problem is that too large a cache reliably makes the system unsuitable for interactive use. In that case it's probably worth it to make evicting pages from that cache more likely than evicting pages from user processes, while still giving the truly busy cache pages a chance to stay resident. regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-16 19:43 ` Linus Torvalds 2001-09-16 19:57 ` Rik van Riel @ 2001-09-16 20:17 ` Rik van Riel 2001-09-16 20:29 ` Andreas Steinmetz ` (2 subsequent siblings) 4 siblings, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-16 20:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Sun, 16 Sep 2001, Linus Torvalds wrote: > The fact that the "use-once" logic didn't kick in is the problem. It's > hard to tell _why_ it didn't kick in, possibly because the MP3 player > read small chunks of the pages (touching them multiple times). It's probably because it used mmap(); all mp3 players seem to do that. If they also use MADV_SEQUENTIAL, I guess it'd be easy to also do drop_behind on them, though... > THAT is worth looking into. But blathering about "reverse mappings > will help this" is just incredibly stupid. You seem to think that they > are a panacea for all problems, Absolutely not; all reverse mappings give us is an easier framework to get the other decisions right. By implementing _just_ the reverse mappings and leaving the other stuff the same I've already found my desktop to be more usable. This seems to be because reverse mappings allow us to get page aging right: we see all the referenced bits on a page. If you think we can do this without reverse mappings I only have to point at linux 1.2, 2.0, 2.2 and 2.4 as a suggestion to the contrary. If it were possible, surely we would have succeeded in 7 years of trying? Add to that the fact that reverse mappings make it trivial to do things like defragmenting memory a bit (to make sure fork() succeeds or sparc64 users can allocate page tables), keeping the page tables mapped until the page is cleaned in page_launder() (reducing soft page faults), or doing a physical page scan to deal with gross imbalance between memory zones, and we'll have something which, IMHO, is worth experimenting with. 
Sure, reverse mappings also have disadvantages, like one pointer extra in the page_struct or as much as 2 pointers per mapping for shared pages and a slight complication of the locking, but I'm not convinced that these disadvantages are so severe we should continue the VM the same way we failed to make it work right the last 7 years. regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
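The reverse-mapping idea argued for above boils down to each physical page carrying a chain of the ptes that map it. The following is an invented sketch of that data structure; the names (rmap_page, pte_chain, page_referenced) and bit layout are illustrative, not the interfaces of the actual -rmap patch. It shows how a scanner can collect and clear the referenced bits from all mappings of a shared page in one pass, which is exactly what a purely virtual scan cannot do. Note the cost mentioned above falling out of the structure: one extra pointer in the page, and roughly two pointers per mapping in the chain.

```c
/* Hypothetical sketch of reverse mappings: each physical page keeps a
 * chain of the ptes that map it, so the scanner can gather the
 * referenced bits from *all* mappings of a shared page at once.
 * Names and bit layout are invented for illustration. */

#include <stddef.h>

#define PTE_REFERENCED 0x1UL  /* stand-in for the hardware accessed bit */

typedef unsigned long pte_t;

/* One entry in a page's reverse-mapping chain: the pte pointer plus the
 * chain link -- the "2 pointers per mapping" overhead noted above. */
struct pte_chain {
    pte_t *pte;
    struct pte_chain *next;
};

struct rmap_page {
    struct pte_chain *chain;  /* the one extra pointer in the page struct */
};

/* Test and clear the referenced bit across every mapping of the page.
 * Without reverse mappings the scanner only sees the bit in whichever
 * address space it happens to be scanning at the moment. */
static int page_referenced(struct rmap_page *page)
{
    int referenced = 0;
    struct pte_chain *pc;

    for (pc = page->chain; pc; pc = pc->next) {
        if (*pc->pte & PTE_REFERENCED) {
            *pc->pte &= ~PTE_REFERENCED;
            referenced = 1;
        }
    }
    return referenced;
}
```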
* Re: broken VM in 2.4.10-pre9
2001-09-16 19:43 ` Linus Torvalds
2001-09-16 19:57 ` Rik van Riel
2001-09-16 20:17 ` Rik van Riel
@ 2001-09-16 20:29 ` Andreas Steinmetz
2001-09-16 21:28 ` Linus Torvalds
2001-09-17 0:37 ` Daniel Phillips
2001-09-21 3:10 ` Bill Davidsen
4 siblings, 1 reply; 76+ messages in thread
From: Andreas Steinmetz @ 2001-09-16 20:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times).

Then you should keep an eye on mmap(). aide uses it, and it causes a
real problem. The basic logic is:

    open(file, O_RDONLY);
    mmap(whole-file, PROT_READ, MAP_SHARED);
    <do md5sum of mapped file>
    munmap();
    close();

Whatever you call the thing above (not my code, anyway): I strongly
feel that the use-once logic isn't a great idea. What if lots of pages
get accessed twice? Where do you set the limit?

How about adding a swapout cost factor? This would prevent swapping
until pressure is really high, without any fixed limits. Clean page
reuse costs on the order of microseconds, whereas a swapout followed by
a swapin is going to be milliseconds. That's a factor of at least 1000
which needs to be applied in page selection.

Andreas Steinmetz
D.O.M. Datenverarbeitung GmbH

^ permalink raw reply [flat|nested] 76+ messages in thread
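[Editorial note: the access pattern Andreas describes can be fleshed out into a runnable sketch. A trivial byte sum stands in for the md5 step (the real aide code is not shown here); what matters for the VM discussion is the access pattern: every page of the mapping is touched exactly once, front to back.]

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only and fold every byte into a checksum.
 * Each page of the mapping is faulted in once, sequentially - exactly
 * the kind of streaming access the use-once logic is supposed to catch. */
static int checksum_file(const char *path, unsigned long *sum)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    unsigned long s = 0;
    for (off_t i = 0; i < st.st_size; i++)
        s += map[i];            /* sequential, read-once access */
    *sum = s;

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```
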
* Re: broken VM in 2.4.10-pre9
2001-09-16 20:29 ` Andreas Steinmetz
@ 2001-09-16 21:28 ` Linus Torvalds
2001-09-16 22:47 ` Alex Bligh - linux-kernel
2001-09-16 22:59 ` Stephan von Krawczynski
0 siblings, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-16 21:28 UTC (permalink / raw)
To: Andreas Steinmetz; +Cc: linux-kernel

On Sun, 16 Sep 2001, Andreas Steinmetz wrote:
>
> > The fact that the "use-once" logic didn't kick in is the problem. It's
> > hard to tell _why_ it didn't kick in, possibly because the MP3 player
> > read small chunks of the pages (touching them multiple times).
>
> Then you should have an eye on mmap(). aide uses it. And it causes a real
> problem. The basic logic is here:
>
> open(file,O_RDONLY);
> mmap(whole-file,PROT_READ,MAP_SHARED);
> <do md5sum of mapped file>
> munmap();
> close();

Okey-dokey.

I actually started looking at the current Linux page referenced logic,
and it just looks incredibly broken. There's no logic to it, and it's
obvious how some of it has grown over time without people really
understanding or caring about the referenced bit.

It looks very much like part of the VM was done with only "page->age",
and another part was done with the reference bit. So some users will
totally ignore the information that other users use and update. It's
not pretty.

> No matter how you call the thing above (not my code, anyway): I strongly feel
> that the use once logic isn't a great idea. What if lots of pages get accessed
> twice? Where to set the limit?

Actually, the once-used approach _should_ work fine for mmap'ed pages
too, but the fact is that the code didn't even try, partly because the
mmap code was the code that used page->age and didn't care about the
referenced bit at all (except it _did_ care about the referenced bit in
the page tables: just not the bit in "struct page". And it's the latter
bit that actually ends up being the best once-used logic).

> How about adding a swapout cost factor?
> This would prevent swapping until
> pressure is really high without any fixed limits. Calculate clean page reuse in
> microseconds whereas swapout followed by swapin is going to be milliseconds.
> That's a factor of at least 1000 which needs to be applied in page selection.

Well, the thing is, swap-out is often cheaper than read-in, and just
dropping a page is often the cheapest of all. And all of these things
are a bit intertwined.

I actually have a _sane_ generic "used-once" approach that works with
mmap'ed memory and with other kinds too, and right now it doesn't
bother with "page->age" _at_all_. Instead, the aging is done by moving
things from one list to another, which actually seems to be better, but
who knows.

And that automatically gets used-once right - any pages are always
added to the inactive lists, and get bumped up to active only after
they are physically referenced the second time. This is actually
incredibly trivial to do without any aging at all:

	void mark_page_accessed(struct page *page)
	{
		if (!PageActive(page) && PageReferenced(page)) {
			activate_page(page);
			ClearPageReferenced(page);
			return;
		}
		/* Mark the page referenced, AFTER checking for previous usage.. */
		SetPageReferenced(page);
	}

and the other important part that we got (completely) wrong wrt the
use-once logic is the fact that when we scan the inactive lists and
find a page that is marked "referenced", we should NOT move it to the
active list (that defeats the whole point of use-once), but we should
instead just clear the reference bit and move it to the head of the
right inactive list.

So it actually looks like the use-once logic only worked under some
very specific circumstances, not in general.

Anybody willing to test the simple used-once cleanups? No guarantees,
but at least they make sense (some of the old interactions certainly do
not).
(The new code is a simple state machine:

 - touch non-referenced page: set the reference bit

 - touch already referenced page: move it to next list "upwards" (ie
   the active list)

 - age a non-referenced page on a list: move to "next" list downwards
   (ie free if already inactive, move to inactive if currently active)

 - age a referenced page on a list: clear ref bit and move to beginning
   of same list.

which works fine for mmap pages too. I left the age updates, because
the page age may well make sense within the active list).

I'll make a 2.4.10pre10.

Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
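[Editorial note: the four transitions above can be modeled in a few lines of user-space C. This is a sketch of the state machine as described, with made-up names (`mpage`, `touch`, `age`), not the 2.4.10-pre10 code itself.]

```c
#include <assert.h>

/* Which list the page currently lives on. */
enum list { LIST_FREE, LIST_INACTIVE, LIST_ACTIVE };

struct mpage {
    enum list list;
    int referenced;
};

/* touch: a reference from the fault/read path. */
static void touch(struct mpage *p)
{
    if (p->list != LIST_ACTIVE && p->referenced) {
        p->list = LIST_ACTIVE;      /* second touch: promote "upwards" */
        p->referenced = 0;
        return;
    }
    p->referenced = 1;              /* first touch: just set the bit */
}

/* age: one pass of the scanner over this page. */
static void age(struct mpage *p)
{
    if (p->referenced) {
        p->referenced = 0;          /* clear bit, stay on the same list */
        return;
    }
    if (p->list == LIST_ACTIVE)
        p->list = LIST_INACTIVE;    /* demote "downwards" */
    else if (p->list == LIST_INACTIVE)
        p->list = LIST_FREE;        /* already inactive: reclaim */
}
```

A streamed (use-once) page is touched once, so it sits on the inactive list with its reference bit set and is gone after two aging passes; only a second touch before the scanner gets there promotes it to the active list.
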
* Re: broken VM in 2.4.10-pre9
2001-09-16 21:28 ` Linus Torvalds
@ 2001-09-16 22:47 ` Alex Bligh - linux-kernel
2001-09-16 22:55 ` Linus Torvalds
2001-09-16 22:59 ` Stephan von Krawczynski
1 sibling, 1 reply; 76+ messages in thread
From: Alex Bligh - linux-kernel @ 2001-09-16 22:47 UTC (permalink / raw)
To: Linus Torvalds, Andreas Steinmetz; +Cc: linux-kernel, Alex Bligh - linux-kernel

--On Sunday, 16 September, 2001 2:28 PM -0700 Linus Torvalds
<torvalds@transmeta.com> wrote:

> - age a non-referenced page on a list: move to "next" list downwards (ie
> free if already inactive, move to inactive if currently active)

Do you still make the distinction between Inactive Clean and Inactive
Dirty (& just move to the appropriate list)?

Effectively this is just a 'binary' aging function (OK, position on the
list matters too). Others on the list have observed that page->age
performs in a binary manner anyhow with exponential aging.

How do you balance between Inactive Clean and Inactive Dirty, and avoid
evicting many (infrequently used) code pages at the expense of many
(historic, even less frequently used) dirty data pages? Or don't we
care?

--
Alex Bligh

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-16 22:47 ` Alex Bligh - linux-kernel
@ 2001-09-16 22:55 ` Linus Torvalds
0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-16 22:55 UTC (permalink / raw)
To: Alex Bligh - linux-kernel; +Cc: Andreas Steinmetz, linux-kernel

On Sun, 16 Sep 2001, Alex Bligh - linux-kernel wrote:
>
> > - age a non-referenced page on a list: move to "next" list downwards (ie
> > free if already inactive, move to inactive if currently active)
>
> Do you still make the distinction between Inactive Clean
> and Inactive Dirty (& just move to appropriate list)?

That part of the code doesn't.

> Effectively this is just a 'binary' aging function (OK position
> on the list matters too). Others on the list have observed
> page->age performs in a binary manner anyhow with exponential
> aging.

Right. I'm not saying that this is anything _exciting_ - I'm just
saying that the old code did not have any clear behaviour at all. It
would inappropriately raise a page from the inactive lists to the
active list even if the code that actually _touched_ the page had
decided that the page was not active. And the behaviour of the
referenced bit once on the active list was unclear.

I personally think that clearing the reference bit when moving to the
active list is the right thing to do (so that it is marked
"unimportant" on the active list and needs a _third_ access to be
marked important), but this is an example of one of the
tweaks/decisions we should clearly make instead of leaving the
behaviour undefined (which it was before).

> How do you balance between Inactive Clean before Inactive Dirty
> and avoid evicting many (infrequently used) code pages at
> the expense of many (historic, even less frequently used) dirty
> data pages? Or don't we care?

We probably _do_ care. I suspect that if there are balancing problems,
they could easily be in the reclaim_page() vs page_launder() balance
(ie aging of inactive_clean vs inactive_dirty).
Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-16 21:28 ` Linus Torvalds
2001-09-16 22:47 ` Alex Bligh - linux-kernel
@ 2001-09-16 22:59 ` Stephan von Krawczynski
2001-09-16 22:14 ` Linus Torvalds
2001-09-17 15:35 ` Stephan von Krawczynski
1 sibling, 2 replies; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 22:59 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andreas Steinmetz, linux-kernel

> [...]
> Anybody willing to test the simple used-once cleanups? No guarantees, but
> at least they make sense (some of the old interactions certainly do not).

Very willing. Just send it to me, please.

> (The new code is a simple state machine:
>
> - touch non-referenced page: set the reference bit
>
> - touch already referenced page: move it to next list "upwards" (ie the
> active list)
>
> - age a non-referenced page on a list: move to "next" list downwards (ie
> free if already inactive, move to inactive if currently active)
>
> - age a referenced page on a list: clear ref bit and move to beginning of
> same list.

Are you sure about the _beginning_? You are aging out _all_ non-ref
pages in the next step?

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-16 22:59 ` Stephan von Krawczynski
@ 2001-09-16 22:14 ` Linus Torvalds
2001-09-16 23:29 ` Stephan von Krawczynski
2001-09-17 15:35 ` Stephan von Krawczynski
1 sibling, 1 reply; 76+ messages in thread
From: Linus Torvalds @ 2001-09-16 22:14 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: Andreas Steinmetz, linux-kernel

On Sun, 16 Sep 2001, Stephan von Krawczynski wrote:
> > [...]
> > Anybody willing to test the simple used-once cleanups? No guarantees, but
> > at least they make sense (some of the old interactions certainly do not).
>
> Very willing. Just send it to me, please.

It's there as 2.4.10pre10, on ftp.kernel.org under "testing" now.

However, note that it hasn't gotten any "tweaking", ie there's none of
the small changes that aging differences usually tend to need. I'm
hoping that's ok, as the new behaviour shouldn't be that different from
the old behaviour in most cases, and that the biggest differences
_should_ be just proper once-use things.

But it would be interesting to hear which loads show markedly
worse/better behaviour. If any.

> > - age a referenced page on a list: clear ref bit and move to beginning of
> > same list.
>
> Are you sure about the _beginning_? You are aging out _all_ non-ref
> pages in the next step?

Well, it depends on what your definition of "is" is..

Or rather, what the "beginning" is. The way things work now, is that
all pages are added to the "beginning", and the aging is done from the
end, moving pages at the end to other lists (or, in the case of a
referenced page, back to the beginning).

You could, of course, define the list to be done the other way around.
It won't make any actual behavioural difference, unless there are bugs
due to confusion about which end is "new" and which is "old". Which
there might well be, of course.

Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
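[Editorial note: the list mechanics Linus describes - pages added at the head, the scanner working from the tail, referenced pages rotating back to the head with the bit cleared - can be sketched as a minimal doubly-linked LRU model. Names are illustrative, not kernel code.]

```c
#include <assert.h>
#include <stddef.h>

struct lpage {
    struct lpage *prev, *next;
    int referenced;
};

struct lru {
    struct lpage *head, *tail;
};

/* New and rotated pages always enter at the head ("beginning"). */
static void lru_add_head(struct lru *l, struct lpage *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head)
        l->head->prev = p;
    else
        l->tail = p;
    l->head = p;
}

static struct lpage *lru_del_tail(struct lru *l)
{
    struct lpage *p = l->tail;
    if (!p)
        return NULL;
    l->tail = p->prev;
    if (l->tail)
        l->tail->next = NULL;
    else
        l->head = NULL;
    return p;
}

/* One scan step, always at the tail ("end"): a referenced page rotates
 * back to the head with its bit cleared; an unreferenced page is handed
 * back to the caller for demotion or eviction. */
static struct lpage *lru_scan_one(struct lru *l)
{
    struct lpage *p = lru_del_tail(l);
    if (p && p->referenced) {
        p->referenced = 0;
        lru_add_head(l, p);
        return NULL;            /* nothing evicted this step */
    }
    return p;
}
```

Because the scanner only ever looks at the tail, it never walks the whole list in one step; a referenced page simply gets a full extra trip down the list before it is considered again, which is the point Stephan's follow-up question probes.
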
* Re: broken VM in 2.4.10-pre9
2001-09-16 22:14 ` Linus Torvalds
@ 2001-09-16 23:29 ` Stephan von Krawczynski
0 siblings, 0 replies; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-16 23:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Stephan von Krawczynski, Andreas Steinmetz, linux-kernel

> >>> (The new code is a simple state machine:
> >>>
> >>> - touch non-referenced page: set the reference bit
> >>>
> >>> - touch already referenced page: move it to next list "upwards" (ie the
> >>> active list)
> >>>
> >>> - age a non-referenced page on a list: move to "next" list downwards (ie
> >>> free if already inactive, move to inactive if currently active)
> >>>
> >>> - age a referenced page on a list: clear ref bit and move to beginning of
> >>> same list.
> > Are you sure about the _beginning_? You are aging out _all_ non-ref
> > pages in the next step?
>
> Well, it depends on what your definition of "is" is..
>
> Or rather, what the "beginning" is. The way things work now, is that all
> pages are added to the "beginning", and the aging is done from the end,
> moving pages at the end to other lists (or, in the case of a referenced
> page, back to the beginning).

Wait a minute: if you age a page in active by clearing its ref-bit and
moving it to the beginning of the list, and your next aging cycle
starts from the end, that reads like you have to walk the whole list to
find all the ageable pages for moving "downwards".

In fact I expected your aging to start the list from the beginning, as
there should be "most" of the non-ref entries, and you could stop
age-walking on hitting the first ref-page in that list. This was my
guess when aging is used to find possible free pages _fast_. Am I wrong
or is this idea generally not usable?

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-16 22:59 ` Stephan von Krawczynski
2001-09-16 22:14 ` Linus Torvalds
@ 2001-09-17 15:35 ` Stephan von Krawczynski
2001-09-17 15:51 ` Linus Torvalds
2001-09-17 16:34 ` Stephan von Krawczynski
1 sibling, 2 replies; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-17 15:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: ast, linux-kernel

On Sun, 16 Sep 2001 15:14:22 -0700 (PDT)
Linus Torvalds <torvalds@transmeta.com> wrote:

> > Very willing. Just send it to me, please.
>
> It's there as 2.4.10pre10, on ftp.kernel.org under "testing" now.
>
> However, note that it hasn't gotten any "tweaking", ie there's none of the
> small changes that aging differences usually tend to need. I'm hoping
> that's ok, as the new behaviour shouldn't be that different from the old
> behaviour in most cases, and that the biggest differences _should_ be just
> proper once-use things.
>
> But it would be interesting to hear which loads show markedly worse/better
> behaviour. If any.

Hello,

I tried my usual test setup today with 2.4.10-pre10 and experienced the
following:

- cpu load goes pretty high (11-12 according to xosview) during several
occasions, up to the point where you cannot even move the mouse.
Compared to a once-tested ac-version it is not _that_ nice. I have some
problems cat'ing /proc/meminfo, too. It sometimes takes pretty long
(minutes).

- the meminfo shows me a great difference to former versions in the
balancing of inact_dirty and active. This pre10 tends to have a _lot_
more inact_dirty pages than active (compared to pre9 and before) in my
test. I guess this is intended by this (used-once) patch. So take this
as a hint that your work performs as expected.

- of course the alloc problems themselves stayed the same.

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-17 15:35 ` Stephan von Krawczynski
@ 2001-09-17 15:51 ` Linus Torvalds
2001-09-17 16:34 ` Stephan von Krawczynski
1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-17 15:51 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: ast, linux-kernel

On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
>
> - cpu load goes pretty high (11-12 according to xosview) during several
> occasions, up to the point where you cannot even move the mouse. Compared to a
> once-tested ac-version it is not _that_ nice. I have some problems cat'ing
> /proc/meminfo, too. It sometimes takes pretty long (minutes).

It's not really CPU load - the load average in Linux (and some other
UNIXes too) also accounts for disk wait.

> - the meminfo shows me a great difference to former versions in the balancing of
> inact_dirty and active. This pre10 tends to have a _lot_ more inact_dirty pages
> than active (compared to pre9 and before) in my test. I guess this is intended
> by this (used-once) patch. So take this as a hint that your work performs as
> expected.

No, I think they are related, and bad. I suspect it just means that
pages really do not get elevated to the active list, and it's probably
_too_ unwilling to activate pages. That's bad too - it means that the
inactive list is the one solely responsible for working set changes,
and the VM won't bother with any other pages. Which also leads to bad
results..

That's always the downside with having multiple lists of any kind - if
the balance between the lists is bad, performance will be bad.
Historically, the active list was the big one, and the other ones
mostly didn't matter, which makes the balancing issue much less
noticeable.

[ This is also the very same problem we used to have with buffer cache
pages vs mapped pages vs other caches ]

The fix may be to just make the inactive lists not do aging at all.

Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-17 15:35 ` Stephan von Krawczynski
2001-09-17 15:51 ` Linus Torvalds
@ 2001-09-17 16:34 ` Stephan von Krawczynski
2001-09-17 16:46 ` Linus Torvalds
2001-09-17 17:20 ` Stephan von Krawczynski
1 sibling, 2 replies; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-17 16:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, ast

On Mon, 17 Sep 2001 08:51:54 -0700 (PDT)
Linus Torvalds <torvalds@transmeta.com> wrote:

> On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
> >
> > - cpu load goes pretty high (11-12 according to xosview) during several
> > occasions, up to the point where you cannot even move the mouse. Compared to a
> > once-tested ac-version it is not _that_ nice. I have some problems cat'ing
> > /proc/meminfo, too. It sometimes takes pretty long (minutes).
>
> It's not really CPU load - the load average in Linux (and some other UNIXes
> too) also accounts for disk wait.

Well, what I meant was: compared to the _same_ situation and test bed,
the load seems "pretty high". ac versions are somewhat lower in this
setup.

> > - the meminfo shows me a great difference to former versions in the balancing of
> > inact_dirty and active. This pre10 tends to have a _lot_ more inact_dirty pages
> > than active (compared to pre9 and before) in my test. I guess this is intended
> > by this (used-once) patch. So take this as a hint that your work performs as
> > expected.
>
> No, I think they are related, and bad. I suspect it just means that pages
> really do not get elevated to the active list, and it's probably _too_
> unwilling to activate pages. That's bad too - it means that the inactive
> list is the one solely responsible for working set changes, and the VM
> won't bother with any other pages. Which also leads to bad results..

Hm, remember my setup: I read a lot from CD, write it to disk, and read
a lot from nfs and write it to disk.
Basically both are read-once / write-once setups, so the pages are
touched once (or at worst twice) at maximum, so I see a good chance
none of them ever make it to the active list, according to your state
explanation from previous posts. And that's what I see (I guess). If I
do a CD compare (read disk, read CD and compare) I see lots of pages
walk over to active. And that again looks as you told before. I think
it does work as you said.

Anyway I cannot "feel" a difference in performance (maybe even worse
than before), but it _looks_ cleaner. How about taking it as a first
step in the cleanup direction? :-)

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-17 16:34 ` Stephan von Krawczynski
@ 2001-09-17 16:46 ` Linus Torvalds
2001-09-17 17:20 ` Stephan von Krawczynski
1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-17 16:46 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: linux-kernel, ast

On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
> >
> > No, I think they are related, and bad. I suspect it just means that pages
> > really do not get elevated to the active list, and it's probably _too_
> > unwilling to activate pages. That's bad too - it means that the inactive
> > list is the one solely responsible for working set changes, and the VM
> > won't bother with any other pages. Which also leads to bad results..
>
> Hm, remember my setup: I read a lot from CD, write it to disk and read a lot
> from nfs and write it to disk. Basically both are read-once / write-once
> setups, so the pages are touched once (or at worst twice) at maximum, so I see a
> good chance none of them ever make it to the active list, according to your
> state explanation from previous posts.

Right. That part is fine.

The problematic part is that I suspect that _because_ there's a lot of
inactive pages, the VM layer won't even try to age the active ones.
Which will result in the inactive pages being re-circulated reasonably
quickly..

Hmm. Although maybe that's the right behaviour, considering that you
don't actually _want_ to cache them. It leaves your _truly_ active set
untouched.

> Anyway I cannot "feel" a difference in performance (maybe even worse than
> before), but it _looks_ cleaner. How about taking it as a first step in the
> cleanup direction? :-)

"Looks cleaner" is very important for me for maintenance reasons -
having behaviour that you cannot explain tends to result in more and
more ad-hoc hacks over time, and it just tends to get worse and worse.

However, at the same time I'd really like to hear about improved
behaviour, not just "feels the same".
And certainly not "(maybe even worse.."

Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-17 16:34 ` Stephan von Krawczynski
2001-09-17 16:46 ` Linus Torvalds
@ 2001-09-17 17:20 ` Stephan von Krawczynski
2001-09-17 17:37 ` Linus Torvalds
1 sibling, 1 reply; 76+ messages in thread
From: Stephan von Krawczynski @ 2001-09-17 17:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, ast

On Mon, 17 Sep 2001 09:46:28 -0700 (PDT)
Linus Torvalds <torvalds@transmeta.com> wrote:

> "Looks cleaner" is very important for me for maintenance reasons - having
> behaviour that you cannot explain tends to result in more and more ad-hoc
> hacks over time, and it just tends to get worse and worse.

Agreed.

> However, at the same time I'd really like to hear about improved
> behaviour, not just "feels the same". And certainly not "(maybe even
> worse.."

Hm, sorry for that. But that's what I see. Maybe the problem is now on
a different field.

> The problematic part is that I suspect that _because_ there's a lot of
> inactive pages, the VM layer won't even try to age the active ones.
> Which will result in the inactive pages being re-circulated reasonably
> quickly..

Do you think this re-circulation is _fast_ in current code? Maybe
performance loss comes from this point?

BTW: I tried Andrea's brand new patch and have to admit it has a _big_
performance gain, though I understand you dislike the design very much.

Regards,
Stephan

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-17 17:20 ` Stephan von Krawczynski
@ 2001-09-17 17:37 ` Linus Torvalds
0 siblings, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-17 17:37 UTC (permalink / raw)
To: Stephan von Krawczynski; +Cc: linux-kernel, ast

On Mon, 17 Sep 2001, Stephan von Krawczynski wrote:
>
> > However, at the same time I'd really like to hear about improved
> > behaviour, not just "feels the same". And certainly not "(maybe even
> > worse.."
>
> Hm, sorry for that. But that's what I see. Maybe the problem is now on a
> different field.

Heh. I wasn't blaming you. The code obviously leaves something to be
desired, still.

> BTW: I tried Andrea's brand new patch and have to admit it has a _big_
> performance gain, though I understand you dislike the design very much.

I only dislike one aspect of it, not the whole patch.

Andrea has spent a lot of time doing tuning, which is hugely important
for real-world workloads. I also suspect from previous patches that he
increases read-ahead aggressively etc.

I'll take a look,

Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-16 19:43 ` Linus Torvalds
` (2 preceding siblings ...)
2001-09-16 20:29 ` Andreas Steinmetz
@ 2001-09-17 0:37 ` Daniel Phillips
2001-09-17 1:07 ` Linus Torvalds
2001-09-21 3:10 ` Bill Davidsen
4 siblings, 1 reply; 76+ messages in thread
From: Daniel Phillips @ 2001-09-17 0:37 UTC (permalink / raw)
To: Linus Torvalds, linux-kernel

On September 16, 2001 09:43 pm, Linus Torvalds wrote:

> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times).

Can we confirm that the mp3 player is making subpage accesses? (strace)

The 'partially read/written' state isn't handled properly now. The
transition to the 'used-once' state should only occur if the transfer
ends at the exact end of the page. Right now it always takes place
after the *first* transfer on the page, which is correct only for
full-page transfers.

It's still best to start all pages unreferenced, because otherwise we
don't have a means of distinguishing between the first and subsequent
page cache lookups. The check_used_once logic should set the page
referenced if the IO transfer ends in the interior of the page, or
unreferenced if it ends at the end of the page.

This is straightforward to fix; I'll have a tested patch by Tuesday if
nobody beats me to it.

I don't think this is the whole problem though, it's just exposing a
balancing problem. Even if I did go and randomly access a huge file so
that all cache pages have high age (the effect we're simulating by
accident here), I still shouldn't be able to drive all my swap pages
out of memory.

--
Daniel

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
2001-09-17 0:37 ` Daniel Phillips
@ 2001-09-17 1:07 ` Linus Torvalds
2001-09-17 2:23 ` Daniel Phillips
` (2 more replies)
0 siblings, 3 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-17 1:07 UTC (permalink / raw)
To: Daniel Phillips; +Cc: linux-kernel

On Mon, 17 Sep 2001, Daniel Phillips wrote:
>
> Can we confirm that the mp3 player is making subpage accesses? (strace)

People claim that they do mmap's, which the old code definitely didn't
handle correctly at all.

I'm not 100% sure that the 2.4.10-pre10 aging is right for anonymous
pages either, and the page-referenced handling at COW time looks
suspiciously broken, for example. It's not something we have ever
gotten right, I think - if the old pre-COW page was marked accessed, we
should mark that page referenced before we break the COW. Otherwise
we'll move over to a new page without crediting the source.

> The 'partially read/written' state isn't handled properly now. The
> transition to the 'used-once' state should only occur if the transfer ends at
> the exact end of the page. Right now it always takes place after the *first*
> transfer on the page which is correct only for full-page transfers.

No, it's not as easy as you make it sound.

The problem is that partial accesses are real, and they should be
counted as such - except when they are _linear_ partial accesses, in
which case they should not be counted at all except for the first one.

Having some "if transfer ends at end of page" logic would minimally get
the end-of-file case wrong, for example, never mind the case of a
reader that is seeking around in the file. The EOF case could be worked
around with yet another hack, but I suspect that the real fix is to try
to fix applications that do bad things.

> It's still best to start all pages unreferenced, because otherwise we don't
> have a means of distinguishing between the first and subsequent page cache
> lookups.
> The check_used_once logic should set the page referenced if the IO
> transfer ends in the interior of the page or unreferenced if it ends at the
> end of the page.

See how 2.4.10-pre10 doesn't have any use_once hackery at all, but
instead has a clear path on references:

	prefetching:			non-referenced page on inactive list
	after 1st reference:		referenced page on inactive list
	after 2nd reference:		non-referenced page on active list
	after 3rd and subsequent:	referenced page on active list

while the "age down" logic is the exact reverse of the above. Logical
and easy to implement, and gives four distinct "stages" for all pages
(along with the LRU ordering within each list, of course).

Now, the above _is_ different from what we used to do. For one thing,
it's logical. But it might be different enough that the heuristics we
have for aging may need some tuning again. "Logical" is not enough..

There's also a few issues that I don't like right now wrt reference
handling, notably:

 - COW issue mentioned above. Probably trivially fixed by something
   like

diff -u --recursive --new-file pre10/linux/mm/memory.c linux/mm/memory.c
--- pre10/linux/mm/memory.c	Sun Sep 16 18:01:48 2001
+++ linux/mm/memory.c	Sun Sep 16 18:00:59 2001
@@ -955,6 +955,8 @@
 	if (pte_same(*page_table, pte)) {
 		if (PageReserved(old_page))
 			++mm->rss;
+		if (pte_young(pte))
+			mark_page_accessed(old_page);
 		break_cow(vma, new_page, address, page_table);
 		/* Free the old page.. */

   which looks right (it basically saves off the referenced bit for the
   old page table entry in the physical page reference count).

 - truly anonymous pages (ie before they've been added to the swap
   cache) are not necessarily going to behave as nicely as other pages.
   They magically appear after VM scanning as a "1st reference", and I
   have a reasonably good argument that says that they'll have been
   aged up and down roughly the same number of times, which makes this
   more-or-less correct. But it's still a theoretical argument, nothing
   more.
This could reasonably easily be fixed by adding these anonymous pages
to the LRU lists anyway (with a bogus "page->mapping" which causes them
to be re-mapped as _real_ swap cache pages when they need writeout),
but that's a bit too subtle for my taste. If anybody wants to look into
this, I'd love to know if it makes a difference in behaviour, though..

 - I don't like the lack of aging in 'reclaim_page()'. It will walk the
   whole LRU list if required, which kind of defeats the purpose of
   having reference bits and LRU on that list. The code _claims_ that
   it almost always succeeds with the first page, but I don't see why
   it would. I think that comment assumed that the inactive_clean list
   cannot have any referenced pages, but that's never been true.

There are probably other issues too; these are the ones I was wondering
about when I walked over the use of the PG_referenced bit..

Linus

^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17  1:07 ` Linus Torvalds
@ 2001-09-17  2:23 ` Daniel Phillips
  2001-09-17  5:11 ` Jan Harkes
  2001-09-17 12:26 ` Rik van Riel
  2 siblings, 0 replies; 76+ messages in thread
From: Daniel Phillips @ 2001-09-17  2:23 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel

On September 17, 2001 03:07 am, Linus Torvalds wrote:
> > The 'partially read/written' state isn't handled properly now. The
> > transition to the 'used-once' state should only occur if the transfer
> > ends at the exact end of the page. Right now it always takes place
> > after the *first* transfer on the page, which is correct only for
> > full-page transfers.
>
> No, it's not as easy as you make it sound.
>
> The problem is that partial accesses are real, and they should be counted
> as such - except when they are _linear_ partial accesses, in which case
> they should not be counted at all except for the first one.

Yes, in a really fancy VM manager we'd analyze the access patterns to get
a reliable determination of what is serial access and what is not, and
we'd do things like retroactively lowering the priority of pages we didn't
initially know much about but were later able to determine were part of a
serial access.

But we can pick the low-hanging fruit by just looking at where the most
recent access lands. This relies on the fact that most serial transfers
proceed forward. If we get it wrong, a few pages end up referenced when
they should not be, but so what? Also, if a page ends up unreferenced when
it should be referenced, it still has a good chance of being rescued.

> Having some "if transfer ends at end of page" logic would minimally get
> the end-of-file case wrong,

Right, the condition should be "transfer ends exactly at the lower of the
end of page or end of file".

> for example, never mind the case of a reader
> that is seeking around in the file.
> The EOF case could be worked around
> with yet another hack, but I suspect that the real fix is to try to fix
> applications that do bad things.

I'd say this one isn't a hack, it's just a matter of finishing the job.
Sorry, I should have tested this a week ago, but I got a little
distracted, if you know what I mean.

Seeking around in a file will be handled OK. We'll tend to drop those
pages that are fully accessed but not reaccessed soon, and retain pages
that are partially accessed. So long as the aging mechanism doesn't drop
the ball, such pages will just live a little longer in cache, not forever.

What we're missing is a way for the swap cache to 'push back' at the page
cache. Right now, the little bit of extra pressure I accidentally created
by ignoring the subpage transfers is pushing all anonymous pages out of
memory. That's way too fragile.

--
Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread
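The refined condition Daniel states above ("transfer ends exactly at the lower of the end of page or end of file") can be written down directly. A user-space sketch, not a kernel patch; the function name and the 4K page size are illustrative:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Decide whether a transfer into a page should set the referenced bit.
 * A transfer that ends exactly at the lower of the end of the page and
 * the end of the file looks like plain forward streaming (use-once), so
 * the page is left unreferenced.  A transfer that stops in the interior
 * of the page suggests the reader is seeking around and may come back,
 * so the page is marked referenced.
 *
 * page_offset:  file offset of the start of the page
 * transfer_end: file offset just past the last byte transferred
 * file_size:    total file size in bytes
 *
 * Returns 1 if the page should get the referenced bit. */
int transfer_marks_referenced(unsigned long page_offset,
			      unsigned long transfer_end,
			      unsigned long file_size)
{
	unsigned long page_end = page_offset + PAGE_SIZE;
	unsigned long limit = page_end < file_size ? page_end : file_size;

	return transfer_end != limit;
}
```

Taking the minimum with `file_size` is what handles the end-of-file case Linus objects to: the last, short page of a sequentially read file still counts as use-once.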
* Re: broken VM in 2.4.10-pre9
  2001-09-17  1:07 ` Linus Torvalds
  2001-09-17  2:23 ` Daniel Phillips
@ 2001-09-17  5:11 ` Jan Harkes
  2001-09-17 12:33 ` Daniel Phillips
  2001-09-17 15:38 ` Linus Torvalds
  2001-09-17 12:26 ` Rik van Riel
  2 siblings, 2 replies; 76+ messages in thread
From: Jan Harkes @ 2001-09-17  5:11 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

On Sun, Sep 16, 2001 at 06:07:34PM -0700, Linus Torvalds wrote:
> See how 2.4.10-pre10 doesn't have any use_once hackery at all, but
> instead has a clear path on references:
>
>	prefetching:				non-referenced page on inactive list
>	after 1st reference:			referenced page on inactive list
>	after 2nd reference:			non-referenced page on active list
>	after 3rd and subsequent accesses:	referenced page on active list

So it ends up using a 'used_thrice' hack. Yeah, that does solve some of
the used-once problems ;)

> - COW issue mentioned above. Probably trivially fixed by something like

The COW is triggered by a pagefault, so the page will be accessed and the
hardware bits (both accessed and dirty) should get set automatically.

> - truly anonymous pages (ie before they've been added to the swap cache)
>   are not necessarily going to behave as nicely as other pages. They

I just found a simple example that none of the 2.4.x kernels really like
that much. Create a program that malloc's the available free memory minus
5-10MB, memset's it to page the memory in as anonymous pages, and then
goes to sleep. Then run something like a kernel compile. Even if there is
enough memory left to catch the allocation spikes and avoid swapping, the
system will be heavily paging with the small amount of "aged memory" that
is left over to work with.

> but that's a bit too subtle for my taste. If anybody wants to look into
> this, I'd love to know if it makes a difference in behaviour, though..
pre10 right after booting,

	MemTotal:       127104 kB
	MemFree:         41844 kB
	Active:          11632 kB
	Inact_dirty:     19148 kB
	Inact_clean:         0 kB
	Inact_target:     1004 kB

pre9 with Rik's reverse mapping & delayed swap allocation and my local
hacks,

	MemTotal:       126976 kB
	MemFree:         41244 kB
	Active:          80032 kB
	Inact_dirty:         0 kB
	Inact_clean:         0 kB
	Inact_target:      984 kB

Inactive target is interesting, because it is directly related to the
amount of memory pressure we've seen (memory_pressure >> 6). Also, as
we're still far from running low on free memory, nothing was pushed onto
the inactive lists (yes, there is no used_once or used_thrice stuff at
all). Pre10, on the other hand, has about 50 MB that is 'lost' to
anonymous pages which don't get aged until we start swapping things out.

Differences are definitely noticeable, but I'm almost sure that is mostly
related to the fact that we have all potentially pageable or swappable
memory on the lists.

> - I don't like the lack of aging in 'reclaim_page()'. It will walk the
>   whole LRU list if required, which kind of defeats the purpose of
>   having reference bits and LRU on that list. The code _claims_ that it
>   almost always succeeds with the first page, but I don't see why it
>   would. I think that comment assumed that the inactive_clean list
>   cannot have any referenced pages, but that's never been true.

As far as I can understand the _original_ design on which the current VM
is based, aging only occurs to pages on the active 'ring'; the inactive
lists are basically LRU-ordered victim caches. Pages are unmapped before
they go to the inactive_dirty list, and buffers are flushed before they
can go to inactive_clean.

Of course, both the used_once changes and -pre10 sort of flushed these
designs down the toilet by putting mapped pages on the inactive_dirty
list and turning the active list into an LRU.

Jan

^ permalink raw reply	[flat|nested] 76+ messages in thread
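The anonymous-page experiment Jan describes earlier in this message (malloc most of free memory, touch it all, then idle) might look like the sketch below. The function name is made up for illustration; reading MemFree from /proc/meminfo and subtracting the 5-10MB headroom is left to the caller, who would then sleep while a kernel compile runs:

```c
#include <stdlib.h>
#include <string.h>

/* Pin `bytes` of anonymous memory.  malloc only reserves address space;
 * the memset writes every page once, faulting each one in as a dirty
 * anonymous page with no backing file - exactly the kind of page the
 * 2.4 aging code never sees until swapout starts.
 * Returns the pinned region, or NULL on allocation failure. */
char *pin_anonymous_pages(size_t bytes)
{
	char *mem = malloc(bytes);

	if (mem)
		memset(mem, 1, bytes);
	return mem;
}
```

A driver program would call this with roughly MemFree minus 5-10MB and then `pause()`, holding the pages resident but untouched for the rest of the run.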
* Re: broken VM in 2.4.10-pre9
  2001-09-17  5:11 ` Jan Harkes
@ 2001-09-17 12:33 ` Daniel Phillips
  2001-09-17 12:41 ` Rik van Riel
  2001-09-17 16:14 ` Jan Harkes
  1 sibling, 2 replies; 76+ messages in thread
From: Daniel Phillips @ 2001-09-17 12:33 UTC (permalink / raw)
To: Jan Harkes, Linus Torvalds; +Cc: linux-kernel

On September 17, 2001 07:11 am, Jan Harkes wrote:
> As far as I can understand the _original_ design on which the current VM
> is based, aging only occurs to pages on the active 'ring', the inactive
> lists are basically LRU-ordered victim caches. Pages are unmapped before
> they go to the inactive_dirty list and buffers are flushed before they
> can go to inactive_clean.
>
> Of course both the used_once changes and -pre10 sort of flushed these
> designs down the toilet by putting mapped pages on the inactive_dirty
> list and turning the active list into an LRU.

The active list is *supposed* to approximate an LRU. The inactive lists
are not LRUs but queues, and always have been. The inactive queues have
always had both mapped and unmapped pages on them.

The reason for unmapping a swap cache page when putting it on the inactive
queue is to give it some time to be rescued, since we otherwise have no
information about its short-term activity: we have no way of accessing the
hardware dirty bit given only the physical page on the LRU. A second
reason for unmapping it is that we don't have any choice. The point where
we place it on the inactive queue is the last point where we're able to
find its userspace page table entry.

<paid advertisement>
We'd be able to avoid unmapping swap cache pages with Rik's rmap patch,
because we can easily check the hardware referenced bit before finally
evicting the page. Plus, and I hope I'm interpreting this correctly, we
can allocate the swap slot and perform swap clustering at that time,
greatly simplifying the swapout code.
</paid advertisement> ;-)

Drifting a little further offtopic.
As far as I can tell, there's no fundamental reason why we cannot make the
current strategy work as well as Rik's rmaps probably will, with some more
blood, sweat and code study. On the other hand, Matt Dillon, the reigning
champion of virtual memory management, was quite firm in stating that we
should drop the current virtual-scanning strategy in favor of 100%
physical scanning as BSD uses, relying on reverse mapping.

http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
(Matt Dillon holds forth on the design of BSD's memory manager)

--
Daniel

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:33 ` Daniel Phillips
@ 2001-09-17 12:41 ` Rik van Riel
  2001-09-17 14:49 ` Daniel Phillips
  2001-09-17 16:14 ` Jan Harkes
  1 sibling, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-09-17 12:41 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Jan Harkes, Linus Torvalds, linux-kernel

On Mon, 17 Sep 2001, Daniel Phillips wrote:

> Drifting a little further offtopic. As far as I can tell, there's no
> fundamental reason why we cannot make the current strategy work as
> well as Rik's rmaps probably will, with some more blood, sweat and
> code study.

I don't see any possibility to get that to work without reverse mapping.
Of course, that could be me overlooking some possibility, but I'm not
holding my breath waiting for somebody to invent this other possibility.

> On the other hand, Matt Dillon, the reigning champion of
> virtual memory management, was quite firm in stating that we should
> drop the current virtual-scanning strategy in favor of 100%
> physical scanning as BSD uses, relying on reverse mapping.
>
> http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html
> (Matt Dillon holds forth on the design of BSD's memory manager)

His claims are backed up by FreeBSD's VM performance, so I'm inclined to
believe them. If you think you can come up with something better, I'll
believe you when you show it...

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-17 12:41 ` Rik van Riel @ 2001-09-17 14:49 ` Daniel Phillips 0 siblings, 0 replies; 76+ messages in thread From: Daniel Phillips @ 2001-09-17 14:49 UTC (permalink / raw) To: Rik van Riel; +Cc: Jan Harkes, Linus Torvalds, linux-kernel On September 17, 2001 02:41 pm, Rik van Riel wrote: > > On the other hand, Matt Dillon, the reigning champion of > > virtual memory managment, was quite firm in stating that we should > > drop the current virtually scanning strategy in favor of 100% > > physical scanning as BSD uses, relying on reverse mapping. > > > > http://mail.nl.linux.org/linux-mm/2000-05/msg00419.html > > (Matt Dillon holds forth on the design of BSD's memory manager) > > His claims are backed up by FreeBSD's VM performance, > so I'm inclined to believe them. If you think you can > come up with something better, I'll believe you when > you show it... Rik, read the post, I'm supporting you. Please don't be so paranoid ;-) -- Daniel ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:33 ` Daniel Phillips
  2001-09-17 12:41 ` Rik van Riel
@ 2001-09-17 16:14 ` Jan Harkes
  2001-09-17 16:34 ` Linus Torvalds
  1 sibling, 1 reply; 76+ messages in thread
From: Jan Harkes @ 2001-09-17 16:14 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Linus Torvalds, linux-kernel

On Mon, Sep 17, 2001 at 02:33:12PM +0200, Daniel Phillips wrote:
> The inactive queues have always had both mapped and unmapped pages on
> them. The reason for unmapping a swap cache page when putting it

So the following code in refill_inactive_scan only exists in my
imagination?

	if (page_count(page) <= (page->buffers ? 2 : 1)) {
		deactivate_page_nolock(page);
		page_active = 0;
	} else {
		page_active = 1;
	}

We only move pages to the inactive list when they have one reference from
the page cache and one from buffers. Since all mapped pte's also keep a
reference, this means that there cannot be any pte's that point to this
page by the time we decide to deactivate the page.

> any choice. The point where we place it on the inactive queue is the
> last point where we're able to find its userspace page table entry.

And that is because we only move it after all pte's have been unmapped.

Jan

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-17 16:14 ` Jan Harkes @ 2001-09-17 16:34 ` Linus Torvalds 0 siblings, 0 replies; 76+ messages in thread From: Linus Torvalds @ 2001-09-17 16:34 UTC (permalink / raw) To: Jan Harkes; +Cc: Daniel Phillips, linux-kernel On Mon, 17 Sep 2001, Jan Harkes wrote: > On Mon, Sep 17, 2001 at 02:33:12PM +0200, Daniel Phillips wrote: > > The inactive queues have always had both mapped and unmapped pages on > > them. The reason for unmapping a swap cache page page when putting it > > So the following code in refill_inactive_scan only exists in my > imagination? > > if (page_count(page) <= (page->buffers ? 2 : 1)) { > deactivate_page_nolock(page); No, but I agree with Daniel that it's wrong. The reason it exists there is because the current inactive_clean list scanning doesn't have any pressure into VM scanning, so if we'd let mapped pages on the inactive queue, then reclaim_page() would be unhappy about them. That can be solved several ways: - like we do now. Hackish and wrong, but kind-of-works. - make reclaim_page() have the ability to do vm scanning pressure (ie if it starts noticing that there are too many mapped pages on the reclaim list, it should cause VM scan) - physical maps Actually, now that I look at it, the lack of de-activation actually hurts page_launder() - which doesn't get to launder pages that are still mapped (even though getting rid of buffers from them would almost certainly be good under memory pressure). Linus ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-17 5:11 ` Jan Harkes 2001-09-17 12:33 ` Daniel Phillips @ 2001-09-17 15:38 ` Linus Torvalds 1 sibling, 0 replies; 76+ messages in thread From: Linus Torvalds @ 2001-09-17 15:38 UTC (permalink / raw) To: Jan Harkes; +Cc: Daniel Phillips, linux-kernel On Mon, 17 Sep 2001, Jan Harkes wrote: > > > - COW issue mentioned above. Probably trivially fixed by something like > > The COW is triggered by a pagefault, so the page will be accessed and > the hardware bits (both accessed and dirty) should get set automatically. No. The point is that yes, the bits are set in the _page_table_, but we've never set them on the physical page. And the COW fault will switch the page table entry to a new page, so if we don't set the referenced bit on the physical page at that time, we _never_ will. Linus ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17  1:07 ` Linus Torvalds
  2001-09-17  2:23 ` Daniel Phillips
  2001-09-17  5:11 ` Jan Harkes
@ 2001-09-17 12:26 ` Rik van Riel
  2001-09-17 15:42 ` Linus Torvalds
  2001-09-17 17:33 ` Linus Torvalds
  2 siblings, 2 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-17 12:26 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel

On Sun, 16 Sep 2001, Linus Torvalds wrote:

> - truly anonymous pages (ie before they've been added to the swap cache)
>   are not necessarily going to behave as nicely as other pages. They
>   magically appear after VM scanning as a "1st reference", and I have a
>   reasonably good argument that says that they'll have been aged up and
>   down roughly the same number of times, which makes this more-or-less
>   correct. But it's still a theoretical argument, nothing more.

This nicely points out the problem with page aging which Linux has always
had. Pages which are referenced all the time by the processes using them
STILL get aged down all the time.

I suspect that the biggest impact the reverse mapping patch has right now
is caused by fixing this behaviour and just aging a page up when it is
referenced and down when it is not.

> - I don't like the lack of aging in 'reclaim_page()'. It will walk the
>   whole LRU list if required, which kind of defeats the purpose of
>   having reference bits and LRU on that list. The code _claims_ that it
>   almost always succeeds with the first page, but I don't see why it
>   would. I think that comment assumed that the inactive_clean list
>   cannot have any referenced pages, but that's never been true.

This depends on whether we do reactivation in __find_page_nolock() or if
we leave the page alone and wait for kswapd to do that for us.

regards,

Rik
--
IA64: a worthy successor to i860.
http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-17 12:26 ` Rik van Riel @ 2001-09-17 15:42 ` Linus Torvalds 2001-09-18 12:04 ` Rik van Riel 2001-09-17 17:33 ` Linus Torvalds 1 sibling, 1 reply; 76+ messages in thread From: Linus Torvalds @ 2001-09-17 15:42 UTC (permalink / raw) To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel On Mon, 17 Sep 2001, Rik van Riel wrote: > > > - I don't like the lack of aging in 'reclaim_page()'. It will walk the > > whole LRU list if required, which kind of defeats the purpose of having > > reference bits and LRU on that list. The code _claims_ that it almost > > always succeeds with the first page, but I don't see why it would. I > > think that comment assumed that the inactive_clean list cannot have any > > referenced pages, but that's never been true. > > This depends on whether we do reactivation in __find_page_nolock() > or if we leave the page alone and wait for kswapd to do that for > us. We should not do _anything_ in __find_page_nolock(). It's positively wrong to touch any aging information there - if you do, you are guaranteed to not get read-ahead right (ie a page that gets read-ahead first will behave differently than a page that got read directly, which just cannot be right). The aging has to be done at a higher level (ie when you actually _use_ it, not when you search the hash queues). Linus ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-17 15:42 ` Linus Torvalds @ 2001-09-18 12:04 ` Rik van Riel 0 siblings, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-18 12:04 UTC (permalink / raw) To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel On Mon, 17 Sep 2001, Linus Torvalds wrote: > We should not do _anything_ in __find_page_nolock(). > The aging has to be done at a higher level (ie when you actually _use_ > it, not when you search the hash queues). Absolutely agreed. In fact, I already did this last week in the -still not published- new version of the reverse mapping patch ;) (now if I only could get that thing SMP safe in an efficient way) regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:26 ` Rik van Riel
  2001-09-17 15:42 ` Linus Torvalds
@ 2001-09-17 17:33 ` Linus Torvalds
  2001-09-17 18:07 ` Linus Torvalds
  2001-09-18 12:09 ` Rik van Riel
  1 sibling, 2 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-17 17:33 UTC (permalink / raw)
To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel

On Mon, 17 Sep 2001, Rik van Riel wrote:

> > - truly anonymous pages (ie before they've been added to the swap
> >   cache) are not necessarily going to behave as nicely as other pages.
> >   They magically appear after VM scanning as a "1st reference", and I
> >   have a reasonably good argument that says that they'll have been
> >   aged up and down roughly the same number of times, which makes this
> >   more-or-less correct. But it's still a theoretical argument, nothing
> >   more.
>
> This nicely points out the problem with page aging which Linux
> has always had. Pages which are referenced all the time by the
> processes using them STILL get aged down all the time.
>
> I suspect that the biggest impact the reverse mapping patch
> has right now is caused by fixing this behaviour and just
> aging a page up when it is referenced and down when it is not.

Well, here's a 10-line patch to make the anonymous pages get on the LRU
queues, and thus get aged along with all the others.

NOTE NOTE NOTE! This is _literally_ a 15-minute hack, and I expect that
there are paths where I forget to remove the page from the LRU queue
(which should result in a nice big oops in __free_pages_ok()). Also, I
didn't look into shm handling - it _looks_ like shm will remove the page
from the LRU list and re-insert it, which will lose all list information,
of course.

But the point is that keeping anonymous pages on the LRU list shouldn't be
all that hard. Even if I missed something on this first try.
Linus

------
diff -u --recursive --new-file penguin/linux/mm/filemap.c linux/mm/filemap.c
--- penguin/linux/mm/filemap.c	Mon Sep 17 09:22:57 2001
+++ linux/mm/filemap.c	Mon Sep 17 09:15:45 2001
@@ -489,7 +489,6 @@
 	page->index = index;
 	add_page_to_inode_queue(mapping, page);
 	add_page_to_hash_queue(page, page_hash(mapping, index));
-	lru_cache_add(page);
 	spin_unlock(&pagecache_lock);
 }

diff -u --recursive --new-file penguin/linux/mm/memory.c linux/mm/memory.c
--- penguin/linux/mm/memory.c	Mon Sep 17 09:23:55 2001
+++ linux/mm/memory.c	Mon Sep 17 10:15:57 2001
@@ -958,6 +958,7 @@
 		if (pte_young(pte))
 			mark_page_accessed(old_page);
 		break_cow(vma, new_page, address, page_table);
+		lru_cache_add(new_page);

 		/* Free the old page.. */
 		new_page = old_page;
@@ -1198,6 +1199,7 @@
 		mm->rss++;
 		flush_page_to_ram(page);
 		entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
+		lru_cache_add(page);
 	}
 	set_pte(page_table, entry);

diff -u --recursive --new-file penguin/linux/mm/shmem.c linux/mm/shmem.c
--- penguin/linux/mm/shmem.c	Mon Sep 17 09:22:57 2001
+++ linux/mm/shmem.c	Mon Sep 17 09:17:12 2001
@@ -356,6 +356,7 @@
 			flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_referenced) | (1 << PG_arch_1));
 			page->flags = flags | (1 << PG_dirty);
 			add_to_page_cache_locked(page, mapping, idx);
+			lru_cache_add(page);
 			info->swapped--;
 			spin_unlock (&info->lock);
 		} else {
diff -u --recursive --new-file penguin/linux/mm/swap.c linux/mm/swap.c
--- penguin/linux/mm/swap.c	Wed Aug  8 15:17:26 2001
+++ linux/mm/swap.c	Mon Sep 17 09:50:33 2001
@@ -153,8 +153,6 @@
 void lru_cache_add(struct page * page)
 {
 	spin_lock(&pagemap_lru_lock);
-	if (!PageLocked(page))
-		BUG();
 	add_page_to_inactive_dirty_list(page);
 	page->age = 0;
 	spin_unlock(&pagemap_lru_lock);
@@ -176,7 +174,7 @@
 	} else if (PageInactiveClean(page)) {
 		del_page_from_inactive_clean_list(page);
 	} else {
-		printk("VM: __lru_cache_del, found unknown page ?!\n");
+//		printk("VM: __lru_cache_del, found unknown page ?!\n");
 	}

 	DEBUG_ADD_PAGE
 }
@@ -187,8 +185,6 @@
  */
 void lru_cache_del(struct page * page)
 {
-	if (!PageLocked(page))
-		BUG();
 	spin_lock(&pagemap_lru_lock);
 	__lru_cache_del(page);
 	spin_unlock(&pagemap_lru_lock);
diff -u --recursive --new-file penguin/linux/mm/swap_state.c linux/mm/swap_state.c
--- penguin/linux/mm/swap_state.c	Mon Sep 17 09:22:57 2001
+++ linux/mm/swap_state.c	Mon Sep 17 09:42:20 2001
@@ -147,6 +147,10 @@
  */
 void free_page_and_swap_cache(struct page *page)
 {
+	if (page_count(page) == 1 && !page->mapping) {
+		lru_cache_del(page);
+	}
+
 	/*
 	 * If we are the only user, then try to free up the swap cache.
 	 *

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17 17:33 ` Linus Torvalds
@ 2001-09-17 18:07 ` Linus Torvalds
  2001-09-18 12:09 ` Rik van Riel
  1 sibling, 0 replies; 76+ messages in thread
From: Linus Torvalds @ 2001-09-17 18:07 UTC (permalink / raw)
To: Rik van Riel; +Cc: Daniel Phillips, linux-kernel

On Mon, 17 Sep 2001, Linus Torvalds wrote:
>
> NOTE NOTE NOTE! This is _literally_ a 15-minute hack, and I expect that
> there are paths where I forget to remove the page from the LRU queue
> (which should result in a nice big oops in __free_pages_ok()).

Actually, the most common failure mode seems to be that we have plenty of
inactive pages (all the anonymous pages that we added to the LRU list and
thus to the statistics). And because we have tons of these pages, the VM
scanning is never even started, because do_try_to_free_pages() thinks that
it can just launder them. Which means that we'll never get rid of them.
Oops.

So it's easy adding anonymous pages to the LRU lists per se, but it
obviously needs some more work to make the scanners be aware of the fact
that they are there...

(I suspect that the easiest way to make them be aware of the anonymous
pages is to have a bogus address space associated with the anonymous
pages, with no actual hashing going on. And then make that address space
have a "writepage()" function that turns an anonymous page into a swap
cache page. But I was hoping to get off more easily ;).

Linus

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9 2001-09-17 17:33 ` Linus Torvalds 2001-09-17 18:07 ` Linus Torvalds @ 2001-09-18 12:09 ` Rik van Riel 1 sibling, 0 replies; 76+ messages in thread From: Rik van Riel @ 2001-09-18 12:09 UTC (permalink / raw) To: Linus Torvalds; +Cc: Daniel Phillips, linux-kernel On Mon, 17 Sep 2001, Linus Torvalds wrote: > Well, here's a 10-line patch to make the anonymous pages get on the LRU > queues, and thus get aged along with all the others. The problem is that they will still get aged DOWN all the time, even if they are accessed continuously by the process which owns the page.... regards, Rik -- IA64: a worthy successor to i860. http://www.surriel.com/ http://distro.conectiva.com/ Send all your spam to aardvark@nl.linux.org (spam digging piggy) ^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-16 19:43 ` Linus Torvalds
  ` (3 preceding siblings ...)
  2001-09-17  0:37 ` Daniel Phillips
@ 2001-09-21  3:10 ` Bill Davidsen
  4 siblings, 0 replies; 76+ messages in thread
From: Bill Davidsen @ 2001-09-21  3:10 UTC (permalink / raw)
To: linux-kernel

On Sun, 16 Sep 2001, Linus Torvalds wrote:

> In article <Pine.LNX.4.33L.0109161330000.9536-100000@imladris.rielhome.conectiva>,
> Rik van Riel <riel@conectiva.com.br> wrote:
> >On 16 Sep 2001, Michael Rothwell wrote:
> >
> >> Is there a way to tell the VM to prune its cache? Or a way to limit
> >> the amount of cache it uses?
> >
> >Not yet, I'll make a quick hack for this when I get back next
> >week. It's pretty obvious now that the 2.4 kernel cannot get
> >enough information to select the right pages to evict from
> >memory.
>
> Don't be stupid.
>
> The described behaviour has nothing to do with limiting the cache or
> anything else "cannot get enough information", except for the fact that
> the kernel obviously cannot know what will happen in the future.

I think that's very harsh, because while the kernel can't predict the
future, in many cases the sysadmin can, and some of the tools used to act
on that information are now gone due to "enhancement." Most particularly,
the free-pages target is no longer settable, leaving the admin to diddle
with *unrelated* things trying to get correct behaviour, instead of
setting the required free memory and letting the kernel balance buffers,
cache, etc. So if I know I have an application which will suddenly need
12MB to maintain good response, I have lost my tool for simply telling the
system that much free memory is needed. And honestly, the fact that the
kernel makes good overall choices pales when the worst case is so
blatantly bad.

> The kernel _correctly_ swapped out tons of pages that weren't touched in
> a long long time. That's what you want to happen - the fact that they
> then all became active on logout is sad.

It reflects poor decisions in the kernel.
To balance program and i/o pages, the kernel should track the i/o rate
while increasing the cache used. When the i/o rate stops getting better,
the kernel should assume that the program is not reusing the data pages at
this time. Obviously this needs hysteresis to keep the program vs. data
ratio from changing too fast after some good initial setting, but having a
file copy or CD rip push programs out of memory shows that the kernel is
not making optimal use of the information it has.

> The fact that the "use-once" logic didn't kick in is the problem. It's
> hard to tell _why_ it didn't kick in, possibly because the MP3 player
> read small chunks of the pages (touching them multiple times).
>
> THAT is worth looking into. But blathering about "reverse mappings will
> help this" is just incredibly stupid. You seem to think that they are a
> panacea for all problems, ranging from MP3 playback to world peace and
> re-building the WTC.

Sorry, I think the problem is that the existing logic is just not working.
When you trade a small gain in overall performance for a really bad worst
case, you are balancing a gain which is measured rather than felt against
a loss which is instantly painful. Please rethink: the use-once idea is
elegant, but it just doesn't work, and until the kernel makes some effort
to avoid paging out text for data when it doesn't help performance you
will have these ugly pauses.

I will note that we were doing just this type of space balancing in 1968
in GECOS (as in the arcane GECOS password field).

Hopefully you will find this criticism constructive...

--
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

^ permalink raw reply	[flat|nested] 76+ messages in thread
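The balancing heuristic Bill proposes above can be sketched as follows. This is purely illustrative (nothing like it exists in the 2.4 kernel, and all names are made up): grow the cache while each growth step still improves the measured I/O rate by some margin; once growing stops paying, assume the cached data is not being reused and stop stealing program pages. The required-improvement margin is what provides the hysteresis:

```c
#include <assert.h>

/* Track the I/O rate observed at the previous cache size and decide
 * whether growing the cache further is still worthwhile. */
struct cache_tuner {
	double last_rate;	/* I/O rate (e.g. MB/s) at the previous cache size */
	double min_gain;	/* fractional improvement required to keep growing */
};

/* Call after each cache-growth step with the newly measured I/O rate.
 * Returns 1 if the cache should be grown further, 0 to hold it steady
 * (the program-vs-data ratio then stops shifting toward data). */
int cache_should_grow(struct cache_tuner *t, double current_rate)
{
	int grow = current_rate > t->last_rate * (1.0 + t->min_gain);

	t->last_rate = current_rate;
	return grow;
}
```

With `min_gain` around 5%, a file copy or CD rip would stop expanding the cache almost immediately, because caching read-once data does not improve the observed I/O rate.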
* Re: broken VM in 2.4.10-pre9
  2001-09-16 16:33 ` Rik van Riel
  ` (3 preceding siblings ...)
  2001-09-16 19:43 ` Linus Torvalds
@ 2001-09-17  8:06 ` Eric W. Biederman
  2001-09-17 12:12 ` Rik van Riel
  4 siblings, 1 reply; 76+ messages in thread
From: Eric W. Biederman @ 2001-09-17  8:06 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 16 Sep 2001, Michael Rothwell wrote:
>
> > Is there a way to tell the VM to prune its cache? Or a way to limit
> > the amount of cache it uses?
>
> Not yet, I'll make a quick hack for this when I get back next
> week. It's pretty obvious now that the 2.4 kernel cannot get
> enough information to select the right pages to evict from
> memory.

Hmm. Perhaps, or perhaps it is using the information poorly.

There is an alternative approach to have better aging information. An
address_space can be allocated per mm_struct, and all of the anonymous
pages can be allocated to that address_space. The address_space can then
have an array, or better a tree, of extents that list which indexes
correspond to which swap pages, with some pages not being backed.

Getting the allocation of indices correct so that merging will work is a
little trickier than it is now, as is the case of a private writeable
mapping of a file. But in a lot of other ways the logic becomes simpler.

> For 2.5 I'm making a VM subsystem with reverse mappings, the
> first iterations are giving very sweet performance so I will
> continue with this project regardless of what other kernel
> hackers might say ;)

Do you have any arguments for the reverse mappings, or just for some of
the other side effects that go along with them?

Eric

^ permalink raw reply	[flat|nested] 76+ messages in thread
* Re: broken VM in 2.4.10-pre9
  2001-09-17  8:06 ` Eric W. Biederman
@ 2001-09-17 12:12   ` Rik van Riel
  2001-09-17 15:45     ` Eric W. Biederman
  0 siblings, 1 reply; 76+ messages in thread
From: Rik van Riel @ 2001-09-17 12:12 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: linux-kernel, linux-mm

On 17 Sep 2001, Eric W. Biederman wrote:

> There is an alternative approach to have better aging information.

[snip incomplete description of data structure]

What you didn't explain is how your idea is related to aging.

> > For 2.5 I'm making a VM subsystem with reverse mappings, the
> > first iterations are giving very sweet performance so I will
> > continue with this project regardless of what other kernel
> > hackers might say ;)
>
> Do you have any arguments for the reverse mappings or just for some of
> the other side effects that go along with them?

Mainly for the side effects, but until somebody comes up with another
idea to achieve all the side effects, I'm not giving up on reverse
mappings. If you can achieve all the good stuff in another way, show
it.

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* Re: broken VM in 2.4.10-pre9
  2001-09-17 12:12 ` Rik van Riel
@ 2001-09-17 15:45   ` Eric W. Biederman
  0 siblings, 0 replies; 76+ messages in thread
From: Eric W. Biederman @ 2001-09-17 15:45 UTC (permalink / raw)
To: Rik van Riel; +Cc: linux-kernel, linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

> On 17 Sep 2001, Eric W. Biederman wrote:
>
> > There is an alternative approach to have better aging information.
>
> [snip incomplete description of data structure]
>
> What you didn't explain is how your idea is related to
> aging.

Sorry, I thought you had been staring at the problem long enough to
see.

In any case, the problem with the current code is that you can't put
all pages in the swap cache immediately, because you don't want to
allocate the swap space just yet. And without being in the swap cache,
aging isn't especially effective.

By using something like a shared memory segment behind every anonymous
page, you can put the page in the swap cache before you allocate swap
for it (because it has a persistent identity). Further, since you no
longer need counts for every swap page, you can deallocate swap space
from pages simply by walking through the ``indirect pages'' and
removing the reference to swap space.

> > > For 2.5 I'm making a VM subsystem with reverse mappings, the
> > > first iterations are giving very sweet performance so I will
> > > continue with this project regardless of what other kernel
> > > hackers might say ;)
> >
> > Do you have any arguments for the reverse mappings or just for some of
> > the other side effects that go along with them?
>
> Mainly for the side effects, but until somebody comes
> up with another idea to achieve all the side effects I'm
> not giving up on reverse mappings. If you can achieve
> all the good stuff in another way, show it.

I think I can; I haven't had time to implement it. Given the way Alan
and some of the others were talking, I thought my idea had long ago
been thought of and put on the plate for 2.5. If it really is a new
idea under the sun, I'll look at implementing it as soon as I have a
hole in my schedule.

Eric
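[The key property Biederman relies on above is that an anonymous page
gets a persistent identity (mapping, index) at creation time, so it
can enter the swap cache for aging before any swap slot exists, and
swap can later be freed by just dropping the back-reference. The
following is an illustrative user-space model of that lifecycle; all
names are invented and nothing here is actual kernel code.]

```python
class AnonPage:
    """Toy model of an anonymous page with a persistent identity.

    The identity is (per-mm address_space, index) and exists from the
    moment the page is created, so the page can be inserted into the
    cache for aging *before* any swap slot is allocated. The swap
    slot is an optional, detachable backing.
    """
    def __init__(self, mapping, index):
        self.mapping = mapping   # persistent identity, half 1
        self.index = index       # persistent identity, half 2
        self.swap_slot = None    # allocated lazily, only when laundering

page_cache = {}

def cache_add(page):
    # possible immediately at creation: no swap slot is required
    page_cache[(page.mapping, page.index)] = page

def launder(page, allocate_slot):
    page.swap_slot = allocate_slot()   # swap space allocated only now

def release_swap(page):
    # "walk the indirect pages and remove the reference": no per-slot
    # reference counts are needed, just drop the back-pointer
    page.swap_slot = None
```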
* Re: Linux VM design
@ 2001-09-25 11:00 VDA
  2001-09-25 11:07 ` Rik van Riel
  0 siblings, 1 reply; 76+ messages in thread
From: VDA @ 2001-09-25 11:00 UTC (permalink / raw)
To: Andrea Arcangeli, Rik van Riel, Alexander Viro, Daniel Phillips
Cc: linux-kernel

Hi VM folks,
I'd like your comments on this (rather simplistic) hypothetical
VM description. If you know what's wrong with it, please tell me.

A ram page can be:

free
  Not used by anything.
clean non-backed
  Initially allocated page. All such pages share a global zero-filled
  readonly page. On write, COW magic makes a dirty non-backed copy.
dirty non-backed
  This ram page has a copy neither in swap nor in the fs. Under light
  i/o load the page laundering machinery _may_ allocate swap for it
  and write it back, but only after a timeout. There is no point in
  writing back too early or too often. This laundering can be turned
  off completely without much harm.
clean swap-backed
  This ram page has a copy on swap. It is not modified (either swapped
  in or already laundered). Note: as soon as it gets dirty, it becomes
  dirty *non-backed* and the swap page is freed.
dirty swap-backed
  This ram page was modified some time ago, became dirty non-backed,
  and is being written back right now to a newly allocated swap page
  (laundering, or evicting an LRU ram page under memory pressure). A
  temporary stage; it turns clean swap-backed as soon as the write
  completes.
clean fs-backed
  This ram page is mmapped from a file and is not modified, or is
  already written back.
dirty fs-backed
  This ram page is mmapped from a file and is modified. It needs to be
  written back within a reasonable timeout to keep fs data consistent.

How the LRU works:

All non-free ram pages are in an LRU list. The top ram page on the
list is the most recently accessed one; the bottom one is the least
recently accessed one. The VM periodically scans all process pte's
looking for 'accessed' and 'dirty' bits, resets those bits, and
modifies the page status:

Accessed bit set:
  Move the ram page to the top of the LRU list.
Dirty bit set:
  clean non-backed  -  can't happen (the global zero page is RO)
  dirty non-backed  -> dirty non-backed
  clean swap-backed -> dirty non-backed (note: swap page freed!)
  dirty swap-backed -> dirty non-backed (note: complicated case, see below)
  clean fs-backed   -> dirty fs-backed
  dirty fs-backed   -> dirty fs-backed

Complicated case: we are writing a ram page to swap, either due to:
1) evicting an LRU ram page under memory pressure, or
2) laundering a ram page under light i/o load,
and the page gets accessed/dirtied by some process. In the first case
we are in deep trouble; to prevent this we must unmap the ram page
from all processes before evicting. In the second case we are fine,
but the laundering i/o is wasted: the ram page should become dirty
non-backed again and be moved to the top of the LRU list, with the
swap page freed.

Ram page eviction: we evict ram pages when we have no free ram pages
and we have a memory request:
1) a process accesses a not-present (swapped out) page
2) a process accesses a not-present page mmapped from a file
3) a process writes to a clean non-backed ram page and COW needs a
   new ram page to make a copy

The rate of such evictions is a good measure of memory pressure. We
evict the ram page at the bottom of the LRU list by unmapping it from
all processes, writing it back to the fs (or allocating swap and
writing it back to swap) if it is dirty, and then using the freed ram
page to fulfill the memory request.

--
Best regards,
VDA
mailto:VDA@port.imtp.ilyichevsk.odessa.ua
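[VDA's "Dirty bit set" table and the periodic pte scan amount to an
explicit state machine, which can be written out directly. This is an
illustrative user-space sketch of the proposal as described, not of
any real kernel's behaviour; all names are invented.]

```python
# Page states from the description above.
FREE, CLEAN_NB, DIRTY_NB, CLEAN_SWAP, DIRTY_SWAP, CLEAN_FS, DIRTY_FS = (
    "free", "clean non-backed", "dirty non-backed", "clean swap-backed",
    "dirty swap-backed", "clean fs-backed", "dirty fs-backed")

# The "Dirty bit set" table, transcribed. The second element says
# whether the transition frees the page's swap slot.
ON_DIRTY = {
    DIRTY_NB:   (DIRTY_NB, False),
    CLEAN_SWAP: (DIRTY_NB, True),   # note: swap page freed!
    DIRTY_SWAP: (DIRTY_NB, True),   # the complicated case: i/o wasted
    CLEAN_FS:   (DIRTY_FS, False),
    DIRTY_FS:   (DIRTY_FS, False),
}

def scan_pte(state, accessed, dirty, lru, page):
    """One step of the periodic pte scan. Returns the new page state
    and whether the page's swap slot must be freed."""
    free_swap = False
    if accessed:
        lru.remove(page)
        lru.insert(0, page)          # move the page to the top of the LRU
    if dirty:
        if state == CLEAN_NB:
            raise AssertionError("global zero page is read-only")
        state, free_swap = ON_DIRTY[state]
    return state, free_swap
```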
* Re: Linux VM design
  2001-09-25 11:00 Linux VM design VDA
@ 2001-09-25 11:07 ` Rik van Riel
  0 siblings, 0 replies; 76+ messages in thread
From: Rik van Riel @ 2001-09-25 11:07 UTC (permalink / raw)
To: VDA; +Cc: Andrea Arcangeli, Alexander Viro, Daniel Phillips, linux-kernel

On Tue, 25 Sep 2001, VDA wrote:

> I'd like your comments on this (rather simplistic) hypothetical
> VM description. If you know what's wrong with it, please tell me.

It's a nice start, but not quite accurate. You may want to read my
freenix paper and some of the documents on the linux-mm page:

	http://linux-mm.org/
	http://www.surriel.com/lectures/

cheers,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)
end of thread, other threads:[~2001-09-25 16:03 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-16 15:19 broken VM in 2.4.10-pre9 Ricardo Galli
2001-09-16 15:23 ` Michael Rothwell
2001-09-16 16:33   ` Rik van Riel
2001-09-16 16:50     ` Andreas Steinmetz
2001-09-16 17:12       ` Ricardo Galli
2001-09-16 17:06     ` Ricardo Galli
2001-09-16 17:18       ` Jeremy Zawodny
2001-09-16 18:45     ` Stephan von Krawczynski
2001-09-21  3:16       ` Bill Davidsen
2001-09-21 10:21         ` Stephan von Krawczynski
2001-09-21 14:08           ` Bill Davidsen
2001-09-21 14:23             ` Rik van Riel
2001-09-23 13:13               ` Eric W. Biederman
2001-09-23 13:27                 ` Rik van Riel
2001-09-21 10:43         ` Stephan von Krawczynski
2001-09-21 12:13           ` Rik van Riel
2001-09-21 12:55             ` Stephan von Krawczynski
2001-09-21 13:01               ` Rik van Riel
2001-09-22 11:01                 ` Daniel Phillips
2001-09-22 20:05                   ` Rik van Riel
2001-09-24  9:36                     ` Linux VM design VDA
2001-09-24 11:06                       ` Dave Jones
2001-09-24 12:15                         ` Kirill Ratkin
2001-09-24 13:29                         ` Rik van Riel
2001-09-24 14:05                           ` VDA
2001-09-24 14:37                             ` Rik van Riel
2001-09-24 14:42                       ` Rik van Riel
2001-09-24 18:37                         ` Daniel Phillips
2001-09-24 19:32                           ` Rik van Riel
2001-09-24 17:27                       ` Rob Landley
2001-09-24 21:48                         ` Rik van Riel
2001-09-25  9:58                           ` Daniel Phillips
2001-09-25 16:03                             ` bill davidsen
2001-09-24 18:46                       ` Jonathan Morton
2001-09-24 19:16                         ` Daniel Phillips
2001-09-24 19:11                       ` Dan Mann
2001-09-25 10:55                         ` VDA
2001-09-16 18:16 ` broken VM in 2.4.10-pre9 Stephan von Krawczynski
2001-09-16 19:43 ` Linus Torvalds
2001-09-16 19:57   ` Rik van Riel
2001-09-16 20:17     ` Rik van Riel
2001-09-16 20:29       ` Andreas Steinmetz
2001-09-16 21:28     ` Linus Torvalds
2001-09-16 22:47       ` Alex Bligh - linux-kernel
2001-09-16 22:55         ` Linus Torvalds
2001-09-16 22:59       ` Stephan von Krawczynski
2001-09-16 22:14     ` Linus Torvalds
2001-09-16 23:29       ` Stephan von Krawczynski
2001-09-17 15:35       ` Stephan von Krawczynski
2001-09-17 15:51         ` Linus Torvalds
2001-09-17 16:34           ` Stephan von Krawczynski
2001-09-17 16:46             ` Linus Torvalds
2001-09-17 17:20               ` Stephan von Krawczynski
2001-09-17 17:37                 ` Linus Torvalds
2001-09-17  0:37   ` Daniel Phillips
2001-09-17  1:07     ` Linus Torvalds
2001-09-17  2:23       ` Daniel Phillips
2001-09-17  5:11         ` Jan Harkes
2001-09-17 12:33           ` Daniel Phillips
2001-09-17 12:41             ` Rik van Riel
2001-09-17 14:49               ` Daniel Phillips
2001-09-17 16:14             ` Jan Harkes
2001-09-17 16:34               ` Linus Torvalds
2001-09-17 15:38           ` Linus Torvalds
2001-09-17 12:26   ` Rik van Riel
2001-09-17 15:42     ` Linus Torvalds
2001-09-18 12:04       ` Rik van Riel
2001-09-17 17:33   ` Linus Torvalds
2001-09-17 18:07     ` Linus Torvalds
2001-09-18 12:09       ` Rik van Riel
2001-09-21  3:10   ` Bill Davidsen
2001-09-17  8:06 ` Eric W. Biederman
2001-09-17 12:12   ` Rik van Riel
2001-09-17 15:45     ` Eric W. Biederman
2001-09-25 11:00 Linux VM design VDA
2001-09-25 11:07 ` Rik van Riel
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).